Re: Dynamic UDFs support

Neeraja Rentachintala Thu, 21 Jul 2016 07:22:18 -0700

The whole point of this feature, as the name 'Dynamic' UDFs indicates, is to
avoid Drill cluster restarts.
So any design that requires restarts would, I think, defeat the purpose.


I also think this is an example of a feature where we start with a simple
design that serves the purpose, take feedback on how it is being deployed and
used in real user situations, and improve it in subsequent releases.

-thanks
Neeraja

On Thu, Jul 21, 2016 at 6:32 AM, Keys Botzum <[email protected]> wrote:

> I think there are a lot of great ideas here. My one concern is the lack of
> unload and thus presumably replace functionality. I'm just thinking about
> typical actual usage.
>
> In a typical development cycle someone writes something, tries it, learns,
> changes it, and tries again. Assuming I understand the design, that change
> step requires a full Drill cluster restart. That is going to be very
> disruptive and will make UDF work nearly impossible without a dedicated
> "private" cluster for Drill. I realize that people should have access to
> the data they need and Drill in a development cluster but even then
> restarts can be hard since development clusters are often shared - and
> that's assuming such a cluster exists. I realize of course Drill can be run
> as a standalone Drillbit but I'm not convinced that desktops will have
> adequate access to the needed data.
>
> Having dealt with Java classloading over the years, I'm not claiming class
> replacement is an easy thing so I'll defer to others on the priority of
> that, but I'm wondering if there isn't some way to make UDF experimentation
> a bit easier/practical.
>
> Given the above, let me toss out some possibly naive ideas that maybe are
> workable:
> * can I easily run a standalone Drillbit on a Hadoop cluster node that is
> already running Drill servers? I'm sure this can be done, but is it easy?
> Could we perhaps make this clearer as an explicit kind of thing?
> * is there a way that when I deploy a UDF I can constrain the # of bits it
> is loaded into and perhaps even specify the bits?
>   * Obvious corollary is I'd want my query to run on those bits and a
> not too disruptive way to restart just those bits
>
> The above may be obvious to Drill experts. If it is then perhaps the UDF
> docs could just point out how to easily develop UDFs in an iterative
> fashion.
>
> Keys
> _______________________________
> Keys Botzum
> Senior Principal Technologist
> [email protected] <mailto:[email protected]>
> 443-718-0098
> MapR Technologies
> http://www.mapr.com <http://www.mapr.com/>
> > On Jul 21, 2016, at 3:13 AM, Paul Rogers <[email protected]> wrote:
> >
> > Always good to have options… Another is to try an eventual consistency
> model.
> >
> > The invariant here is the one that was mentioned earlier. Whenever a
> query is submitted with UDF U, that query either fails in planning (because
> U is unknown) or succeeds on all nodes (at least with respect to U.)
> >
> > For this to work, we need a consistent view of the world. We can try to
> enforce consistency at function registration time (the original design), or
> via the Foreman (Parth’s design.) We can probably also use an eventual
> consistency model.
> >
> > Suppose we have a global name space of functions. With the global name
> space, we can establish this invariant: If a function is in that name
> space, then the Foreman accepts the query. If a Drillbit receives a
> fragment, but does not yet know of U, then the Drillbit A) knows that some
> foreman must have registered U (or the query would have failed in planning)
> and B) the Drillbit can download the function if not already in place.
> >
> > Folks pointed out that always checking a global name space is expensive,
> which it is. As it turns out, we can first check the local function
> registry. If the Drillbit already knows about the function, we’re done
> checking, no global check needed. It is only on the first use of a new
> function, when it is not yet loaded locally, that the global check must be
> done.
> >
> > For this to work the foreman that registers UDF U must:
> >
> > 1. From Arina’s proposed staging area, check the jar contents to see if
> a name conflict exists with the global registry. (Requires some class
> loader code.)
> > 2. If a conflict exists, refuse to register the function and return an
> error.
> > 3. If no conflict exists, register the function in the global name space
> and move the jar to the registered area in DFS.
> >
> > In this model, it is entirely optional whether the foreman that
> registers U alerts other Drillbits. Instead, Drillbits could poll from time
> to time, or just wait until they see a query with U and do the download at
> that time.
> >
> > When a new Drillbit starts, it can load all functions in the registry
> area because these have all passed the name collision test and can all be
> used in queries. Any new registrations will be found and loaded as above.
> (It is not required to preload functions, but it might help performance.)
> >
> > ZK is the only place we have at present for the global name space, so
> that seems the logical tool. ZK allows atomic operations, which we need
> here. Operations 1, 2, and 3 above should be atomic.
> >
> > Unfortunately, we can’t do the DFS move atomically with a ZK name space
> insertion. So, the global name check & insert should be atomic. If that
> succeeds, copy the jar into the registered folder. There are a few details
> to work out to handle special cases, but we can cover those another time.
> (Hint: what happens if the Foreman crashes after inserting the ZK entry but
> before moving the jar?)
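As a rough illustration only (this is not Drill code), registration steps 1 to 3 above might be sketched as follows. Everything here is an assumption made for the example: `GlobalRegistry` is an in-memory stand-in for the ZK name space, a lock stands in for ZK's atomic operations, and two dicts stand in for the staging and registered DFS folders.

```python
import threading


class GlobalRegistry:
    """In-memory stand-in for the ZK-backed global function name space."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}  # function name -> jar that defines it

    def register(self, jar_name, function_names):
        """Atomically check for conflicts and insert: all names or none."""
        with self._lock:
            conflicts = sorted(n for n in function_names if n in self._owner)
            if conflicts:
                return False, conflicts
            for name in function_names:
                self._owner[name] = jar_name
            return True, []

    def jar_for(self, function_name):
        with self._lock:
            return self._owner.get(function_name)


def register_udf(registry, staging, registered, jar_name, function_names):
    """Check names, insert into the name space, then move the jar.

    `staging` and `registered` are dicts standing in for the DFS folders.
    The move happens after the insert and is NOT atomic with it.
    """
    ok, conflicts = registry.register(jar_name, function_names)
    if not ok:
        return "rejected: name conflict on " + ", ".join(conflicts)
    registered[jar_name] = staging.pop(jar_name)
    return "registered"
```

Because the jar move follows the atomic insert, a crash between the two leaves a name registered without its jar, which is the special case hinted at above.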
> >
> > None of the proposed designs permit graceful unloading of functions. So,
> deleting functions will require a cluster restart to establish a new stable
> checkpoint.
> >
> > We can recommend that on each cluster restart, any functions in the DFS
> registry be copied to each Drillbit (much easier with the coming YARN
> integration) as a way of keeping the DFS registry a reasonable size.
> >
> > More details to work out, but that’s the gist of the concept.
> >
> > Thanks,
> >
> > - Paul
> >
> >> On Jul 20, 2016, at 2:37 PM, Parth Chandra <[email protected]>
> wrote:
> >>
> >> My notes from the hangout with Arina and Paul -
> >>
> >> Notes -
> >>
> >> There are two invariants for the registration process -
> >> 1) There is a registration/validated directory in the DFS that contains
> >> UDFs that have been validated by the registering foreman. All drillbits
> >> will have access to this directory and on startup and/or UDF
> registration,
> >> the jars in this directory are sync'd up with a local UDF directory
> >> 2) During the process of registration, the registering foreman creates a
> >> Zookeeper node that indicates that one or more drillbits have not yet
> >> registered the UDF.
> >>
> >> The basic workflow is that UDF jars are copied from the staging
> directory
> >> to the registration directory and validated. Once they are validated,
> the
> >> available drillbits are told to register the UDF. Registering the UDF
> >> consists of copying the jar to a local UDF directory and updating the
> >> local (in-memory) udf registry. A sentinel node in zookeeper is used to
> >> track when all the drillbits have registered the UDF.
> >>
> >> There were two main suggestions: immediate registration and lazy
> >> registration.
> >>
> >> Immediate registration -
> >> Foreman tells all drillbits to register. Creates a Zookeeper node to
> >> track.
> >> Every drillbit makes a local copy and updates zookeeper node to show it
> >> is done.
> >> Foreman checks the zookeeper node and when all available drillbits have
> >> acknowledged, sends a message to all drillbits to complete registration.
> >>  Foreman removes ZK node.
> >>  All Drillbits update their local UDF registry
> >>  Drillbit startup will block if there is a ZK node indicating
> >> registration is in progress.
> >>  This approach needs to be validated to see if any race conditions
> exist.
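The immediate-registration flow above could be modeled, very roughly, like this. It is a toy in-memory sketch, not Drill or Zookeeper code: `SentinelNode`, `Drillbit`, the `up` flag and `register_immediately` are all hypothetical names invented for the example.

```python
class SentinelNode:
    """Stand-in for the Zookeeper node tracking an in-progress registration."""

    def __init__(self, jar_name, drillbit_names):
        self.jar_name = jar_name
        self.pending = set(drillbit_names)  # drillbits that have not acknowledged

    def acknowledge(self, drillbit_name):
        self.pending.discard(drillbit_name)

    def all_acknowledged(self):
        return not self.pending


class Drillbit:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up               # False simulates an unresponsive drillbit
        self.local_jars = set()    # jars copied to the local UDF directory
        self.udf_registry = set()  # jars whose functions queries may use

    def copy_jar(self, sentinel):
        if not self.up:
            return  # no acknowledgment from an unresponsive drillbit
        self.local_jars.add(sentinel.jar_name)
        sentinel.acknowledge(self.name)

    def complete_registration(self, jar_name):
        self.udf_registry.add(jar_name)


def register_immediately(jar_name, drillbits):
    """Return None when registration completes, else the live sentinel."""
    sentinel = SentinelNode(jar_name, [d.name for d in drillbits])
    for d in drillbits:
        d.copy_jar(sentinel)
    if not sentinel.all_acknowledged():
        return sentinel  # still in progress: nobody updates its registry yet
    for d in drillbits:
        d.complete_registration(jar_name)
    return None  # the foreman removes the sentinel node
```

A real implementation would wait on the sentinel rather than check it once, and that waiting is where the race conditions mentioned above would need to be examined.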
> >>
> >> Lazy registration
> >>  Once a UDF is copied to the registration folder, the UDF is essentially
> >> registered. On first use, a drillbit may hit a ClassNotFoundException,
> >> in which case it will look for the UDF in the registration directory. If
> >> found, it will copy the jar to the local directory and add the UDF to its
> >> local registry.
> >>  This approach should be investigated to see if it fits in with the
> >> current UDF execution code.
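A minimal sketch of the lazy path, assuming plain local directories stand in for the DFS registration directory and the drillbit's local UDF directory. `LazyUdfLoader` and `resolve` are hypothetical names; how this would hook into Drill's actual UDF execution code is exactly the open question noted above.

```python
import shutil
from pathlib import Path


class LazyUdfLoader:
    """Fetch a UDF jar from the registration directory on first use."""

    def __init__(self, registration_dir, local_dir):
        self.registration_dir = Path(registration_dir)
        self.local_dir = Path(local_dir)
        self.local_registry = {}  # jar name -> path of the local copy

    def resolve(self, jar_name):
        """Return the local path for jar_name, copying it if needed."""
        if jar_name in self.local_registry:
            return self.local_registry[jar_name]  # fast path: already local
        source = self.registration_dir / jar_name
        if not source.is_file():
            # Not in the registration directory, so it was never registered.
            raise LookupError(f"UDF jar {jar_name} is not registered")
        self.local_dir.mkdir(parents=True, exist_ok=True)
        target = self.local_dir / jar_name
        shutil.copyfile(source, target)
        self.local_registry[jar_name] = target
        return target
```

The shared directory is consulted only on a local miss, so repeated use of an already-loaded UDF costs one dictionary lookup.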
> >>
> >>
> >> On Mon, Jul 18, 2016 at 3:36 PM, Parth Chandra <[email protected]>
> >> wrote:
> >>
> >>> +1 on simplifying the design and postponing the items Paul has suggested.
> >>>
> >>> Arina, Paul, I think we need to work out some of the design related to
> >>> registering the UDF. Are you guys open for a quick hangout @10 a.m PDT
> >>> tomorrow?
> >>>
> >>>
> >>>
> >>> On Thu, Jul 14, 2016 at 1:46 PM, Paul Rogers <[email protected]>
> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> We’ve had quite a lively debate in the “comments” section of Arina’s
> >>>> wonderful design doc. Zelaine made a great suggestion: summarize the
> user
> >>>> experience as a way of making sense of the wealth of detailed
> comments.
> >>>>
> >>>> IMHO, the most important user experience goals are:
> >>>>
> >>>> 1. When a user submits a CREATE FUNCTION command, the command returns
> >>>> quickly (within a few seconds at most.)
> >>>> 2. If the above user then issues a query using that function (to the
> same
> >>>> Foreman), that query is guaranteed to successfully use the new
> function on
> >>>> all nodes.
> >>>> 3. Other users, connecting to any Foreman will see a very clean
> behavior
> >>>> when submitting a query with the new function. Before some point in
> time
> >>>> (can be different for each Foreman), a query with the function fails
> in
> >>>> planning. After that point, queries are guaranteed to successfully
> use the
> >>>> new function on all nodes.
> >>>>
> >>>> Basically, this says that CREATE FUNCTION can’t (potentially) take a
> long
> >>>> time. Use of functions can’t result in random failures during the
> time that
> >>>> the function is propagated across Drillbits.
> >>>>
> >>>> The goals we can perhaps postpone are:
> >>>>
> >>>> 1. Class name space isolation. (Allows two data scientists to define
> the
> >>>> same class without collisions.)
> >>>> 2. Function name spaces. (Allows me to define “paul.foo” and you to
> >>>> define “bob.foo” without collisions.) (Needed if many people develop
> >>>> functions independently. Else, we need a global name space.)
> >>>> 3. Dynamic DROP FUNCTION operation. (The issues here are messy, and it
> >>>> requires unloading classes and name space cleanup.) (Just let the
> cleanup
> >>>> happen offline.)
> >>>> 4. Dependency jars (e.g. third party libraries, etc.) (We require
> those
> >>>> to be statically added to the class path before Drill starts.)
> >>>>
> >>>> We are not creating per-user name spaces, or allowing people to use
> >>>> production clusters to try/revise functions. We’re just simplifying
> deployment
> >>>> of simple functions.
> >>>>
> >>>> That’s my suggestion, what do others suggest?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> - Paul
> >>>>
> >>>>> On Jul 7, 2016, at 12:32 PM, Arina Yelchiyeva <
> >>>> [email protected]> wrote:
> >>>>>
> >>>>> I also agree on using Zookeeper. I have reworked the dynamic UDF support
> >>>>> document to take Zookeeper usage into account.
> >>>>>
> >>>>> Link to the document -
> >>>>>
> >>>>
> https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit
> >>>>>
> >>>>> Kind regards
> >>>>> Arina
> >>>>>
> >>>>> On Tue, Jun 28, 2016 at 12:55 AM Paul Rogers <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>>> Great idea! We already use ZK to track storage plugins. ZK is
> perhaps
> >>>>>> better suited to register each jar and/or function than using files
> in
> >>>> DFS.
> >>>>>> Still need to work out the proper sequencing. But you are right,
> this
> >>>> is
> >>>>>> the kind of thing that ZK is supposed to solve.
> >>>>>>
> >>>>>> - Paul
> >>>>>>
> >>>>>>
> >>>>>>> On Jun 27, 2016, at 2:01 PM, Parth Chandra <[email protected]>
> wrote:
> >>>>>>>
> >>>>>>> Reading thru some of Paul's comments on maintaining a consistent
> state
> >>>>>> for
> >>>>>>> the registration of the UDF, it looks like we need a consensus
> >>>> protocol
> >>>>>> for
> >>>>>>> determining that all the Drillbits have the UDF deployed.
> >>>>>>> I believe Zookeeper can provide a stronger guarantee than a 2 phase
> >>>>>>> approach. Should we look into that?
> >>>>>>>
> >>>>>>> On Fri, Jun 24, 2016 at 10:00 AM, Arina Yelchiyeva <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi all!
> >>>>>>>>
> >>>>>>>> I have updated the design document.
> >>>>>>>> Main changes:
> >>>>>>>> 1. Add the staging and registration DFS locations to Drill’s config.
> >>>>>>>> 2. The user is no longer responsible for copying jars onto drillbit
> >>>>>>>> nodes. Now the user needs to copy jars into the staging DFS location,
> >>>>>>>> from where drillbits will copy them to the local fs.
> >>>>>>>> 3. During UDF registration, jars will be moved to the DFS registration
> >>>>>>>> area.
> >>>>>>>> 4. During startup, a drillbit will copy all jars from the registration
> >>>>>>>> area, so a newly added drillbit will have the same UDFs as the others.
> >>>>>>>> 5. Security issues - these will probably be added later as an
> >>>>>>>> enhancement.
> >>>>>>>>
> >>>>>>>> More details in the document:
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit
> >>>>>>>>
> >>>>>>>> Kind regards
> >>>>>>>> Arina
> >>>>>>>>
> >>>>>>>> On Fri, Jun 17, 2016 at 1:25 AM Paul Rogers <[email protected]
> >
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi All,
> >>>>>>>>>
> >>>>>>>>> To answer Arina on item 3: there is actually no good location on
> any
> >>>>>>>> local
> >>>>>>>>> node to put the UDFs. Reason: DoY allows the admin to start a
> >>>> Drillbit
> >>>>>> on
> >>>>>>>>> any available node. When it starts, a new, fresh copy of Drill
> will
> >>>> be
> >>>>>>>>> downloaded, and this can happen after the user issued the CREATE
> >>>>>> command.
> >>>>>>>>>
> >>>>>>>>> What we need is a shared, secure distributed storage location
> from
> >>>>>> which
> >>>>>>>>> Drillbits can download the needed jar files. Something like… DFS!
> >>>>>> Indeed,
> >>>>>>>>> this is how YARN stores the Drill archive from which it creates
> the
> >>>>>> Drill
> >>>>>>>>> install directory on each node. We can’t quite use YARN’s
> mechanism
> >>>>>> (YARN
> >>>>>>>>> is aware only of the files uploaded when launching an app), but
> we
> >>>> can
> >>>>>> do
> >>>>>>>>> something similar.
> >>>>>>>>>
> >>>>>>>>> So, brainstorming a bit…
> >>>>>>>>>
> >>>>>>>>> 1. Store the UDF jar in a pre-defined DFS location.
> >>>>>>>>>
> >>>>>>>>> 2. The CREATE function 1) uploads the jar to the DFS location,
> and
> >>>> 2)
> >>>>>>>>> creates some kind of registry entry.
> >>>>>>>>>
> >>>>>>>>> 3. The DELETE function 1) deregisters the jar (and function),
> but 2)
> >>>>>> does
> >>>>>>>>> not delete the jar (this allows in-flight queries to complete.)
> >>>>>>>>>
> >>>>>>>>> 4. Drillbits periodically check DFS for changed registrations,
> >>>>>>>>> downloading any needed jars. (YARN, Spark, Storm and others already
> >>>>>>>>> do something similar.)
> >>>>>>>>>
> >>>>>>>>> 5. Registry check is “forced” when processing a query with a function
> >>>>>>>>> that is not currently registered. (Doing so resolves any possible race
> >>>>>>>>> conditions.)
> >>>>>>>>>
> >>>>>>>>> 6. Some process (perhaps time based) removes old, unregistered
> jar
> >>>>>> files.
> >>>>>>>>> (Or, we could get fancy and use reference counts. The reference
> >>>> count
> >>>>>>>> would
> >>>>>>>>> be required if the user wants to delete, then recreate, the same
> >>>>>> function
> >>>>>>>>> and jar to avoid conflict with in-flight queries.)
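The "deregister now, delete the jar later" idea with reference counts might look roughly like the sketch below. It is purely illustrative: `JarStore` and its methods are hypothetical, and sets and dicts stand in for the DFS location and the registry.

```python
class JarStore:
    """In-memory stand-in for the DFS jar location plus its registry."""

    def __init__(self):
        self.files = set()       # jars physically present
        self.registered = set()  # jars whose functions new queries may use
        self.in_flight = {}      # jar -> count of running queries using it

    def create(self, jar):
        self.files.add(jar)
        self.registered.add(jar)

    def delete(self, jar):
        # Deregister only; the file stays so in-flight queries can complete.
        self.registered.discard(jar)

    def start_query(self, jar):
        if jar not in self.registered:
            raise LookupError(f"{jar} is not registered")
        self.in_flight[jar] = self.in_flight.get(jar, 0) + 1

    def finish_query(self, jar):
        self.in_flight[jar] -= 1

    def cleanup(self):
        """Remove unregistered jars that no running query references."""
        removable = sorted(
            jar for jar in self.files
            if jar not in self.registered and self.in_flight.get(jar, 0) == 0
        )
        self.files.difference_update(removable)
        return removable
```

A time-based cleanup would replace the reference count with a grace period, which is simpler but gives no guarantee to a long-running query.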
> >>>>>>>>>
> >>>>>>>>> We can build security on this as follows:
> >>>>>>>>>
> >>>>>>>>> 1. Define permissions for who can write to the DFS location. Or,
> >>>>>> indeed,
> >>>>>>>>> have subdirectories by user and grant each user permission only
> on
> >>>>>> their
> >>>>>>>>> own UDF directory.
> >>>>>>>>>
> >>>>>>>>> 2. Provide separate registries for per-user functions (private)
> and
> >>>>>>>> global
> >>>>>>>>> functions (public). Only the admin can add global functions. But,
> >>>> only
> >>>>>>>> the
> >>>>>>>>> user that uploads a private function can use it.
> >>>>>>>>>
> >>>>>>>>> 3. Leverage the Java class loader to isolate UDFs in their own
> name
> >>>>>> space
> >>>>>>>>> (see Eclipse & Tomcat for examples). That is, Drill can call
> into a
> >>>>>> UDF,
> >>>>>>>>> UDFs can call selected Drill code, but UDFs can’t shadow Drill
> >>>> classes
> >>>>>>>>> (accidentally or maliciously.) Plus, my function Foo won’t clash
> >>>> with
> >>>>>>>> your
> >>>>>>>>> function Foo if both are private.
> >>>>>>>>>
> >>>>>>>>> Sorry that this has wandered a bit far from the original simple
> >>>> design,
> >>>>>>>>> but the above may capture much of what folks expect in modern
> >>>>>> distributed
> >>>>>>>>> big data systems.
> >>>>>>>>>
> >>>>>>>>> I wonder if a good next step might be to review the notes in the
> >>>> design
> >>>>>>>>> doc, in the JIRA, and in this e-mail chain and to prepare a
> summary
> >>>> of
> >>>>>>>>> technical requirements, and a proposed design. Postpone, at least
> >>>> for
> >>>>>>>> now,
> >>>>>>>>> concerns about the amount of work; we can worry about that once
> >>>> folks
> >>>>>>>> agree
> >>>>>>>>> on your revised design.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> - Paul
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Jun 21, 2016, at 9:48 AM, Arina Yelchiyeva <
> >>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> 4. Authorization model mentioned by Julia and John
> >>>>>>>>>> If the user doesn't have rights to copy jars to the UDF classpath,
> >>>>>>>>>> which can be restricted by the file system, he won't be able to do
> >>>>>>>>>> much harm by running the CREATE command. If UDFs from the jar were
> >>>>>>>>>> already registered, the CREATE statement will fail. CREATE OR REPLACE
> >>>>>>>>>> will just re-register the UDFs.
> >>>>>>>>>> But the DELETE command is not safe. If the user knows the jar name, he
> >>>>>>>>>> can delete all UDFs associated with it, as well as the binary and
> >>>>>>>>>> source jars. That's where we'll probably need to impose restrictions.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jun 21, 2016 at 7:34 PM Arina Yelchiyeva <
> >>>>>>>>> [email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> 1. DELETE command - I forgot to mention it in the document but had it
> >>>>>>>>>>> in mind. When the user issues a DELETE command, all UDFs associated
> >>>>>>>>>>> with the indicated jar are removed from DrillFunctionRegistry. Then
> >>>>>>>>>>> the binary and source files are also deleted from the UDF classpath.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Distribution race condition described by Paul
> >>>>>>>>>>> The user issues the CREATE command and gets confirmation that the UDFs
> >>>>>>>>>>> are registered only if all drillbits have confirmed that registration
> >>>>>>>>>>> was successful.
> >>>>>>>>>>> I don't expect the user to start using UDFs in queries before the
> >>>>>>>>>>> CREATE command returns success or failure, which is possible but
> >>>>>>>>>>> strange.
> >>>>>>>>>>>
> >>>>>>>>>>> 3. DoY
> >>>>>>>>>>> @Paul
> >>>>>>>>>>> What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we
> >>>>>>>>>>> use a $DRILL_UDF environment variable that is set during drillbit
> >>>>>>>>>>> start (like $DRILL_LOG_DIR)? The location stored in this variable
> >>>>>>>>>>> would be added to the Drill classpath during start.
> >>>>>>>>>>> Would that ease DoY integration somehow?
> >>>>>>>>>>>
> >>>>>>>>>>> Kind regards
> >>>>>>>>>>> Arina
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman
> >>>>>>>>> <[email protected]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Just thoughts:
> >>>>>>>>>>>> You can try to reuse the distributed cache. Let the Drill AM do the
> >>>>>>>>>>>> needful in terms of orchestrating UDF jar distribution.
> >>>>>>>>>>>> But
> >>>>>>>>>>>> I would be inclined to have a common path that is independent
> of
> >>>> the
> >>>>>>>>> fact
> >>>>>>>>>>>> that it is Drill on YARN or not, as maintaining two separate
> >>>> ways of
> >>>>>>>>>>>> dealing with loading/unloading UDFs will be painful and error
> >>>> prone.
> >>>>>>>>>>>> One more note (I left a comment in the doc) - not sure about
> >>>>>>>>>>>> authorization model here - we need to have some.
> >>>>>>>>>>>> Just my 2c. Thanks
> >>>>>>>>>>>>
> >>>>>>>>>>>>  From: Paul Rogers <[email protected]>
> >>>>>>>>>>>> To: "[email protected]" <[email protected]>
> >>>>>>>>>>>> Sent: Monday, June 20, 2016 7:32 PM
> >>>>>>>>>>>> Subject: Re: Dynamic UDFs support
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Neeraja,
> >>>>>>>>>>>>
> >>>>>>>>>>>> The proposal calls for the user to copy the jar file to each
> >>>>>> Drillbit
> >>>>>>>>>>>> node. The jar would go into a new
> $DRILL_HOME/jars/3rdparty/udf
> >>>>>>>>> directory.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In Drill-on-YARN (DoY), YARN is responsible for copying Drill
> >>>> code
> >>>>>> to
> >>>>>>>>>>>> each node (which is good.) YARN puts that code in a location
> >>>> known
> >>>>>>>>> only to
> >>>>>>>>>>>> YARN. Since the location is private to YARN, the user can’t
> >>>> easily
> >>>>>>>> hunt
> >>>>>>>>>>>> down the location in order to add the udf jar. Even if the
> user
> >>>> did
> >>>>>>>>> find
> >>>>>>>>>>>> the location, the next Drillbit to start would create a new
> copy
> >>>> of
> >>>>>>>> the
> >>>>>>>>>>>> Drill software, without the udf jar.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Second, in DoY we have separated user files from Drill
> software.
> >>>>>> This
> >>>>>>>>>>>> makes it much easier to distribute the software to each node:
> we
> >>>>>> give
> >>>>>>>>> the
> >>>>>>>>>>>> Drill distribution tar archive to YARN, and YARN copies it to
> >>>> each
> >>>>>>>>> node and
> >>>>>>>>>>>> untars the Drill files. We make a separate copy of the (far
> >>>> smaller)
> >>>>>>>>> set of
> >>>>>>>>>>>> user config files.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If the udf jar goes into a Drill folder
> >>>>>>>>> ($DRILL_HOME/jars/3rdparty/udf),
> >>>>>>>>>>>> then the user would have to rebuild the Drill tar file each
> time
> >>>>>> they
> >>>>>>>>> add a
> >>>>>>>>>>>> udf jar. When I tried this myself when building DoY, I found
> it
> >>>> to
> >>>>>> be
> >>>>>>>>> slow
> >>>>>>>>>>>> and error-prone.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, the solution is to place the udf code in the new “site”
> >>>>>>>> directory:
> >>>>>>>>>>>> $DRILL_SITE/jars. That’s what that is for. Then, let DoY
> >>>>>>>> automatically
> >>>>>>>>>>>> distribute the code to every node. Perfect! Except that it
> does
> >>>> not
> >>>>>>>>> work to
> >>>>>>>>>>>> dynamically distribute code after Drill starts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For DoY, the solution requirements are:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Distribute code using Drill itself, rather than manually
> >>>> copying
> >>>>>>>>> jars
> >>>>>>>>>>>> to (unknown) Drill directories.
> >>>>>>>>>>>> 2. Ensure the solution works even if another Drillbit is spun
> up
> >>>>>>>> later,
> >>>>>>>>>>>> and uses the original Drill tar file.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I’m thinking we want to leverage DFS: place udf files into a
> >>>>>>>> well-known
> >>>>>>>>>>>> DFS directory. Register the udf into, say, ZK. When a new
> >>>> Drillbit
> >>>>>>>>> starts,
> >>>>>>>>>>>> it looks for new udf jars in ZK, copies the file to a
> temporary
> >>>>>>>>> location,
> >>>>>>>>>>>> and launches. An existing Drill is notified of the change and
> >>>> does
> >>>>>>>> the
> >>>>>>>>> same
> >>>>>>>>>>>> download process. Clean-up is needed at some point to remove
> ZK
> >>>>>>>>> entries if
> >>>>>>>>>>>> the udf jar becomes statically available on the next launch.
> That
> >>>>>>>> needs
> >>>>>>>>>>>> more thought.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We’d still need the phases mentioned earlier to ensure
> >>>> consistency.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Suggestions anyone as to how to do this super simply & still
> get
> >>>> it
> >>>>>>>> to
> >>>>>>>>>>>> work with DoY?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Paul
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala <
> >>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This will need to work with YARN (Once Drill is YARN
> enabled, I
> >>>>>>>> would
> >>>>>>>>>>>>> expect a lot of users using it in conjunction with YARN).
> >>>>>>>>>>>>> Paul, I am not clear why this wouldn't work with YARN. Can you
> >>>>>>>>>>>>> elaborate?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Neeraja
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <
> >>>> [email protected]
> >>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Good enough, as long as we document the limitation that this
> >>>>>>>> feature
> >>>>>>>>>>>> can’t
> >>>>>>>>>>>>>> work with YARN deployment as users generally do not have
> >>>> access to
> >>>>>>>>> the
> >>>>>>>>>>>>>> temporary “localization” directories where the Drill code is
> >>>>>> placed
> >>>>>>>>> by
> >>>>>>>>>>>> YARN.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Note that the jar distribution race condition issue occurs
> with
> >>>>>> the
> >>>>>>>>>>>>>> proposed design: I believe I sketched out a scenario in one
> of
> >>>> the
> >>>>>>>>>>>> earlier
> >>>>>>>>>>>>>> comments. Drillbit A receives the CREATE FUNCTION command. It tells
> >>>>>>>>>>>>>> Drillbit B. While A is still informing the other Drillbits, Drillbit B
> >>>>>>>>>>>>>> plans and launches a query that uses the function. Drillbit Z starts
> >>>>>>>>>>>>>> execution of the query before it learns from A about the new function.
> >>>>>>>>>>>>>> This will be rare, just rare enough to create very hard to reproduce
> >>>>>>>>>>>>>> bugs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The only reliable solution is to do the work in multiple
> >>>> passes:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Pass 1: Ask each node to load the function, but not make it
> >>>>>>>> available
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>> the planner. (it would be available to the execution
> engine.)
> >>>>>>>>>>>>>> Pass 2: Await confirmation from each node that this is done.
> >>>>>>>>>>>>>> Pass 3: Alert every node that it is now free to plan queries
> >>>> with
> >>>>>>>> the
> >>>>>>>>>>>>>> function.
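To make the three passes concrete, here is a toy model, offered as a sketch under stated assumptions and not as Drill's implementation. `Node`, `load`, `enable_planning` and `create_function` are hypothetical names; the point is only that a function becomes visible to planners after every node can already execute it.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.executable = set()  # functions the execution engine can run
        self.plannable = set()   # functions the planner will accept

    def load(self, fn):
        """Pass 1: load for execution only; the return value is the ack."""
        self.executable.add(fn)
        return True

    def enable_planning(self, fn):
        """Pass 3: expose the function to this node's planner."""
        if fn not in self.executable:
            raise RuntimeError(f"{fn} was never loaded on {self.name}")
        self.plannable.add(fn)

    def can_plan(self, fn):
        return fn in self.plannable


def create_function(nodes, fn):
    acks = [node.load(fn) for node in nodes]  # pass 1: load everywhere
    if not all(acks):                         # pass 2: await confirmation
        return False
    for node in nodes:                        # pass 3: allow planning
        node.enable_planning(fn)
    return True
```

Between pass 1 and pass 3 no planner accepts the function, so the scenario above, where Drillbit Z receives a fragment it cannot execute, cannot arise in this model.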
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Finally, I wonder if we should design the SQL syntax based
> on a
> >>>>>>>>>>>> long-term
> >>>>>>>>>>>>>> design, even if the feature itself is a short-term
> work-around.
> >>>>>>>>>>>> Changing
> >>>>>>>>>>>>>> the syntax later might break scripts that users might write.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So, the question for the group is this: is the value of a
> >>>>>>>>>>>>>> semi-complete feature sufficient to justify the potential problems?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - Paul
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <
> >>>>>> [email protected]
> >>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Moving discussion to dev.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I believe the aim is to do a simple implementation without
> the
> >>>>>>>>>>>> complexity
> >>>>>>>>>>>>>>> of distributing the UDF. I think the document should make
> this
> >>>>>>>>>>>> limitation
> >>>>>>>>>>>>>>> clear.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Per Paul's point on there being a simpler solution of just having
> >>>>>>>>>>>>>>> each drillbit detect whether a UDF is present, I think the problem
> >>>>>>>>>>>>>>> arises if a UDF gets deployed to some but not all drillbits. A query
> >>>>>>>>>>>>>>> can then start executing but not run successfully. The intent of the
> >>>>>>>>>>>>>>> create commands would be to ensure that either all drillbits have the
> >>>>>>>>>>>>>>> UDF or none do.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think Jacques' point about ownership conflicts is not
> >>>> addressed
> >>>>>>>>>>>>>> clearly.
> >>>>>>>>>>>>>>> Also, the unloading is not clear. The delete command should
> >>>>>>>> probably
> >>>>>>>>>>>>>> remove
> >>>>>>>>>>>>>>> the UDF and unload it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <
> >>>>>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Reviewed the spec; many comments posted. Three primary
> >>>> comments
> >>>>>>>> for
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> community to consider.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. The design conflicts with the Drill-on-YARN project. Is
> >>>> this
> >>>>>> a
> >>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>>> fix for one unique problem, or is it worth expanding the
> >>>>>> solution
> >>>>>>>>> to
> >>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>> with Drill-on-YARN deployments? Might be hard to make the
> two
> >>>>>>>> work
> >>>>>>>>>>>>>> together
> >>>>>>>>>>>>>>>> later. See comments in docs for details.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2. Have we, by chance, looked at how other projects handle
> >>>> code
> >>>>>>>>>>>>>>>> distribution? Spark, Storm and others automatically deploy
> >>>> code
> >>>>>>>>>>>> across
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> cluster; no manual distribution to each node. The key
> >>>> difference
> >>>>>>>>>>>> between
> >>>>>>>>>>>>>>>> Drill and others is that, for Storm, say, code is
> associated
> >>>>>>>> with a
> >>>>>>>>>>>> job
> >>>>>>>>>>>>>>>> (“topology” in Storm terms.) But, in Drill, functions are
> >>>> global
> >>>>>>>>> and
> >>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>> no obvious life cycle that suggests when the code can be
> >>>>>>>> unloaded.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 3. Have we considered the class loader, dependency and name
> >>>> space
> >>>>>>>>>>>> isolation
> >>>>>>>>>>>>>>>> issues addressed by such products as Tomcat (web apps) or
> >>>>>> Eclipse
> >>>>>>>>>>>>>>>> (plugins)? Putting user code in the same namespace as
> Drill
> >>>> code
> >>>>>>>>> is
> >>>>>>>>>>>>>> quick
> >>>>>>>>>>>>>>>> & dirty. It turns out, however, that doing so leads to
> >>>> problems
> >>>>>>>>> that
> >>>>>>>>>>>>>>>> require long, frustrating debugging sessions to resolve.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3
> >>>>>>>>>>>>>>>> is a big increase in scope, so I won’t be surprised if we leave those
> >>>>>>>>>>>>>>>> issues for later. (Though, addressing item 2 might be the best way to
> >>>>>>>>>>>>>>>> address item 1.)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If we want a very simple solution that requires minimal
> >>>> change,
> >>>>>>>>>>>> perhaps
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>> can use an even simpler solution. In the proposed design,
> the
> >>>>>>>> user
> >>>>>>>>>>>> still
> >>>>>>>>>>>>>>>> must distribute code to all the nodes. The primary change
> is
> >>>> to
> >>>>>>>>> tell
> >>>>>>>>>>>>>> Drill
> >>>>>>>>>>>>>>>> to load (or unload) that code. Can we accomplish the same result
> >>>>>>>>>>>>>>>> more easily, simply by having Drill periodically scan certain
> >>>>>>>>>>>>>>>> directories looking for new (or removed) jars? That still won’t
> >>>>>>>>>>>>>>>> work with YARN, or solve the
> name
> >>>>>>>> space
> >>>>>>>>>>>>>> issues,
> >>>>>>>>>>>>>>>> but will work for existing non-YARN Drill users without
> new
> >>>> SQL
> >>>>>>>>>>>> syntax.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> - Paul
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Two quick thoughts:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> - (user) In the design document I didn't see any discussion of
> >>>>>>>>>>>>>>>>> ownership/conflicts or unloading. Would be helpful to see the
> >>>>>>>>>>>>>>>>> thinking there
> >>>>>>>>>>>>>>>>> - (dev) There is a row oriented facade via the
> >>>>>>>>>>>>>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good
> >>>>>>>>>>>>>>>>> place to start when trying to implement an alternative interface.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>> Jacques Nadeau
> >>>>>>>>>>>>>>>>> CTO and Co-Founder, Dremio
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Honestly, I don't see it as a priority issue. I think some of the
> >>>>>>>>>>>>>>>>>> ideas around community java UDFs could be a better approach. I'd hate
> >>>>>>>>>>>>>>>>>> to take away from other work to hack in something like this.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Ted refers to source code transformation. Drill gains its speed from
> >>>>>>>>>>>>>>>>>>> value vectors. However, VVs are a far cry from the row-based
> >>>>>>>>>>>>>>>>>>> interface that most mere mortals are accustomed to using. Since VVs
> >>>>>>>>>>>>>>>>>>> are very type specific, code is typically generated to handle the
> >>>>>>>>>>>>>>>>>>> specifics of each type. Accessing VVs in Jython may be a bit of a
> >>>>>>>>>>>>>>>>>>> challenge because of the "impedance mismatch" between how VVs work
> >>>>>>>>>>>>>>>>>>> and the row-and-column view expected by most (non-Drill) developers.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I wonder if we've considered providing a row-oriented "facade" that
> >>>>>>>>>>>>>>>>>>> can be used by roll-your-own data sources and user-defined row
> >>>>>>>>>>>>>>>>>>> transforms? Might be a hiccup in the fast VV pipeline, but might be
> >>>>>>>>>>>>>>>>>>> handy for users willing to trade a bit of speed for convenience. With
> >>>>>>>>>>>>>>>>>>> such a facade, the Jython row transforms that John mentions could be
> >>>>>>>>>>>>>>>>>>> quite simple.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Since UDFs use source code transformation, using Jython would be
> >>>>>>>>>>>>>>>>>>>> difficult.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <
> >>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Charles,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Not that I am aware of. The proposed solution doesn't invent anything
> >>>>>>>>>>>>>>>>>>>>> new, it just adds the possibility to add UDFs without a drillbit
> >>>>>>>>>>>>>>>>>>>>> restart. But contributions are welcome.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Arina,
> >>>>>>>>>>>>>>>>>>>>>> Has there been any discussion about making it possible via Jython or
> >>>>>>>>>>>>>>>>>>>>>> something for users to write simple UDFs in Python?
> >>>>>>>>>>>>>>>>>>>>>> My ideal would be to have this capability integrated in the web GUI
> >>>>>>>>>>>>>>>>>>>>>> such that a user could write their UDF (in Python) right there,
> >>>>>>>>>>>>>>>>>>>>>> submit it and it would be deployed to Drill if it passes validation
> >>>>>>>>>>>>>>>>>>>>>> tests.
> >>>>>>>>>>>>>>>>>>>>>> —C
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi all!
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill
> >>>>>>>>>>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726). There is a link
> >>>>>>>>>>>>>>>>>>>>>>> to the design document in the Jira description.
> >>>>>>>>>>>>>>>>>>>>>>> Comments or suggestions are welcome.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Kind regards
> >>>>>>>>>>>>>>>>>>>>>>> Arina
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >
>
>
