Re: Dynamic UDFs support

Parth Chandra Mon, 27 Jun 2016 14:01:26 -0700

Reading through some of Paul's comments on maintaining a consistent state for
the registration of the UDF, it looks like we need a consensus protocol for
determining that all the Drillbits have the UDF deployed.
I believe ZooKeeper can provide a stronger guarantee than a two-phase
approach. Should we look into that?


On Fri, Jun 24, 2016 at 10:00 AM, Arina Yelchiyeva <
[email protected]> wrote:

> Hi all!
>
> I have updated the design document.
> Main changes:
> 1. The staging and registration DFS locations are added to Drill’s config.
> 2. The user is no longer responsible for copying jars onto drillbit nodes.
> Now the user needs to copy jars into the staging DFS location, from where
> drillbits will copy them to the local fs.
> 3. During UDF registration, jars will be moved to the DFS registration area.
> 4. During start-up, a drillbit will copy all jars from the registration area,
> so a newly added drillbit will have the same UDFs as the others.
> 5. Security issues - these will probably be added later as an enhancement.
>
> More details in the document:
>
> https://docs.google.com/document/d/1MluM17EKajvNP_x8U4aymcOihhUm8BMm8t_hM0jEFWk/edit
>
> Kind regards
> Arina
>
> On Fri, Jun 17, 2016 at 1:25 AM Paul Rogers <[email protected]> wrote:
>
> > Hi All,
> >
> > To answer Arina on item 3: there is actually no good location on any local
> > node to put the UDFs. Reason: DoY allows the admin to start a Drillbit on
> > any available node. When it starts, a new, fresh copy of Drill will be
> > downloaded, and this can happen after the user issued the CREATE command.
> >
> > What we need is a shared, secure distributed storage location from which
> > Drillbits can download the needed jar files. Something like… DFS! Indeed,
> > this is how YARN stores the Drill archive from which it creates the Drill
> > install directory on each node. We can’t quite use YARN’s mechanism (YARN
> > is aware only of the files uploaded when launching an app), but we can do
> > something similar.
> >
> > So, brainstorming a bit…
> >
> > 1. Store the UDF jar in a pre-defined DFS location.
> >
> > 2. The CREATE function 1) uploads the jar to the DFS location, and 2)
> > creates some kind of registry entry.
> >
> > 3. The DELETE function 1) deregisters the jar (and function), but 2) does
> > not delete the jar (this allows in-flight queries to complete.)
> >
> > 4. Drillbits periodically check DFS for changed registrations, downloading
> > any needed jars. (YARN, Spark, Storm and others already do something
> > similar.)
> >
> > 5. Registry check is “forced” when processing a query with a function that
> > is not currently registered. (Doing so resolves any possible race
> > conditions.)
> >
> > 6. Some process (perhaps time based) removes old, unregistered jar files.
> > (Or, we could get fancy and use reference counts. The reference count would
> > be required if the user wants to delete, then recreate, the same function
> > and jar to avoid conflict with in-flight queries.)
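
[Editor's sketch] The registry steps in Paul's list (register on CREATE, deregister without deleting the jar, periodic sync, and a forced check on a lookup miss) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `SharedUdfRegistry` and `LocalUdfCache` are hypothetical names, a map stands in for the DFS registry, and no jar is actually downloaded. It is not Drill's implementation.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the shared registry kept on DFS. */
class SharedUdfRegistry {
    // function name -> jar name
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    void register(String function, String jar) {
        entries.put(function, jar);
    }

    /** Removes only the registry entry; the jar is left for in-flight queries. */
    void deregister(String function) {
        entries.remove(function);
    }

    Map<String, String> snapshot() {
        return Map.copyOf(entries);
    }
}

/** Hypothetical per-Drillbit view of the shared registry. */
class LocalUdfCache {
    private final SharedUdfRegistry shared;
    private volatile Map<String, String> local = Map.of();
    private int refreshCount = 0;

    LocalUdfCache(SharedUdfRegistry shared) {
        this.shared = shared;
    }

    /** Periodic sync; a real Drillbit would also download any new jars here. */
    synchronized void refresh() {
        local = shared.snapshot();
        refreshCount++;
    }

    /** On a miss, force one registry check before reporting the function as unknown. */
    Optional<String> lookup(String function) {
        String jar = local.get(function);
        if (jar == null) {
            refresh();
            jar = local.get(function);
        }
        return Optional.ofNullable(jar);
    }

    synchronized int refreshCount() {
        return refreshCount;
    }
}
```

The forced refresh on a miss is what covers the window between a CREATE and the next periodic sync. A deregistered function stays visible locally until the next refresh, which is the behavior that lets in-flight queries finish.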
> >
> > We can build security on this as follows:
> >
> > 1. Define permissions for who can write to the DFS location. Or, indeed,
> > have subdirectories by user and grant each user permission only on their
> > own UDF directory.
> >
> > 2. Provide separate registries for per-user functions (private) and global
> > functions (public). Only the admin can add global functions. But, only the
> > user that uploads a private function can use it.
> >
> > 3. Leverage the Java class loader to isolate UDFs in their own name space
> > (see Eclipse & Tomcat for examples). That is, Drill can call into a UDF,
> > UDFs can call selected Drill code, but UDFs can’t shadow Drill classes
> > (accidentally or maliciously.) Plus, my function Foo won’t clash with your
> > function Foo if both are private.
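
[Editor's sketch] One way to picture the class loader isolation in item 3 is a separate `URLClassLoader` per user, each with Drill's own loader as parent. Because the standard loader delegates to its parent first, a UDF jar cannot replace a class the parent already provides, and two users' loaders form separate name spaces. `UdfClassLoaders`, `forUser` and `loadOrNull` are hypothetical names for illustration; restricting UDFs to "selected" Drill code would need extra filtering that is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical factory that gives each user an isolated UDF class loader. */
class UdfClassLoaders {
    private final ClassLoader drillLoader;
    private final Map<String, URLClassLoader> perUser = new ConcurrentHashMap<>();

    UdfClassLoaders(ClassLoader drillLoader) {
        this.drillLoader = drillLoader;
    }

    /** Returns the loader for this user, creating it from the jars in jarDir on first use. */
    URLClassLoader forUser(String user, Path jarDir) {
        return perUser.computeIfAbsent(user, u -> create(jarDir));
    }

    private URLClassLoader create(Path jarDir) {
        List<URL> urls = new ArrayList<>();
        if (Files.isDirectory(jarDir)) {
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(jarDir, "*.jar")) {
                for (Path jar : jars) {
                    urls.add(jar.toUri().toURL());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        // Parent-first delegation: classes visible to drillLoader always win over UDF jars.
        return new URLClassLoader(urls.toArray(new URL[0]), drillLoader);
    }

    /** Convenience for callers that prefer null over ClassNotFoundException. */
    static Class<?> loadOrNull(ClassLoader loader, String className) {
        try {
            return loader.loadClass(className);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```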
> >
> > Sorry that this has wandered a bit far from the original simple design,
> > but the above may capture much of what folks expect in modern distributed
> > big data systems.
> >
> > I wonder if a good next step might be to review the notes in the design
> > doc, in the JIRA, and in this e-mail chain and to prepare a summary of
> > technical requirements, and a proposed design. Postpone, at least for now,
> > concerns about the amount of work; we can worry about that once folks agree
> > on your revised design.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > > On Jun 21, 2016, at 9:48 AM, Arina Yelchiyeva <
> > [email protected]> wrote:
> > >
> > > 4. Authorization model mentioned by Julia and John
> > > If a user doesn't have rights to copy jars to the UDF classpath, which can be
> > > restricted by the file system, he won't be able to do much harm by running
> > > the CREATE command. If the UDFs from a jar were already registered, the CREATE
> > > statement will fail. CREATE OR REPLACE will just re-register the UDFs.
> > > But the DELETE command is not safe. If a user knows the jar name, he can delete
> > > all UDFs associated with it, as well as the binary and source jars. That's
> > > where we'll probably need to impose restrictions.
> > >
> > > On Tue, Jun 21, 2016 at 7:34 PM Arina Yelchiyeva <
> > [email protected]>
> > > wrote:
> > >
> > >> 1. DELETE command - I forgot to mention it in the document but had it in
> > >> mind. When a user issues the DELETE command, all UDFs associated with the
> > >> indicated jar are removed from DrillFunctionRegistry. Then the binary and
> > >> source files are also deleted from the UDF classpath.
> > >>
> > >> 2. Distribution race condition described by Paul
> > >> The user issues the CREATE command and gets confirmation that the UDFs are
> > >> registered only if all drillbits have confirmed that registration was successful.
> > >> I don't expect users to start using the UDFs in queries before the CREATE
> > >> command returns its success / failure result, which is possible but strange.
> > >>
> > >> 3. DoY
> > >> @Paul
> > >> What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we use a
> > >> $DRILL_UDF environment variable that is set during drillbit start
> > >> (like $DRILL_LOG_DIR)? The location stored in this variable will be added to
> > >> the Drill classpath during start.
> > >> Would that ease DoY integration somehow?
> > >>
> > >> Kind regards
> > >> Arina
> > >>
> > >> On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman
> > <[email protected]>
> > >> wrote:
> > >>
> > >>> Just thoughts:
> > >>> You can try to reuse the distributed cache. Let the Drill AM do the needful
> > >>> in terms of orchestrating UDF jar distribution.
> > >>> But
> > >>> I would be inclined to have a common path that is independent of whether
> > >>> it is Drill on YARN or not, as maintaining two separate ways of
> > >>> dealing with loading/unloading UDFs will be painful and error prone.
> > >>> One more note (I left a comment in the doc) - not sure about the
> > >>> authorization model here - we need to have some.
> > >>> Just my 2c
> > >>> Thanks
> > >>>
> > >>>      From: Paul Rogers <[email protected]>
> > >>> To: "[email protected]" <[email protected]>
> > >>> Sent: Monday, June 20, 2016 7:32 PM
> > >>> Subject: Re: Dynamic UDFs support
> > >>>
> > >>> Hi Neeraja,
> > >>>
> > >>> The proposal calls for the user to copy the jar file to each Drillbit
> > >>> node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf directory.
> > >>>
> > >>> In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
> > >>> each node (which is good.) YARN puts that code in a location known only to
> > >>> YARN. Since the location is private to YARN, the user can’t easily hunt
> > >>> down the location in order to add the udf jar. Even if the user did find
> > >>> the location, the next Drillbit to start would create a new copy of the
> > >>> Drill software, without the udf jar.
> > >>>
> > >>> Second, in DoY we have separated user files from Drill software. This
> > >>> makes it much easier to distribute the software to each node: we give the
> > >>> Drill distribution tar archive to YARN, and YARN copies it to each node and
> > >>> untars the Drill files. We make a separate copy of the (far smaller) set of
> > >>> user config files.
> > >>>
> > >>> If the udf jar goes into a Drill folder ($DRILL_HOME/jars/3rdparty/udf),
> > >>> then the user would have to rebuild the Drill tar file each time they add a
> > >>> udf jar. When I tried this myself when building DoY, I found it to be slow
> > >>> and error-prone.
> > >>>
> > >>> So, the solution is to place the udf code in the new “site” directory:
> > >>> $DRILL_SITE/jars. That’s what that is for. Then, let DoY automatically
> > >>> distribute the code to every node. Perfect! Except that it does not work to
> > >>> dynamically distribute code after Drill starts.
> > >>>
> > >>> For DoY, the solution requirements are:
> > >>>
> > >>> 1. Distribute code using Drill itself, rather than manually copying jars
> > >>> to (unknown) Drill directories.
> > >>> 2. Ensure the solution works even if another Drillbit is spun up later,
> > >>> and uses the original Drill tar file.
> > >>>
> > >>> I’m thinking we want to leverage DFS: place udf files into a well-known
> > >>> DFS directory. Register the udf into, say, ZK. When a new Drillbit starts,
> > >>> it looks for new udf jars in ZK, copies the files to a temporary location,
> > >>> and launches. An existing Drillbit is notified of the change and does the
> > >>> same download process. Clean-up is needed at some point to remove ZK entries
> > >>> if the udf jar becomes statically available on the next launch. That needs
> > >>> more thought.
> > >>>
> > >>> We’d still need the phases mentioned earlier to ensure consistency.
> > >>>
> > >>> Suggestions anyone as to how to do this super simply & still get it to
> > >>> work with DoY?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> - Paul
> > >>>
> > >>>> On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala <
> > >>> [email protected]> wrote:
> > >>>>
> > >>>> This will need to work with YARN (Once Drill is YARN enabled, I would
> > >>>> expect a lot of users using it in conjunction with YARN).
> > >>>> Paul, I am not clear why this wouldn't work with YARN. Can you elaborate?
> > >>>>
> > >>>> -Neeraja
> > >>>>
> > >>>> On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <[email protected]>
> > >>> wrote:
> > >>>>
> > >>>>> Good enough, as long as we document the limitation that this feature can’t
> > >>>>> work with YARN deployment as users generally do not have access to the
> > >>>>> temporary “localization” directories where the Drill code is placed by YARN.
> > >>>>>
> > >>>>> Note that the jar distribution race condition issue occurs with the
> > >>>>> proposed design: I believe I sketched out a scenario in one of the earlier
> > >>>>> comments. Drillbit A receives the CREATE FUNCTION command. It tells
> > >>>>> Drillbit B. While A is still informing the other Drillbits, Drillbit B plans
> > >>>>> and launches a query that uses the function. Drillbit Z starts execution of
> > >>>>> the query before it learns from A about the new function. This will be rare:
> > >>>>> just rare enough to create very hard-to-reproduce bugs.
> > >>>>>
> > >>>>> The only reliable solution is to do the work in multiple passes:
> > >>>>>
> > >>>>> Pass 1: Ask each node to load the function, but not make it available to
> > >>>>> the planner. (It would be available to the execution engine.)
> > >>>>> Pass 2: Await confirmation from each node that this is done.
> > >>>>> Pass 3: Alert every node that it is now free to plan queries with the
> > >>>>> function.
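
[Editor's sketch] The three passes can be modeled with an in-memory simulation: each node first loads the function for execution only, the coordinator waits for every acknowledgement, and only then enables planning everywhere. `DrillbitNode` and `UdfRegistrationCoordinator` are hypothetical names, nodes are plain objects rather than remote processes, and a failed load is simulated with a flag. The rollback on a missing acknowledgement is an added assumption, not something stated in the thread.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of one Drillbit's view of a function. */
class DrillbitNode {
    private final Set<String> loaded = new HashSet<>();     // visible to execution
    private final Set<String> plannable = new HashSet<>();  // visible to the planner
    private final boolean healthy;

    DrillbitNode(boolean healthy) {
        this.healthy = healthy;
    }

    /** Pass 1: load for execution only. The return value is the pass 2 acknowledgement. */
    boolean load(String function) {
        if (!healthy) {
            return false;
        }
        loaded.add(function);
        return true;
    }

    /** Pass 3: allow the planner to use a function that is already loaded. */
    void enablePlanning(String function) {
        if (loaded.contains(function)) {
            plannable.add(function);
        }
    }

    void unload(String function) {
        plannable.remove(function);
        loaded.remove(function);
    }

    boolean canExecute(String function) {
        return loaded.contains(function);
    }

    boolean canPlan(String function) {
        return plannable.contains(function);
    }
}

/** Hypothetical coordinator running the passes in order. */
class UdfRegistrationCoordinator {
    static boolean register(String function, List<DrillbitNode> nodes) {
        boolean allAcked = true;
        for (DrillbitNode node : nodes) {      // passes 1 and 2
            allAcked &= node.load(function);
        }
        if (!allAcked) {                       // assumed rollback: nobody keeps the function
            for (DrillbitNode node : nodes) {
                node.unload(function);
            }
            return false;
        }
        for (DrillbitNode node : nodes) {      // pass 3
            node.enablePlanning(function);
        }
        return true;
    }
}
```

Because no node can plan with the function until every node can execute it, a query planned on one Drillbit never reaches a Drillbit that lacks the code.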
> > >>>>>
> > >>>>> Finally, I wonder if we should design the SQL syntax based on a long-term
> > >>>>> design, even if the feature itself is a short-term work-around. Changing
> > >>>>> the syntax later might break scripts that users might write.
> > >>>>>
> > >>>>> So, the question for the group is this: is the value of a semi-complete
> > >>>>> feature sufficient to justify the potential problems?
> > >>>>>
> > >>>>> - Paul
> > >>>>>
> > >>>>>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <[email protected]
> >
> > >>>>> wrote:
> > >>>>>>
> > >>>>>> Moving discussion to dev.
> > >>>>>>
> > >>>>>> I believe the aim is to do a simple implementation without the complexity
> > >>>>>> of distributing the UDF. I think the document should make this limitation
> > >>>>>> clear.
> > >>>>>>
> > >>>>>> Per Paul's point on there being a simpler solution of just having each
> > >>>>>> drillbit detect whether a UDF is present, I think the problem is if a UDF
> > >>>>>> gets deployed to some but not all drillbits. A query can then start
> > >>>>>> executing but not run successfully. The intent of the create commands would
> > >>>>>> be to ensure that either all drillbits have the UDF or none do.
> > >>>>>>
> > >>>>>> I think Jacques' point about ownership conflicts is not addressed clearly.
> > >>>>>> Also, the unloading is not clear. The delete command should probably remove
> > >>>>>> the UDF and unload it.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <
> [email protected]
> > >
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>> Reviewed the spec; many comments posted. Three primary comments for the
> > >>>>>>> community to consider.
> > >>>>>>>
> > >>>>>>> 1. The design conflicts with the Drill-on-YARN project. Is this a specific
> > >>>>>>> fix for one unique problem, or is it worth expanding the solution to work
> > >>>>>>> with Drill-on-YARN deployments? Might be hard to make the two work together
> > >>>>>>> later. See comments in docs for details.
> > >>>>>>>
> > >>>>>>> 2. Have we, by chance, looked at how other projects handle code
> > >>>>>>> distribution? Spark, Storm and others automatically deploy code across the
> > >>>>>>> cluster; no manual distribution to each node. The key difference between
> > >>>>>>> Drill and others is that, for Storm, say, code is associated with a job
> > >>>>>>> (“topology” in Storm terms.) But, in Drill, functions are global and have
> > >>>>>>> no obvious life cycle that suggests when the code can be unloaded.
> > >>>>>>>
> > >>>>>>> 3. Have we considered the class loader, dependency and name space isolation
> > >>>>>>> issues addressed by such products as Tomcat (web apps) or Eclipse
> > >>>>>>> (plugins)? Putting user code in the same namespace as Drill code is quick
> > >>>>>>> & dirty. It turns out, however, that doing so leads to problems that
> > >>>>>>> require long, frustrating debugging sessions to resolve.
> > >>>>>>>
> > >>>>>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3 would be
> > >>>>>>> a big increase in scope, so I won’t be surprised if we leave those issues for
> > >>>>>>> later. (Though, addressing item 2 might be the best way to address item 1.)
> > >>>>>>>
> > >>>>>>> If we want a very simple solution that requires minimal change, perhaps we
> > >>>>>>> can use an even simpler solution. In the proposed design, the user still
> > >>>>>>> must distribute code to all the nodes. The primary change is to tell Drill
> > >>>>>>> to load (or unload) that code. Can we accomplish the same result more easily
> > >>>>>>> by having Drill periodically scan certain directories looking for new (or
> > >>>>>>> removed) jars? This still won’t work with YARN, or solve the name space
> > >>>>>>> issues, but it will work for existing non-YARN Drill users without new SQL
> > >>>>>>> syntax.
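
[Editor's sketch] The periodic directory scan Paul suggests amounts to comparing the current jar listing with the previous one. The sketch below assumes a single local directory and hypothetical names (`JarDirectoryScanner`, `Changes`, `listJars`, `update`). Scheduling the scan and actually loading or unloading classes are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical scanner that reports jars added or removed since the last scan. */
class JarDirectoryScanner {
    /** Result of one scan: jar names that appeared and jar names that vanished. */
    static final class Changes {
        final Set<String> added;
        final Set<String> removed;

        Changes(Set<String> added, Set<String> removed) {
            this.added = added;
            this.removed = removed;
        }
    }

    private Set<String> known = new TreeSet<>();

    /** Lists the *.jar file names currently present in dir. */
    static Set<String> listJars(Path dir) {
        Set<String> names = new TreeSet<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Compares the current listing with the previous one and remembers it for next time. */
    Changes update(Set<String> current) {
        Set<String> added = new TreeSet<>(current);
        added.removeAll(known);
        Set<String> removed = new TreeSet<>(known);
        removed.removeAll(current);
        known = new TreeSet<>(current);
        return new Changes(added, removed);
    }
}
```

A timer would call `scanner.update(JarDirectoryScanner.listJars(dir))` on each tick and load or unload functions based on the returned sets. As the message notes, this does nothing for the case where some nodes have the jar and others do not.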
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>>
> > >>>>>>> - Paul
> > >>>>>>>
> > >>>>>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <[email protected]
> >
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Two quick thoughts:
> > >>>>>>>>
> > >>>>>>>> - (user) In the design document I didn't see any discussion of
> > >>>>>>>> ownership/conflicts or unloading. Would be helpful to see the thinking there
> > >>>>>>>> - (dev) There is a row oriented facade via the
> > >>>>>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good place
> > >>>>>>>> to start when trying to implement an alternative interface.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Jacques Nadeau
> > >>>>>>>> CTO and Co-Founder, Dremio
> > >>>>>>>>
> > >>>>>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <
> [email protected]>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Honestly, I don't see it as a priority issue. I think some of the ideas
> > >>>>>>>>> around community java UDFs could be a better approach. I'd hate to take
> > >>>>>>>>> away from other work to hack in something like this.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <
> > [email protected]
> > >>>>
> > >>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Ted refers to source code transformation. Drill gains its speed from value
> > >>>>>>>>>> vectors. However, VVs are a far cry from the row-based interface that most
> > >>>>>>>>>> mere mortals are accustomed to using. Since VVs are very type specific,
> > >>>>>>>>>> code is typically generated to handle the specifics of each type. Accessing
> > >>>>>>>>>> VVs in Jython may be a bit of a challenge because of the "impedance
> > >>>>>>>>>> mismatch" between how VVs work and the row-and-column view expected by most
> > >>>>>>>>>> (non-Drill) developers.
> > >>>>>>>>>>
> > >>>>>>>>>> I wonder if we've considered providing a row-oriented "facade" that can be
> > >>>>>>>>>> used by roll-your-own data sources and user-defined row transforms? Might
> > >>>>>>>>>> be a hiccup in the fast VV pipeline, but might be handy for users willing
> > >>>>>>>>>> to trade a bit of speed for convenience. With such a facade, the Jython row
> > >>>>>>>>>> transforms that John mentions could be quite simple.
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <
> > >>> [email protected]
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Since UDFs use source code transformation, using Jython would be
> > >>>>>>>>>>> difficult.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <
> > >>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Charles,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Not that I am aware of. The proposed solution doesn't invent anything new;
> > >>>>>>>>>>>> it just adds the possibility to add UDFs without a drillbit restart. But
> > >>>>>>>>>>>> contributions are welcome.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <
> > [email protected]
> > >>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Arina,
> > >>>>>>>>>>>>> Has there been any discussion about making it possible via Jython or
> > >>>>>>>>>>>>> something for users to write simple UDFs in Python?
> > >>>>>>>>>>>>> My ideal would be to have this capability integrated in the web GUI such
> > >>>>>>>>>>>>> that a user could write their UDF (in Python) right there, submit it and it
> > >>>>>>>>>>>>> would be deployed to Drill if it passes validation tests.
> > >>>>>>>>>>>>> —C
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <
> > >>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi all!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I have created a Jira to add dynamic UDF support to Drill (
> > >>>>>>>>>>>>>> https://issues.apache.org/jira/browse/DRILL-4726). There is a link to the
> > >>>>>>>>>>>>>> design document in the Jira description.
> > >>>>>>>>>>>>>> Comments or suggestions are welcome.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Kind regards
> > >>>>>>>>>>>>>> Arina
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> >
> >
>
