Re: Dynamic UDFs support

Subbu Srinivasan Tue, 21 Jun 2016 09:49:22 -0700

DevOps needs some control over when and how UDFs are deployed. From an
operational perspective, we need to provide some control over how these
jars can be loaded into a running system.



On Tue, Jun 21, 2016 at 9:46 AM, Neeraja Rentachintala <
[email protected]> wrote:

> While figuring out where to load the jars from and how to distribute
> them across Drillbits, we need to keep one thing in mind. The primary
> goal of the Dynamic UDFs feature is this: central IT has deployed a
> Drill cluster, and the users working with the data on that cluster need
> to be able to write their own UDFs and deploy them onto the cluster
> without having to work with the IT/deployment teams to restart the
> Drill cluster.
>
> To that end, one question I have is who is responsible for placing the
> UDF jar in the specific locations on the Drillbits. Are we expecting end
> users to keep the jars accessible for Drill to load? Or does the user
> simply supply a local directory containing the jar, which Drill then
> takes and deploys on all the Drillbits in the cluster, with or without
> YARN?
>
>
>
> On Tue, Jun 21, 2016 at 9:34 AM, Arina Yelchiyeva <
> [email protected]> wrote:
>
> > 1. DELETE command - I forgot to include it in the document but had it
> > in mind.
> > When a user issues the DELETE command, all UDFs associated with the
> > indicated jar are removed from DrillFunctionRegistry. The binary and
> > source files are then also deleted from the UDF classpath.
> >
> > 2. Distribution race condition described by Paul
> > The user issues the CREATE command and gets confirmation that the UDFs
> > are registered only if all drillbits have confirmed that registration
> > was successful. I don't expect users to start using the UDFs in queries
> > before the CREATE command returns its success / failure result, which
> > is possible but strange.
> >
> > 3. DoY
> > @Paul
> > What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we
> > use a $DRILL_UDF environment variable that is set during drillbit start
> > (like $DRILL_LOG_DIR)? The location stored in this variable would be
> > added to the Drill classpath during start.
> > Would that ease DoY integration somehow?
> >
> > Kind regards
> > Arina
> >
> > On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman <[email protected]>
> > wrote:
> >
> > > Just thoughts:
> > > You can try to reuse the distributed cache. Let the Drill AM do the
> > > needful in terms of orchestrating distribution of the UDF jars.
> > > But I would be inclined to have a common path that is independent of
> > > whether it is Drill on YARN or not, as maintaining two separate ways
> > > of dealing with loading/unloading UDFs will be painful and error-prone.
> > > One more note (I left a comment in the doc): I am not sure about the
> > > authorization model here. We need to have one.
> > > Just my 2c. Thanks
> > >
> > >       From: Paul Rogers <[email protected]>
> > >  To: "[email protected]" <[email protected]>
> > >  Sent: Monday, June 20, 2016 7:32 PM
> > >  Subject: Re: Dynamic UDFs support
> > >
> > > Hi Neeraja,
> > >
> > > The proposal calls for the user to copy the jar file to each Drillbit
> > > node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf
> > > directory.
> > >
> > > In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
> > > each node (which is good). YARN puts that code in a location known only
> > > to YARN. Since the location is private to YARN, the user can’t easily
> > > hunt down the location in order to add the udf jar. Even if the user
> > > did find the location, the next Drillbit to start would create a new
> > > copy of the Drill software, without the udf jar.
> > >
> > > Second, in DoY we have separated user files from Drill software. This
> > > makes it much easier to distribute the software to each node: we give
> > > the Drill distribution tar archive to YARN, and YARN copies it to each
> > > node and untars the Drill files. We make a separate copy of the (far
> > > smaller) set of user config files.
> > >
> > > If the udf jar goes into a Drill folder
> > > ($DRILL_HOME/jars/3rdparty/udf), then the user would have to rebuild
> > > the Drill tar file each time they add a udf jar. When I tried this
> > > myself when building DoY, I found it to be slow and error-prone.
> > >
> > > So, the solution is to place the udf code in the new “site” directory:
> > > $DRILL_SITE/jars. That’s what that is for. Then, let DoY automatically
> > > distribute the code to every node. Perfect! Except that it does not
> > > work to dynamically distribute code after Drill starts.
> > >
> > > For DoY, the solution requirements are:
> > >
> > > 1. Distribute code using Drill itself, rather than manually copying
> > > jars to (unknown) Drill directories.
> > > 2. Ensure the solution works even if another Drillbit is spun up later,
> > > and uses the original Drill tar file.
> > >
> > > I’m thinking we want to leverage DFS: place udf files into a well-known
> > > DFS directory. Register the udf into, say, ZK. When a new Drillbit
> > > starts, it looks for new udf jars in ZK, copies the files to a
> > > temporary location, and launches. An existing Drillbit is notified of
> > > the change and does the same download process. Clean-up is needed at
> > > some point to remove ZK entries if the udf jar becomes statically
> > > available on the next launch. That needs more thought.
> > >
> > > We’d still need the phases mentioned earlier to ensure consistency.
> > >
> > > Suggestions anyone as to how to do this super simply & still get it to
> > > work with DoY?
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > > > On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala <
> > > [email protected]> wrote:
> > > >
> > > > This will need to work with YARN (once Drill is YARN-enabled, I would
> > > > expect a lot of users to use it in conjunction with YARN).
> > > > Paul, I am not clear why this wouldn't work with YARN. Can you
> > > > elaborate?
> > > >
> > > > -Neeraja
> > > >
> > > > On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <[email protected]>
> > > wrote:
> > > >
> > > >> Good enough, as long as we document the limitation that this
> > > >> feature can’t work with YARN deployment, as users generally do not
> > > >> have access to the temporary “localization” directories where the
> > > >> Drill code is placed by YARN.
> > > >>
> > > >> Note that the jar distribution race condition issue occurs with the
> > > >> proposed design: I believe I sketched out a scenario in one of the
> > > >> earlier comments. Drillbit A receives the CREATE FUNCTION command.
> > > >> It tells Drillbit B. While A is still informing the other Drillbits,
> > > >> Drillbit B plans and launches a query that uses the function.
> > > >> Drillbit Z starts execution of the query before it learns from A
> > > >> about the new function. This will be rare: just rare enough to
> > > >> create very hard-to-reproduce bugs.
> > > >>
> > > >> The only reliable solution is to do the work in multiple passes:
> > > >>
> > > >> Pass 1: Ask each node to load the function, but not make it
> > > >> available to the planner. (It would be available to the execution
> > > >> engine.)
> > > >> Pass 2: Await confirmation from each node that this is done.
> > > >> Pass 3: Alert every node that it is now free to plan queries with
> > > >> the function.
> > > >>
> > > >> Finally, I wonder if we should design the SQL syntax based on a
> > > >> long-term design, even if the feature itself is a short-term
> > > >> work-around. Changing the syntax later might break scripts that
> > > >> users might write.
> > > >>
> > > >> So, the question for the group is this: is the value of a
> > > >> semi-complete feature sufficient to justify the potential problems?
> > > >>
> > > >> - Paul
> > > >>
> > > >>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <[email protected]>
> > > >> wrote:
> > > >>>
> > > >>> Moving discussion to dev.
> > > >>>
> > > >>> I believe the aim is to do a simple implementation without the
> > > >>> complexity of distributing the UDF. I think the document should
> > > >>> make this limitation clear.
> > > >>>
> > > >>> Per Paul's point on there being a simpler solution of just having
> > > >>> each drillbit detect whether a UDF is present, I think the problem
> > > >>> is if a UDF gets deployed to some but not all drillbits. A query
> > > >>> can then start executing but not run successfully. The intent of
> > > >>> the create commands would be to ensure that either all drillbits
> > > >>> have the UDF or none do.
> > > >>>
> > > >>> I think Jacques' point about ownership conflicts is not addressed
> > > >>> clearly. Also, the unloading is not clear. The delete command
> > > >>> should probably remove the UDF and unload it.
> > > >>>
> > > >>>
> > > >>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <
> [email protected]>
> > > >> wrote:
> > > >>>
> > > >>>> Reviewed the spec; many comments posted. Three primary comments
> > > >>>> for the community to consider.
> > > >>>>
> > > >>>> 1. The design conflicts with the Drill-on-YARN project. Is this a
> > > >>>> specific fix for one unique problem, or is it worth expanding the
> > > >>>> solution to work with Drill-on-YARN deployments? Might be hard to
> > > >>>> make the two work together later. See comments in docs for
> > > >>>> details.
> > > >>>>
> > > >>>> 2. Have we, by chance, looked at how other projects handle code
> > > >>>> distribution? Spark, Storm and others automatically deploy code
> > > >>>> across the cluster; no manual distribution to each node. The key
> > > >>>> difference between Drill and others is that, for Storm, say, code
> > > >>>> is associated with a job (“topology” in Storm terms). But, in
> > > >>>> Drill, functions are global and have no obvious life cycle that
> > > >>>> suggests when the code can be unloaded.
> > > >>>>
> > > >>>> 3. Have we considered the class loader, dependency and name space
> > > >>>> isolation issues addressed by such products as Tomcat (web apps)
> > > >>>> or Eclipse (plugins)? Putting user code in the same namespace as
> > > >>>> Drill code is quick & dirty. It turns out, however, that doing so
> > > >>>> leads to problems that require long, frustrating debugging
> > > >>>> sessions to resolve.
> > > >>>>
> > > >>>> Addressing item 1 might expand scope a bit. Addressing items 2
> > > >>>> and 3 is a big increase in scope, so I won’t be surprised if we
> > > >>>> leave those issues for later. (Though, addressing item 2 might be
> > > >>>> the best way to address item 1.)
> > > >>>>
> > > >>>> If we want a very simple solution that requires minimal change,
> > > >>>> perhaps we can use an even simpler solution. In the proposed
> > > >>>> design, the user still must distribute code to all the nodes. The
> > > >>>> primary change is to tell Drill to load (or unload) that code.
> > > >>>> Can we accomplish the same result more easily, simply by having
> > > >>>> Drill periodically scan certain directories looking for new (or
> > > >>>> removed) jars? Still won’t work with YARN, or solve the name
> > > >>>> space issues, but will work for existing non-YARN Drill users
> > > >>>> without new SQL syntax.
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> - Paul
> > > >>>>
> > > >>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <[email protected]>
> > > >> wrote:
> > > >>>>>
> > > >>>>> Two quick thoughts:
> > > >>>>>
> > > >>>>> - (user) In the design document I didn't see any discussion of
> > > >>>>> ownership/conflicts or unloading. Would be helpful to see the
> > > >>>>> thinking there.
> > > >>>>> - (dev) There is a row-oriented facade via the
> > > >>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a
> > > >>>>> good place to start when trying to implement an alternative
> > > >>>>> interface.
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>>> Jacques Nadeau
> > > >>>>> CTO and Co-Founder, Dremio
> > > >>>>>
> > > >>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <[email protected]
> >
> > > >> wrote:
> > > >>>>>
> > > >>>>>> Honestly, I don't see it as a priority issue. I think some of
> > > >>>>>> the ideas around community Java UDFs could be a better approach.
> > > >>>>>> I'd hate to take away from other work to hack in something like
> > > >>>>>> this.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <
> > [email protected]>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>> Ted refers to source code transformation. Drill gains its speed
> > > >>>>>>> from value vectors. However, VVs are a far cry from the
> > > >>>>>>> row-based interface that most mere mortals are accustomed to
> > > >>>>>>> using. Since VVs are very type-specific, code is typically
> > > >>>>>>> generated to handle the specifics of each type. Accessing VVs
> > > >>>>>>> in Jython may be a bit of a challenge because of the "impedance
> > > >>>>>>> mismatch" between how VVs work and the row-and-column view
> > > >>>>>>> expected by most (non-Drill) developers.
> > > >>>>>>>
> > > >>>>>>> I wonder if we've considered providing a row-oriented "facade"
> > > >>>>>>> that can be used by roll-your-own data sources and user-defined
> > > >>>>>>> row transforms? Might be a hiccup in the fast VV pipeline, but
> > > >>>>>>> might be handy for users willing to trade a bit of speed for
> > > >>>>>>> convenience. With such a facade, the Jython row transforms that
> > > >>>>>>> John mentions could be quite simple.
> > > >>>>>>>
> > > >>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <
> > > [email protected]
> > > >>>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Since UDFs use source code transformation, using Jython would
> > > >>>>>>>> be difficult.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <
> > > >>>>>>>> [email protected]> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Charles,
> > > >>>>>>>>>
> > > >>>>>>>>> Not that I am aware of. The proposed solution doesn't invent
> > > >>>>>>>>> anything new; it just adds the ability to add UDFs without a
> > > >>>>>>>>> drillbit restart. But contributions are welcome.
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <
> > [email protected]>
> > > >>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Arina,
> > > >>>>>>>>>> Has there been any discussion about making it possible via
> > > >>>>>>>>>> Jython or something for users to write simple UDFs in Python?
> > > >>>>>>>>>> My ideal would be to have this capability integrated in the
> > > >>>>>>>>>> web GUI such that a user could write their UDF (in Python)
> > > >>>>>>>>>> right there, submit it, and it would be deployed to Drill if
> > > >>>>>>>>>> it passes validation tests.
> > > >>>>>>>>>> —C
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <
> > > >>>>>>>>> [email protected]>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Hi all!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill
> > > >>>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726). There is
> > > >>>>>>>>>>> a link to the design document in the Jira description.
> > > >>>>>>>>>>> Comments or suggestions are welcome.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Kind regards
> > > >>>>>>>>>>> Arina
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> > >
> >
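
The three-pass registration Paul describes in the thread (load, confirm, then
enable planning) could look roughly like the sketch below. It is a minimal,
synchronous, in-memory illustration in Python, not Drill code: the Drillbit
class, its methods, and register_udf are hypothetical names, and the rollback
on a failed acknowledgement is an assumption based on the "all drillbits have
the UDF or none" intent mentioned by Parth.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Hypothetical stand-in for one node; not one of Drill's real classes."""
    name: str
    loaded: set = field(default_factory=set)     # usable by the execution engine
    plannable: set = field(default_factory=set)  # visible to the planner

    def load(self, udf: str) -> bool:
        # Pass 1 on this node: load the function without exposing it to planning.
        self.loaded.add(udf)
        return True

    def unload(self, udf: str) -> None:
        # Remove the function from both the planner and the execution engine.
        self.plannable.discard(udf)
        self.loaded.discard(udf)

    def enable_planning(self, udf: str) -> None:
        # Pass 3 on this node: only an already loaded function may be planned.
        if udf not in self.loaded:
            raise RuntimeError(f"{self.name} has not loaded {udf}")
        self.plannable.add(udf)


def register_udf(udf: str, drillbits: list) -> bool:
    """Register udf on every node, or on none of them (assumed rollback)."""
    # Pass 1: ask each node to load the function.
    acks = {bit.name: bit.load(udf) for bit in drillbits}
    # Pass 2: require confirmation from all nodes; undo the load otherwise.
    if not all(acks.values()):
        for bit in drillbits:
            bit.unload(udf)
        return False
    # Pass 3: every node can execute the function, so planning is now safe.
    for bit in drillbits:
        bit.enable_planning(udf)
    return True


if __name__ == "__main__":
    cluster = [Drillbit("A"), Drillbit("B"), Drillbit("Z")]
    print(register_udf("my_upper", cluster))
    print([sorted(bit.plannable) for bit in cluster])
```

Because no node may plan with the function until every node has loaded it, a
query planned on B can no longer reach a Z that has never heard of the
function. A real implementation would be asynchronous and would also have to
handle Drillbits that join or fail between passes.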
>
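
Paul's DFS-plus-ZK idea (register the jar, then have each starting or notified
Drillbit copy whatever it is missing) might be sketched as follows. This is
only an illustration under loud assumptions: a Python set stands in for the ZK
entries, an ordinary directory stands in for the well-known DFS directory, and
sync_udf_jars is a hypothetical helper, not a Drill or ZooKeeper API.

```python
import shutil
import tempfile
from pathlib import Path


def sync_udf_jars(registry: set, dfs_dir: Path, local_dir: Path) -> list:
    """Copy every registered jar that is not yet present locally.

    registry stands in for the ZK entries and dfs_dir for the well-known DFS
    directory; both are hypothetical simplifications.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for jar_name in sorted(registry):
        target = local_dir / jar_name
        if target.exists():
            continue  # already copied on an earlier start or notification
        shutil.copy2(dfs_dir / jar_name, target)
        fetched.append(jar_name)
    return fetched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dfs = Path(tmp) / "dfs" / "udf"
        dfs.mkdir(parents=True)
        (dfs / "my-udfs.jar").write_bytes(b"placeholder jar contents")
        registry = {"my-udfs.jar"}
        local = Path(tmp) / "drillbit-1" / "udf"
        print(sync_udf_jars(registry, dfs, local))  # new Drillbit start
        print(sync_udf_jars(registry, dfs, local))  # later change notification
```

The same function serves both cases in the proposal: a new Drillbit calls it
before launching, and a running Drillbit calls it when told the registry
changed. Clean-up of stale entries, which Paul flags as needing more thought,
is deliberately left out.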
