DevOps needs some control over when and how UDFs are deployed. From an operational perspective, we need to provide some control over how these jars can be loaded into a running system.
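To make the operational concern concrete, here is a minimal sketch of the multi-pass registration Paul describes further down the thread (load everywhere, await confirmation, then expose to the planner), combined with Parth's point that either all Drillbits have the UDF or none do. It is an in-memory illustration only, not Drill code: the `Drillbit` class, `register_udf`, and the `fail_on_load` hook are hypothetical names, and real coordination would go over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Hypothetical stand-in for one node's function registry (not a Drill API)."""
    name: str
    loaded: set = field(default_factory=set)     # usable by the execution engine
    plannable: set = field(default_factory=set)  # visible to the query planner
    fail_on_load: bool = False                   # hook to simulate a node that cannot load

    def load(self, udf: str) -> bool:
        # Pass 1 on the node: load the function, but keep it hidden from planning.
        if self.fail_on_load:
            return False
        self.loaded.add(udf)
        return True

    def enable_planning(self, udf: str) -> None:
        # Pass 3 on the node: the planner may now use the function.
        self.plannable.add(udf)

    def unload(self, udf: str) -> None:
        self.loaded.discard(udf)
        self.plannable.discard(udf)


def register_udf(drillbits, udf: str) -> bool:
    """Return True only if every node loaded the UDF and was told to plan with it."""
    # Pass 1: ask each node to load the function for execution only.
    acks = {bit.name: bit.load(udf) for bit in drillbits}

    # Pass 2: wait for confirmation from every node; undo everywhere on any failure.
    if not all(acks.values()):
        for bit in drillbits:
            bit.unload(udf)
        return False

    # Pass 3: tell every node it is now free to plan queries with the function.
    for bit in drillbits:
        bit.enable_planning(udf)
    return True


if __name__ == "__main__":
    cluster = [Drillbit("a"), Drillbit("b"), Drillbit("z")]
    print(register_udf(cluster, "my_udf"))
```

Because a node only plans with the function after every node can execute it, a query planned on one Drillbit should not reach a Drillbit that has never heard of the function.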

On Tue, Jun 21, 2016 at 9:46 AM, Neeraja Rentachintala <nrentachint...@maprtech.com> wrote:

> While trying to figure out the design of where to load the jars from and
> how to distribute across Drillbits, we need to keep one thing in mind.
> The primary goal of the Dynamic UDFs feature is that Central IT has
> deployed a Drill cluster and users of the environment that are working with
> the data on the cluster need to be able to write their own UDFs and deploy
> them onto the cluster without having to work with the IT/deployments teams
> to restart the Drill cluster.
>
> To this extent, one question I have is who is responsible for placing the
> UDF jar in the specific locations on Drillbits. Are we expecting end users
> to keep the jars accessible for Drill to load? Or does the user simply
> supply a local directory of the jar which is taken by Drill and deployed on
> all the Drillbits in the cluster either with YARN or without YARN?
>
> On Tue, Jun 21, 2016 at 9:34 AM, Arina Yelchiyeva <arina.yelchiy...@gmail.com> wrote:
>
> > 1. DELETE command - I forgot to mention it in the document but had it in
> > mind. When a user issues the DELETE command, all UDFs associated with the
> > indicated jar are removed from DrillFunctionRegistry. Then the binary and
> > source files are also deleted from the UDF classpath.
> >
> > 2. Distribution race condition described by Paul
> > The user issues the CREATE command and gets confirmation that the UDFs
> > are registered only if all drillbits have confirmed that registration was
> > successful. I don't expect users to start using UDFs in queries prior to
> > the CREATE command success / failure result, which is possible but
> > strange.
> >
> > 3. DoY
> > @Paul
> > What if, instead of using $DRILL_HOME/jars/3rdparty/udf directly, we use
> > a $DRILL_UDF environment variable which will be set during drillbit start
> > (like $DRILL_LOG_DIR)? The location stored in this variable will be added
> > to the Drill classpath during start.
> > Will it ease DoY integration somehow?
> >
> > Kind regards
> > Arina
> >
> > On Tue, Jun 21, 2016 at 7:15 PM yuliya Feldman <yufeld...@yahoo.com.invalid> wrote:
> >
> > > Just thoughts:
> > > You can try to reuse the distributed cache. Let the Drill AM do the
> > > needful in terms of orchestrating UDF jar distribution.
> > > But I would be inclined to have a common path that is independent of
> > > whether it is Drill on YARN or not, as maintaining two separate ways of
> > > dealing with loading/unloading UDFs will be painful and error prone.
> > > One more note (I left a comment in the doc) - not sure about the
> > > authorization model here - we need to have some.
> > > Just my 2c
> > > Thanks
> > >
> > > From: Paul Rogers <prog...@maprtech.com>
> > > To: "dev@drill.apache.org" <dev@drill.apache.org>
> > > Sent: Monday, June 20, 2016 7:32 PM
> > > Subject: Re: Dynamic UDFs support
> > >
> > > Hi Neeraja,
> > >
> > > The proposal calls for the user to copy the jar file to each Drillbit
> > > node. The jar would go into a new $DRILL_HOME/jars/3rdparty/udf
> > > directory.
> > >
> > > In Drill-on-YARN (DoY), YARN is responsible for copying Drill code to
> > > each node (which is good.) YARN puts that code in a location known only
> > > to YARN. Since the location is private to YARN, the user can’t easily
> > > hunt down the location in order to add the udf jar. Even if the user did
> > > find the location, the next Drillbit to start would create a new copy of
> > > the Drill software, without the udf jar.
> > >
> > > Second, in DoY we have separated user files from Drill software. This
> > > makes it much easier to distribute the software to each node: we give
> > > the Drill distribution tar archive to YARN, and YARN copies it to each
> > > node and untars the Drill files. We make a separate copy of the (far
> > > smaller) set of user config files.
> > >
> > > If the udf jar goes into a Drill folder ($DRILL_HOME/jars/3rdparty/udf),
> > > then the user would have to rebuild the Drill tar file each time they
> > > add a udf jar. When I tried this myself when building DoY, I found it to
> > > be slow and error-prone.
> > >
> > > So, the solution is to place the udf code in the new “site” directory:
> > > $DRILL_SITE/jars. That’s what that is for. Then, let DoY automatically
> > > distribute the code to every node. Perfect! Except that it does not work
> > > to dynamically distribute code after Drill starts.
> > >
> > > For DoY, the solution requirements are:
> > >
> > > 1. Distribute code using Drill itself, rather than manually copying jars
> > > to (unknown) Drill directories.
> > > 2. Ensure the solution works even if another Drillbit is spun up later,
> > > and uses the original Drill tar file.
> > >
> > > I’m thinking we want to leverage DFS: place udf files into a well-known
> > > DFS directory. Register the udf into, say, ZK. When a new Drillbit
> > > starts, it looks for new udf jars in ZK, copies the file to a temporary
> > > location, and launches. An existing Drill is notified of the change and
> > > does the same download process. Clean-up is needed at some point to
> > > remove ZK entries if the udf jar becomes statically available on the
> > > next launch. That needs more thought.
> > >
> > > We’d still need the phases mentioned earlier to ensure consistency.
> > >
> > > Suggestions anyone as to how to do this super simply & still get it to
> > > work with DoY?
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > > > On Jun 20, 2016, at 7:18 PM, Neeraja Rentachintala <nrentachint...@maprtech.com> wrote:
> > > >
> > > > This will need to work with YARN (Once Drill is YARN enabled, I would
> > > > expect a lot of users using it in conjunction with YARN).
> > > > Paul, I am not clear why this wouldn't work with YARN. Can you
> > > > elaborate.
> > > >
> > > > -Neeraja
> > > >
> > > > On Mon, Jun 20, 2016 at 7:01 PM, Paul Rogers <prog...@maprtech.com> wrote:
> > > >
> > > >> Good enough, as long as we document the limitation that this feature
> > > >> can’t work with YARN deployment as users generally do not have access
> > > >> to the temporary “localization” directories where the Drill code is
> > > >> placed by YARN.
> > > >>
> > > >> Note that the jar distribution race condition issue occurs with the
> > > >> proposed design: I believe I sketched out a scenario in one of the
> > > >> earlier comments. Drillbit A receives the CREATE FUNCTION command. It
> > > >> tells Drillbit B. While A is informing the other Drillbits, Drillbit B
> > > >> plans and launches a query that uses the function. Drillbit Z starts
> > > >> execution of the query before it learns from A about the new function.
> > > >> This will be rare, just rare enough to create very hard to reproduce
> > > >> bugs.
> > > >>
> > > >> The only reliable solution is to do the work in multiple passes:
> > > >>
> > > >> Pass 1: Ask each node to load the function, but not make it available
> > > >> to the planner. (It would be available to the execution engine.)
> > > >> Pass 2: Await confirmation from each node that this is done.
> > > >> Pass 3: Alert every node that it is now free to plan queries with the
> > > >> function.
> > > >>
> > > >> Finally, I wonder if we should design the SQL syntax based on a
> > > >> long-term design, even if the feature itself is a short-term
> > > >> work-around. Changing the syntax later might break scripts that users
> > > >> might write.
> > > >>
> > > >> So, the question for the group is this: is the value of a
> > > >> semi-complete feature sufficient to justify the potential problems?
> > > >>
> > > >> - Paul
> > > >>
> > > >>> On Jun 20, 2016, at 6:15 PM, Parth Chandra <pchan...@maprtech.com> wrote:
> > > >>>
> > > >>> Moving discussion to dev.
> > > >>>
> > > >>> I believe the aim is to do a simple implementation without the
> > > >>> complexity of distributing the UDF. I think the document should make
> > > >>> this limitation clear.
> > > >>>
> > > >>> Per Paul's point on there being a simpler solution of just having
> > > >>> each drillbit detect if a UDF is present, I think the problem is if a
> > > >>> UDF gets deployed to some but not all drillbits. A query can then
> > > >>> start executing but not run successfully. The intent of the create
> > > >>> commands would be to ensure that all drillbits have the UDF or none
> > > >>> would.
> > > >>>
> > > >>> I think Jacques' point about ownership conflicts is not addressed
> > > >>> clearly. Also, the unloading is not clear. The delete command should
> > > >>> probably remove the UDF and unload it.
> > > >>>
> > > >>>
> > > >>> On Fri, Jun 17, 2016 at 11:19 AM, Paul Rogers <prog...@maprtech.com> wrote:
> > > >>>
> > > >>>> Reviewed the spec; many comments posted. Three primary comments for
> > > >>>> the community to consider.
> > > >>>>
> > > >>>> 1. The design conflicts with the Drill-on-YARN project. Is this a
> > > >>>> specific fix for one unique problem, or is it worth expanding the
> > > >>>> solution to work with Drill-on-YARN deployments? Might be hard to
> > > >>>> make the two work together later. See comments in docs for details.
> > > >>>>
> > > >>>> 2. Have we, by chance, looked at how other projects handle code
> > > >>>> distribution? Spark, Storm and others automatically deploy code
> > > >>>> across the cluster; no manual distribution to each node. The key
> > > >>>> difference between Drill and others is that, for Storm, say, code is
> > > >>>> associated with a job (“topology” in Storm terms.)
> > > >>>> But, in Drill, functions are global and have no obvious life cycle
> > > >>>> that suggests when the code can be unloaded.
> > > >>>>
> > > >>>> 3. Have we considered the class loader, dependency and name space
> > > >>>> isolation issues addressed by such products as Tomcat (web apps) or
> > > >>>> Eclipse (plugins)? Putting user code in the same namespace as Drill
> > > >>>> code is quick & dirty. It turns out, however, that doing so leads to
> > > >>>> problems that require long, frustrating debugging sessions to
> > > >>>> resolve.
> > > >>>>
> > > >>>> Addressing item 1 might expand scope a bit. Addressing items 2 and 3
> > > >>>> is a big increase in scope, so I won’t be surprised if we leave those
> > > >>>> issues for later. (Though, addressing item 2 might be the best way to
> > > >>>> address item 1.)
> > > >>>>
> > > >>>> If we want a very simple solution that requires minimal change,
> > > >>>> perhaps we can use an even simpler solution. In the proposed design,
> > > >>>> the user still must distribute code to all the nodes. The primary
> > > >>>> change is to tell Drill to load (or unload) that code. Can we
> > > >>>> accomplish the same result more easily simply by having Drill
> > > >>>> periodically scan certain directories looking for new (or removed)
> > > >>>> jars? Still won’t work with YARN, or solve the name space issues, but
> > > >>>> will work for existing non-YARN Drill users without new SQL syntax.
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> - Paul
> > > >>>>
> > > >>>>> On Jun 16, 2016, at 2:07 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > > >>>>>
> > > >>>>> Two quick thoughts:
> > > >>>>>
> > > >>>>> - (user) In the design document I didn't see any discussion of
> > > >>>>> ownership/conflicts or unloading.
> > > >>>>> Would be helpful to see the thinking there
> > > >>>>> - (dev) There is a row oriented facade via the
> > > >>>>> FieldReader/FieldWriter/ComplexWriter classes. That would be a good
> > > >>>>> place to start when trying to implement an alternative interface.
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>>> Jacques Nadeau
> > > >>>>> CTO and Co-Founder, Dremio
> > > >>>>>
> > > >>>>> On Thu, Jun 16, 2016 at 11:32 AM, John Omernik <j...@omernik.com> wrote:
> > > >>>>>
> > > >>>>>> Honestly, I don't see it as a priority issue. I think some of the
> > > >>>>>> ideas around community java UDFs could be a better approach. I'd
> > > >>>>>> hate to take away from other work to hack in something like this.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Jun 16, 2016 at 1:19 PM, Paul Rogers <prog...@maprtech.com> wrote:
> > > >>>>>>
> > > >>>>>>> Ted refers to source code transformation. Drill gains its speed
> > > >>>>>>> from value vectors. However, VVs are a far cry from the row-based
> > > >>>>>>> interface that most mere mortals are accustomed to using. Since
> > > >>>>>>> VVs are very type specific, code is typically generated to handle
> > > >>>>>>> the specifics of each type. Accessing VVs in Jython may be a bit
> > > >>>>>>> of a challenge because of the "impedance mismatch" between how VVs
> > > >>>>>>> work and the row-and-column view expected by most (non-Drill)
> > > >>>>>>> developers.
> > > >>>>>>>
> > > >>>>>>> I wonder if we've considered providing a row-oriented "facade"
> > > >>>>>>> that can be used by roll-your-own data sources and user-defined
> > > >>>>>>> row transforms? Might be a hiccup in the fast VV pipeline, but
> > > >>>>>>> might be handy for users willing to trade a bit of speed for
> > > >>>>>>> convenience.
> > > >>>>>>> With such a facade, the Jython row transforms that John mentions
> > > >>>>>>> could be quite simple.
> > > >>>>>>>
> > > >>>>>>> On Thu, Jun 16, 2016 at 10:36 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Since UDFs use source code transformation, using Jython would be
> > > >>>>>>>> difficult.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Jun 16, 2016 at 9:42 AM, Arina Yelchiyeva <arina.yelchiy...@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Charles,
> > > >>>>>>>>>
> > > >>>>>>>>> not that I am aware of. The proposed solution doesn't invent
> > > >>>>>>>>> anything new, it just adds the possibility to add UDFs without a
> > > >>>>>>>>> drillbit restart. But contributions are welcome.
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Jun 16, 2016 at 4:52 PM Charles Givre <cgi...@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Arina,
> > > >>>>>>>>>> Has there been any discussion about making it possible via
> > > >>>>>>>>>> Jython or something for users to write simple UDFs in Python?
> > > >>>>>>>>>> My ideal would be to have this capability integrated in the web
> > > >>>>>>>>>> GUI such that a user could write their UDF (in Python) right
> > > >>>>>>>>>> there, submit it and it would be deployed to Drill if it passes
> > > >>>>>>>>>> validation tests.
> > > >>>>>>>>>> -C
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On Jun 16, 2016, at 09:34, Arina Yelchiyeva <arina.yelchiy...@gmail.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Hi all!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I have created a Jira to allow dynamic UDF support in Drill
> > > >>>>>>>>>>> (https://issues.apache.org/jira/browse/DRILL-4726).
> > > >>>>>>>>>>> There is a link to design document in Jira description.
> > > >>>>>>>>>>> Comments or suggestions are welcomed.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Kind regards
> > > >>>>>>>>>>> Arina