But don't call it DELETE. In SQL the opposite of CREATE is DROP.

Julian

> On Jul 25, 2016, at 8:48 AM, Keys Botzum <[email protected]> wrote:
>
> I like the approach to handling DELETE. This is very useful. I think an
> implementation that does not guarantee consistent behavior is perfectly fine
> for use that is targeted at developers who are working on UDFs. As long as
> the docs make the intent clear this makes me very happy.
>
> I'll defer to others more expert than I on the remainder of the design.
>
> Keys
> _______________________________
> Keys Botzum
> Senior Principal Technologist
> [email protected]
> 443-718-0098
> MapR Technologies
> http://www.mapr.com
>
>> On Jul 25, 2016, at 9:55 AM, Arina Yelchiyeva <[email protected]> wrote:
>>
>> Taking into account all previous comments and the discussion we had with Parth
>> and Paul, please find below my design notes (I am going to prepare a proper
>> design document, I just want to see if all agree with the raw version).
>>
>> I propose we use lazy-init for dynamically loaded UDFs: when a user issues
>> the CREATE UDF command, the foreman will only validate the jar and update the
>> ZK function registry, and only if a function is needed will it be loaded on
>> the appropriate drillbit (during the planning stage or fragment execution). We
>> might add listeners (as Paul proposed) to pre-load UDFs, but I didn't
>> include that in the current release to simplify the solution. We might
>> reconsider this.
>>
>> I have looked at the issue with class loading and unloading, and if we ship
>> each jar with its own classloader, DELETE functionality can be introduced in
>> the current release, at least marked as experimental or for developer use
>> only, to ease the UDF development process.
>>
>> Any comments are welcome.
>>
>> *Invariants*
>>
>> 1. DFS staging area where user copies jar to be loaded
>>
>> 2. DFS udf area (former registration area) where all validated jars are
>> present
>>
>> 3. ZK function registry - contains list of all dynamically loaded UDFs and
>> their jars.
>> UDF name will be represented as combination of name and input
>> parameters.
>>
>> 4. Lazy-init - all dynamically loaded UDFs will be loaded to drillbit upon
>> request, i.e. if a drillbit receives a query or fragment that contains such
>> a UDF
>>
>> 5. Currently only CREATE and DELETE statements are supported
>>
>>
>> *Adding UDFs*
>>
>> 1. User copies source and binary (hereinafter jar) to DFS staging area
>> 2. User issues CREATE UDF command
>> 3. Foreman receives request to create UDF:
>>    a) checks if jar is present in staging area
>>    b) copies jar to temporary DFS location
>>    c) validates UDFs present in jar locally:
>>       1) copies jar to temporary local fs
>>       2) scans jar using temporary classloader
>>       3) checks if there are any duplicates in local function registry
>>       4) returns list of UDFs to be registered
>>    d) validates UDFs present in jar in ZK:
>>       1) takes list of dynamically loaded UDFs from ZK
>>       2) checks that there are no duplicates either by jar name or among UDFs
>>       3) moves jar from DFS temporary area to DFS udf area
>>       4) updates ZK with list of new dynamic UDFs
>>       5) removes jar from staging area
>>       6) returns confirmation to user that UDFs were registered
>>
>>
>> *Lazy-init*
>>
>> 1. User issues query with dynamically loaded UDF.
>>
>> 2. During planning stage or fragment execution, if UDF is not present in
>> local function registry, drillbit:
>>
>> a) checks if such UDF is present in ZK function registry
>>
>> b) if present, loads UDF using jar name, otherwise returns an error
>>
>> c) proceeds with planning stage or fragment execution
>>
>>
>> *New drillbit registration / Drillbit re-start*
>>
>> Local udf directory is re-created, to clean up previously loaded jars if any
>>
>>
>> *Delete UDF*
>>
>> Each jar that is going to be loaded dynamically will have its own classloader,
>> which will solve the problem with loading and unloading classes with the same
>> name.
>>
>>
>> 1. User issues DELETE command (delete will operate on jar name level)
>>
>> 2. Foreman receives DELETE request:
>>
>> a) checks if such jar is present in ZK function registry
>>
>> b) creates ephemeral znode /udf/delete/jar_name
>>
>> c) removes record in ZK function registry
>>
>> d) removes jar from DFS udf area
>>
>> e) removes ephemeral znode from /udf/delete/jar_name
>>
>> f) returns confirmation to user that UDFs were deleted
>>
>> 3. Drillbits are subscribed to /udf/delete znode; when new znode with jar
>> name appears, drillbit:
>>
>> a) removes all UDFs associated with jar name from local function registry
>>
>> b) removes jar from local udf directory
>>
>>
>> *Limitations*
>>
>> 1. When user runs DELETE command, some queries that are using deleted UDFs
>> may fail during fragment execution if by that time UDF has been deleted
>> from local registry. Ideally, before submitting DELETE command, user needs
>> to make sure no one is running queries using UDFs from that particular jar.
>>
>>
>> 2. We encourage users not to delete any jars from DFS udf area manually, as
>> it may lead to inconsistency between ZK function registry and DFS udf area.
>>
>>
>> 3. CREATE statement is not atomic in the part where we copy the validated jar
>> to DFS udf area and update ZK function registry with list of new UDFs. In case
>> of failure between these two steps, some unused jars may be left in DFS udf
>> area but they won’t harm current process. LIST JARS command can be
>> introduced to show used jars.
>>
>>
>> Kind regards
>> Arina
>>
>>> On Fri, Jul 22, 2016 at 7:15 PM Keys Botzum <[email protected]> wrote:
>>>
>>> No disagreement on deferral but I raised my initial concern precisely
>>> because I'm concerned about the practicality of the "restart the cluster"
>>> option. I cited my concerns about laptops and development clusters. I
>>> was wondering if there might be some small things Drill could do to help.
>>> If there is nothing that can be done to make this easier, so be it, but I
>>> think that's going to be a big impediment.
>>>
>>> Keys
>>>
>>>>> On Jul 22, 2016, at 1:37 AM, Neeraja Rentachintala <[email protected]> wrote:
>>>>
>>>> It seems like we are reaching a conclusion here in terms of starting with a
>>>> simpler implementation, i.e. being able to deploy UDFs dynamically without
>>>> Drillbit restarts based off jars in a DFS location. Dropping functions
>>>> dynamically is out of scope for version 1 of this feature (we assume
>>>> development of UDFs is happening on a user laptop or a dev cluster where
>>>> it's ok to have a restart).
>>>>
>>>> -Neeraja
>>>>
>>>>> On Thu, Jul 21, 2016 at 11:56 AM, Keys Botzum <[email protected]> wrote:
>>>>>
>>>>> Recognize the difficulty. Not suggesting this be addressed in first
>>>>> version. Just suggesting some thought about how a real user will
>>>>> work around it. Maybe some doc and/or small changes can make this easier.
>>>>>
>>>>> Keys
>>>>>
>>>>>> On Jul 21, 2016 1:45 PM, "Paul Rogers" <[email protected]> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Adding a dynamic DROP would, of course, be a great addition! The reason
>>>>>> for suggesting we skip that was to control project scope.
>>>>>>
>>>>>> Dynamic DROP requires a synchronization step. Here’s the scenario:
>>>>>>
>>>>>> * Foreman A starts a query using UDF U.
>>>>>> * Foreman B receives a request to drop UDF U, followed by a request to add
>>>>>> a new version of U, U’.
>>>>>>
>>>>>> How do we drop a function that may be in use? There are some tricky bits
>>>>>> to work out, which seemed too overwhelming to consider all in one go.
>>>>>>
>>>>>> Clearly just dropping U and adding a new version of U with the same name
>>>>>> leads to issues if not synchronized. If a Drillbit D is running a query
>>>>>> with U when it receives notice to drop U, should D complete the query or
>>>>>> fail it? If the query completes, then how does D deal with the request to
>>>>>> register U’, which has the same name?
>>>>>>
>>>>>> Do we globally synchronize function deletion? (The foreman B that receives
>>>>>> the drop request waits for all queries using U to finish.) But, how do we
>>>>>> know which queries use U?
>>>>>>
>>>>>> An eventually consistent approach is to track the age of the oldest
>>>>>> running query. Suppose B drops U at time T. Any query received after T that
>>>>>> uses U will fail in planning. A new U’ can’t be registered until all
>>>>>> queries that started before T complete.
>>>>>>
>>>>>> The primary challenge we face in both the CREATE and DROP cases is that
>>>>>> Drill is distributed with little central coordination. That’s great for
>>>>>> scale, but makes it hard to design features that require coordination. Some
>>>>>> other tools solve this problem with a data dictionary (or “metastore”).
>>>>>> Alas, Drill does not have such a concept. So a seemingly simple feature
>>>>>> like dynamic UDF becomes a major design challenge to get right.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Paul
>>>>>>
>>>>>>>> On Jul 21, 2016, at 7:21 AM, Neeraja Rentachintala <[email protected]> wrote:
>>>>>>>
>>>>>>> The whole point of this feature is to avoid Drill cluster restarts, as the
>>>>>>> name 'Dynamic' UDFs indicates.
>>>>>>> So any design that requires restarts would, I think, defeat the purpose.
>>>>>>>
>>>>>>> I also think this is an example of a feature where we start with a simple
>>>>>>> design to serve the purpose, take feedback on how it is being deployed/used
>>>>>>> in real user situations and improve it in subsequent releases.
>>>>>>>
>>>>>>> -thanks
>>>>>>> Neeraja
>>>>>>>
>>>>>>>> On Thu, Jul 21, 2016 at 6:32 AM, Keys Botzum <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think there are a lot of great ideas here. My one concern is the lack of
>>>>>>>> unload and thus presumably replace functionality. I'm just thinking about
>>>>>>>> typical actual usage.
>>>>>>>>
>>>>>>>> In a typical development cycle someone writes something, tries it, learns,
>>>>>>>> changes it, and tries again. Assuming I understand the design that change
>>>>>>>> step requires a full Drill cluster restart. That is going to be very
>>>>>>>> disruptive and will make UDF work nearly impossible without a dedicated
>>>>>>>> "private" cluster for Drill. I realize that people should have access to
>>>>>>>> the data they need and Drill in a development cluster but even then
>>>>>>>> restarts can be hard since development clusters are often shared - and
>>>>>>>> that's assuming such a cluster exists. I realize of course Drill can be run
>>>>>>>> as a standalone Drillbit but I'm not convinced that desktops will have
>>>>>>>> adequate access to the needed data.
>>>>>>>>
>>>>>>>> Having dealt with Java classloading over the years, I'm not claiming class
>>>>>>>> replacement is an easy thing so I'll defer to others on the priority of
>>>>>>>> that, but I'm wondering if there isn't some way to make UDF experimentation
>>>>>>>> a bit easier/practical.
>>>>>>>>
>>>>>>>> Given the above, let me toss out some possibly naive ideas that maybe are
>>>>>>>> workable:
>>>>>>>> * can I easily run a standalone Drillbit on a Hadoop cluster node that is
>>>>>>>> already running Drill servers? I'm sure this can be done, but is it easy?
>>>>>>>> Could we perhaps make this clearer as an explicit kind of thing?
>>>>>>>> * is there a way that when I deploy a UDF I can constrain the # of bits it
>>>>>>>> is loaded into and perhaps even specify the bits?
>>>>>>>> * Obvious corollary is I'd want my query to run on those bits and a
>>>>>>>> not too disruptive way to restart just those bits
>>>>>>>>
>>>>>>>> The above may be obvious to Drill experts. If it is then perhaps the UDF
>>>>>>>> docs could just point out how to easily develop UDFs in an iterative
>>>>>>>> fashion.
>>>>>>>>
>>>>>>>> Keys
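
For readers following the thread, the CREATE / lazy-init / DELETE flow from Arina's design notes can be sketched roughly as below. This is only an illustration under stated assumptions, not Drill code: an in-memory dict stands in for the ZK function registry, jar validation and the DFS copy steps are omitted, and the names `ZkRegistry`, `Drillbit`, `signature`, `resolve` and `on_delete` are hypothetical.

```python
from dataclasses import dataclass, field


def signature(name, *arg_types):
    """UDF key as described in the notes: name combined with input parameters."""
    return f"{name}({','.join(arg_types)})"


@dataclass
class ZkRegistry:
    """Hypothetical stand-in for the ZK function registry: jar name -> UDF signatures."""
    jars: dict = field(default_factory=dict)

    def register(self, jar, signatures):
        # CREATE: reject duplicates by jar name or among UDFs before updating the registry.
        if jar in self.jars:
            raise ValueError(f"jar already registered: {jar}")
        taken = {s for sigs in self.jars.values() for s in sigs}
        clash = taken & set(signatures)
        if clash:
            raise ValueError(f"duplicate UDFs: {sorted(clash)}")
        self.jars[jar] = set(signatures)

    def jar_for(self, sig):
        # Find the jar that provides a given UDF signature, if any.
        for jar, sigs in self.jars.items():
            if sig in sigs:
                return jar
        return None

    def delete(self, jar):
        # DELETE operates on the jar name level.
        if jar not in self.jars:
            raise KeyError(f"unknown jar: {jar}")
        del self.jars[jar]


class Drillbit:
    """Hypothetical drillbit holding a local function registry, one entry per loaded jar."""

    def __init__(self, zk):
        self.zk = zk
        self.local = {}  # jar name -> signatures loaded from that jar

    def resolve(self, sig):
        # Lazy-init: consult the shared registry only if the UDF is not loaded locally.
        for jar, sigs in self.local.items():
            if sig in sigs:
                return jar
        jar = self.zk.jar_for(sig)
        if jar is None:
            raise LookupError(f"unknown UDF: {sig}")
        self.local[jar] = set(self.zk.jars[jar])
        return jar

    def on_delete(self, jar):
        # Reaction to a delete notification: drop every UDF that came from this jar.
        self.local.pop(jar, None)
```

Note that in this sketch a query that resolved a UDF before `on_delete` ran would still fail afterwards, which mirrors the first limitation listed in the design notes.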
