So Charles here's how I'd set it up (I am not tied to this as I would love to have others from Drill community feel like it's an open community ala Apache, however, I am not sure the best way to approach)
So, up to me, and this is just spitballing 1. Create a Github repo (I'd use my account just because, but if it makes sense under the apache one I am not tied to it) 2. Create a Readme that describes what we have - I think UDFs should be grouped into folders under the repo, thing of these as "groups of UDFs" This is a human based grouping that makes it easy to organize by some general types, say string, language processing. Not sure the best way to approach this, but I want to make it a little bit grouped, rather than flat to make it easy. - Each UDF would have it's own folder. - We could create a basic requirements for UDFs to be accepted, perhaps certain tests, a README.md, LICENSE (we'd need people to submit to the apache license) package.info, (explained below). Readme would have certain data about how to use etc. - package.info Here would be a file that has a json record that has name, description, how to use, and tags. It's kinda like the grouping by folders, but it's used from an install perspective and from a package management perspective. See below) 3. We won't keep jars in the repo, only source. But we will include a docker file that will be as small as possible, and this will be used to build on demand and UDF that someone wants to install. Thus we can ensure the UDFs build well on anyone system AND that people who want to use UDFs don't have to be Java Experts 4. The package manager could have settings, like your Drill install directory, and basically, it would build and install any UDF you want Now, to keep things simple, the package manager can use the tags on the UDFs to determine which udfs to build and then deploy, so you could build install UDFs by tag (so you can say build all with tag X so you don't have to individual ones) or you could build by name. 5. The package manager would have list and search features that would use the description, name and tags to help you search through the packages, and provide a list of packages. This could be a "pre" step prior to installing allowing you to search, and only install whats needed based on what you want. 6. We can remove packages based on the install dir. 7. How to we handle across nodes? Shared locations are great, or we could create "install packages" i.e. after build we can bundle all jars into tgz that can be deployed etc. Shrug, perhaps it's a bit verbose, but the idea here is we want to encourage people to submit here, we want issues to be tracked, and we want to have one place to send folks. I would still like to use the drill user list for discussion (at first) but if it the UDF discussion grows to be to much noise, we'd need a new list. All UDFs would have to be Apache Licensed, and like I said, maybe we prove this out with the idea that we can get this moved to Apache. I am not sure, does Apache do "related projects" I.e. this on its own may not be an Apache project, but to keep it within the Apache guidelines would be really cool. So, that's a lot of stuff, but I am trying to toss out ideas more for critique/discussion. So please, critique/discuss :) John On Wed, May 25, 2016 at 12:11 PM, MattK <m...@hybriddba.com> wrote: > UDFs scare me in that the only Java I've conquered is evident from my empty >> french press... >> > > Same issue here. I have solved this in other platforms by pre-processing > the data with a set of regex replacements in Awk: > > ~~~ > # "Repair" invalid dates as stored in MySQL (3 replacements for > readability no slower than one nested) > $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0) > $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0) > $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0) > ~~~ > > But of course this adds another step in the pipeline. Perhaps something > similar to could be implemented via > https://drill.apache.org/docs/string-manipulation/#regexp_replace ? > > > > > On 25 May 2016, at 12:55, John Omernik wrote: > > Cool, I wasn't aware of SIMILAR to (I learned something) However, that >> doesn't work because my data is accurate i.e. '____-__-__' 2015-04-02 and >> 2015-00-23 but 00 doesn't work (bad data) . >> >> UDFs scare me in that the only Java I've conquered is evident from my >> empty >> french press... >> >> I know I've brought it up in the past, but has anyone seen any community >> around UDFs start? I'd love to have a community that follows Apache like >> rules, and allows us to create and track UDFs to share... that would be >> pretty neat. I guess if we were to do something like that, should one of >> us (I can volunteer) just start a Github project and encourage folks to >> come to the table or is there better way via Apache to do something like >> that? >> >> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram < >> vnaranammalpu...@maprtech.com> wrote: >> >> You could write a UDF. Or you could do something like this: >>> >>> cat data.csv >>> 05/25/2016 >>> 20160525 >>> May 25th 2016 >>> >>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then >>> to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' >>> then >>> to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`; >>> +-------------+ >>> | EXPR$0 | >>> +-------------+ >>> | 2016-05-25 | >>> | 2016-05-25 | >>> | null | >>> +-------------+ >>> 3 rows selected (0.4 seconds) >>> 0: jdbc:drill:> >>> >>> -Veera >>> >>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vgonza...@mapr.com> >>> wrote: >>> >>> Sounds like a job for a UDF? >>>> >>>> You could do the try/catch inside the UDF. >>>> >>>> ---- >>>> Vince Gonzalez >>>> Systems Engineer >>>> 212.694.3879 >>>> >>>> mapr.com >>>> >>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <j...@omernik.com> >>>> wrote: >>>> >>>> I have some DOBs, and some fields are empty others apparently were >>>>> >>>> filled >>> >>>> by trained monkeys, but while most data is accurate, some data is not. >>>>> >>>>> As you saw from my other post, I am trying to get the age for those >>>>> >>>> DOBs >>> >>>> that are valid... >>>>> >>>>> My function works, until I get to a record that is not valid and I get >>>>> something like this: >>>>> >>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for >>>>> >>>> monthOfYear >>> >>>> must be in the range [1,12] >>>>> >>>>> >>>>> Is there a good "Try -> Except" type solution that will grant me the >>>>> >>>> valid >>>> >>>>> data if things worked, and just return 0 or whatever I specify if there >>>>> >>>> is >>>> >>>>> an error? >>>>> >>>>> I could try casting the data, but if it fails won't it kill my query? >>>>> Basically I want it to keep going if it fails... not sure if Drill has >>>>> >>>> this >>>> >>>>> ability, but thought I would ask. >>>>> >>>>> >>>> >>> >>> >>> -- >>> Veera Naranammalpuram >>> Product Specialist - SQL on Hadoop >>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)* >>> *(Email) vnaranammalpu...@maprtech.com <naranammalpu...@maprtech.com>* >>> *(Mobile) 917 683 8116 - can text * >>> *Timezone: ET (UTC -5:00 / -4:00)* >>> >>>