Ya, I've found that repo and there are some others out there. Having a
place for them vs. having a community with some rules/guidelines are
different beasts altogether. If I am overthinking this, though, and the
community aspect is overkill for something like this, that's valid and
honest feedback. I guess I am looking for opinions on something like this;
I'd be happy to champion it, but I want to see if there is at least some
initial interest around it.

John

On Wed, May 25, 2016 at 7:00 PM, Vince Gonzalez <vgonza...@mapr.com> wrote:

> John,
>
> Some MapR folk have collected some UDFs here:
> https://github.com/mapr-demos/simple-drill-functions
>
> Additions via pull request are welcome, I'm sure.
>
> On Wednesday, May 25, 2016, John Omernik <j...@omernik.com> wrote:
>
> > So Charles, here's how I'd set it up. (I am not tied to this, as I would
> > love to have others from the Drill community feel like it's an open
> > community a la Apache; however, I am not sure of the best way to approach
> > that.)
> >
> > So, if it were up to me, and this is just spitballing:
> >
> > 1. Create a GitHub repo (I'd use my account just because, but if it makes
> > sense to put it under the Apache one, I am not tied to that).
> > 2. Create a README that describes what we have.
> >    - I think UDFs should be grouped into folders under the repo; think of
> > these as "groups of UDFs." This is a human-based grouping that makes it
> > easy to organize by some general types, say string or language processing.
> > I'm not sure of the best way to approach this, but I want to make it a
> > little bit grouped, rather than flat, to keep it easy to navigate.
> >    - Each UDF would have its own folder.
> >    - We could create basic requirements for UDFs to be accepted: perhaps
> > certain tests, a README.md, a LICENSE (we'd need people to submit under
> > the Apache license), and a package.info (explained below). The README
> > would have data about how to use the UDF, etc.
> >    - package.info: this would be a file containing a JSON record with the
> > UDF's name, description, how to use it, and tags. It's kind of like the
> > grouping by folders, but it's used from an install perspective and from a
> > package-management perspective (see the sketch after this list).
> > 3. We won't keep jars in the repo, only source. But we will include a
> > Dockerfile that will be as small as possible, and this will be used to
> > build, on demand, any UDF that someone wants to install. Thus we can
> > ensure the UDFs build well on anyone's system AND that people who want to
> > use UDFs don't have to be Java experts.
> > 4. The package manager could have settings, like your Drill install
> > directory, and basically it would build and install any UDF you want. Now,
> > to keep things simple, the package manager can use the tags on the UDFs to
> > determine which UDFs to build and then deploy, so you could build and
> > install UDFs by tag (say "build all with tag X," so you don't have to pick
> > individual ones) or you could build by name.
> > 5. The package manager would have list and search features that would use
> > the description, name, and tags to help you search through the packages
> > and provide a list of packages. This could be a "pre" step prior to
> > installing, allowing you to search and install only what's needed based on
> > what you want.
> > 6. We can remove packages based on the install dir.
> > 7. How do we handle deployment across nodes? Shared locations are great,
> > or we could create "install packages," i.e., after the build we can bundle
> > all jars into a tgz that can be deployed, etc.
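> >
> > As a rough illustration of the package.info idea (the JSON key spellings
> > and the example UDF name are just my guesses, not a settled format), it
> > might look something like:
> >
> > ~~~
> > {
> >   "name": "date_repair",
> >   "description": "Repairs zeroed-out month/day values in date strings",
> >   "how_to_use": "SELECT date_repair(dob) FROM ...",
> >   "tags": ["string", "date", "cleanup"]
> > }
> > ~~~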
> >
> > Shrug, perhaps it's a bit verbose, but the idea here is we want to
> > encourage people to submit here, we want issues to be tracked, and we want
> > to have one place to send folks.
> >
> > I would still like to use the Drill user list for discussion (at first),
> > but if the UDF discussion grows to be too much noise, we'd need a new
> > list. All UDFs would have to be Apache licensed, and like I said, maybe we
> > prove this out with the idea that we can get this moved to Apache. I am
> > not sure, does Apache do "related projects"? I.e., this on its own may not
> > be an Apache project, but keeping it within the Apache guidelines would be
> > really cool.
> >
> > So, that's a lot of stuff, but I am trying to toss out ideas more for
> > critique/discussion.
> >
> > So please, critique/discuss :)
> >
> > John
> >
> >
> >
> >
> >
> > On Wed, May 25, 2016 at 12:11 PM, MattK <m...@hybriddba.com> wrote:
> >
> >
> > >> UDFs scare me in that the only Java I've conquered is evident from my
> > >> empty french press...
> > >>
> > >
> > > Same issue here. I have solved this on other platforms by pre-processing
> > > the data with a set of regex replacements in Awk:
> > >
> > > ~~~
> > > # "Repair" invalid dates as stored by MySQL (3 replacements for
> > > # readability; no slower than one nested)
> > > $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
> > > $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
> > > $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
> > > ~~~
> > >
> > > But of course this adds another step to the pipeline. Perhaps something
> > > similar could be implemented via
> > > https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
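> > >
> > > For instance, a rough, untested sketch (assuming the raw value comes in
> > > as columns[0] from a CSV like in the example below, and that Drill's
> > > regexp_replace uses Java-style $1 backreferences), handling just the
> > > month-00 case:
> > >
> > > ~~~
> > > select to_date(
> > >          regexp_replace(columns[0],
> > >                         '^([0-9]{4})-00-([0-9]{2})$', '$1-01-$2'),
> > >          'yyyy-MM-dd') as dob
> > > from `data.csv`;
> > > ~~~
> > >
> > > The other two repairs from the Awk example would just nest additional
> > > regexp_replace calls the same way.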
> > >
> > >
> > >
> > >
> > > On 25 May 2016, at 12:55, John Omernik wrote:
> > >
> > >> Cool, I wasn't aware of SIMILAR TO (I learned something). However, that
> > >> doesn't work here, because the bad data has the same shape as the good:
> > >> '____-__-__' matches both 2015-04-02 and 2015-00-23, but the 00 month
> > >> is bad data.
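> > >> Though I suppose the pattern itself could be tightened so the bad
> > >> months fall through to the NULL branch; a rough, untested sketch
> > >> (assuming Drill's SIMILAR TO supports the standard SQL regex subset of
> > >> alternation and {n} repetition):
> > >>
> > >> ~~~
> > >> select case
> > >>          when columns[0] similar to
> > >>               '[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])'
> > >>          then to_date(columns[0], 'yyyy-MM-dd')
> > >>          else null
> > >>        end as dob
> > >> from `data.csv`;
> > >> ~~~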
> > >>
> > >> UDFs scare me in that the only Java I've conquered is evident from my
> > >> empty
> > >> french press...
> > >>
> > >> I know I've brought it up in the past, but has anyone seen any community
> > >> around UDFs start? I'd love to have a community that follows Apache-like
> > >> rules and allows us to create and track UDFs to share... that would be
> > >> pretty neat. I guess if we were to do something like that, should one of
> > >> us (I can volunteer) just start a GitHub project and encourage folks to
> > >> come to the table, or is there a better way via Apache to do something
> > >> like that?
> > >>
> > >> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
> > >> vnaranammalpu...@maprtech.com> wrote:
> > >>
> > >>> You could write a UDF. Or you could do something like this:
> > >>>
> > >>> cat data.csv
> > >>> 05/25/2016
> > >>> 20160525
> > >>> May 25th 2016
> > >>>
> > >>> 0: jdbc:drill:> select case when columns[0] similar to '__/__/____'
> > >>> then to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to
> > >>> '________' then to_date(columns[0],'yyyyMMdd') else NULL end
> > >>> from `data.csv`;
> > >>> +-------------+
> > >>> |   EXPR$0    |
> > >>> +-------------+
> > >>> | 2016-05-25  |
> > >>> | 2016-05-25  |
> > >>> | null        |
> > >>> +-------------+
> > >>> 3 rows selected (0.4 seconds)
> > >>> 0: jdbc:drill:>
> > >>>
> > >>> -Veera
> > >>>
> > >>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <vgonza...@mapr.com>
> > >>> wrote:
> > >>>
> > >>>> Sounds like a job for a UDF?
> > >>>>
> > >>>> You could do the try/catch inside the UDF.
> > >>>>
> > >>>>  ----
> > >>>>  Vince Gonzalez
> > >>>>  Systems Engineer
> > >>>>  212.694.3879
> > >>>>
> > >>>>  mapr.com
> > >>>>
> > >>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <j...@omernik.com>
> > >>>> wrote:
> > >>>>
> > >>>>> I have some DOBs, and some fields are empty, others apparently were
> > >>>>> filled by trained monkeys, but while most data is accurate, some
> > >>>>> data is not.
> > >>>>>
> > >>>>> As you saw from my other post, I am trying to get the age for those
> > >>>>> DOBs that are valid...
> > >>>>>
> > >>>>> My function works, until I get to a record that is not valid and I
> > >>>>> get something like this:
> > >>>>>
> > >>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for
> > >>>>> monthOfYear must be in the range [1,12]
> > >>>>>
> > >>>>> Is there a good "Try -> Except" type solution that will grant me the
> > >>>>> valid data if things worked, and just return 0 or whatever I specify
> > >>>>> if there is an error?
> > >>>>>
> > >>>>> I could try casting the data, but if it fails won't it kill my query?
> > >>>>> Basically I want it to keep going if it fails... not sure if Drill
> > >>>>> has this ability, but thought I would ask.
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Veera Naranammalpuram
> > >>> Product Specialist - SQL on Hadoop
> > >>> *MapR Technologies (www.mapr.com)*
> > >>> *(Email) vnaranammalpu...@maprtech.com*
> > >>> *(Mobile) 917 683 8116 - can text *
> > >>> *Timezone: ET (UTC -5:00 / -4:00)*
> > >>>
> > >>>
> >
>
>
> --
>  ----
>  Vince Gonzalez
>  Systems Engineer
>  212.694.3879
>
>  mapr.com
>
