Re: Is there a good way to handle bad date data?

John Omernik Wed, 25 May 2016 10:42:16 -0700

So Charles, here's how I'd set it up. (I am not tied to this, as I would love
for others from the Drill community to feel like it's an open community a la
Apache; however, I am not sure of the best way to approach it.)


So, if it were up to me (and this is just spitballing):

1. Create a GitHub repo (I'd use my account just because, but if it makes
sense under the Apache one, I am not tied to it).
2. Create a README that describes what we have.
   - I think UDFs should be grouped into folders under the repo; think of
these as "groups of UDFs". This is a human-based grouping that makes it
easy to organize by some general types, say string or language processing.
I'm not sure of the best way to approach this, but I want it a little bit
grouped, rather than flat, to make it easy.
   - Each UDF would have its own folder.
   - We could create basic requirements for UDFs to be accepted, perhaps
certain tests, a README.md, a LICENSE (we'd need people to submit under the
Apache License), and a package.info (explained below). The README would
have certain data about how to use the UDF, etc.
   - package.info: this would be a file holding a JSON record with
name, description, how to use, and tags. It's kind of like the grouping by
folders, but it's used from an install perspective and from a package
management perspective (see below).
3. We won't keep jars in the repo, only source. But we will include a
Dockerfile that will be as small as possible, and this will be used to
build on demand any UDF that someone wants to install. Thus we can ensure
the UDFs build well on anyone's system AND that people who want to use UDFs
don't have to be Java experts.
4. The package manager could have settings, like your Drill install
directory, and basically it would build and install any UDF you want. Now,
to keep things simple, the package manager can use the tags on the UDFs to
determine which UDFs to build and then deploy, so you could build and
install UDFs by tag (so you can say "build all with tag X" and don't have
to name individual ones) or you could build by name.
5. The package manager would have list and search features that would use
the description, name and tags to help you search through the packages, and
provide a list of packages. This could be a "pre" step prior to installing,
allowing you to search and only install what's needed based on what you
want.
6. We can remove packages based on the install dir.
7. How do we handle this across nodes? Shared locations are great, or we
could create "install packages", i.e. after the build we can bundle all jars
into a tgz that can be deployed, etc.
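
To make items 2, 4, and 5 a bit more concrete, here is a minimal Python
sketch of how a package manager might read package.info records and support
"install by tag" and search. Everything in it is an assumption for
illustration: the field names, the folder paths, and the clean_date and
word_count UDFs are hypothetical, and nothing here is an existing Drill tool.

```python
import json

# Hypothetical package.info records keyed by their path in the repo.
# The fields (name, description, usage, tags) follow the list above,
# but the exact schema is an assumption.
PACKAGE_INFO_FILES = {
    "date/clean_date/package.info": """
        {"name": "clean_date",
         "description": "Return NULL instead of failing on an invalid date",
         "usage": "clean_date(col, 'yyyy-MM-dd')",
         "tags": ["date", "cleanup"]}
    """,
    "language/word_count/package.info": """
        {"name": "word_count",
         "description": "Count whitespace-separated words in a string",
         "usage": "word_count(col)",
         "tags": ["string", "language"]}
    """,
}

REQUIRED_FIELDS = ("name", "description", "usage", "tags")


def load_packages(files):
    """Parse each package.info record; reject any missing a required field."""
    packages = []
    for path, text in sorted(files.items()):
        record = json.loads(text)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"{path} is missing fields: {missing}")
        packages.append(record)
    return packages


def select_by_tag(packages, tag):
    """Names of all packages carrying the tag (the 'build by tag' case)."""
    return [p["name"] for p in packages if tag in p["tags"]]


def search(packages, term):
    """Case-insensitive match of term against name, description, and tags."""
    term = term.lower()
    hits = []
    for p in packages:
        haystack = " ".join([p["name"], p["description"], *p["tags"]]).lower()
        if term in haystack:
            hits.append(p["name"])
    return hits


packages = load_packages(PACKAGE_INFO_FILES)
print(select_by_tag(packages, "date"))
print(search(packages, "string"))
```

A real tool would read the package.info files from the repo folders and hand
the selected names to the Docker build step; that part is left out here.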

Shrug, perhaps it's a bit verbose, but the idea here is we want to
encourage people to submit here, we want issues to be tracked, and we want
to have one place to send folks.

I would still like to use the Drill user list for discussion (at first), but
if the UDF discussion grows to be too much noise, we'd need a new list.
All UDFs would have to be Apache licensed, and like I said, maybe we prove
this out with the idea that we can get this moved to Apache. I am not sure:
does Apache do "related projects"? I.e. this on its own may not be an
Apache project, but keeping it within the Apache guidelines would be really
cool.

So, that's a lot of stuff, but I am trying to toss out ideas more for
critique/discussion.

So please, critique/discuss :)

John





On Wed, May 25, 2016 at 12:11 PM, MattK <[email protected]> wrote:

>> UDFs scare me in that the only Java I've conquered is evident from my empty
>> french press...
>>
>
> Same issue here. I have solved this in other platforms by pre-processing
> the data with a set of regex replacements in Awk:
>
> ~~~
> # "Repair" invalid dates as stored in MySQL (3 replacements for
> # readability; no slower than one nested call)
> $0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
> $0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
> $0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
> ~~~
>
> But of course this adds another step in the pipeline. Perhaps something
> similar could be implemented via
> https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
>
>
>
>
> On 25 May 2016, at 12:55, John Omernik wrote:
>
>> Cool, I wasn't aware of SIMILAR TO (I learned something). However, that
>> doesn't work because my data is accurate in format, i.e. both 2015-04-02
>> and 2015-00-23 match '____-__-__', but 00 doesn't work (bad data).
>>
>> UDFs scare me in that the only Java I've conquered is evident from my
>> empty french press...
>>
>> I know I've brought it up in the past, but has anyone seen any community
>> around UDFs start? I'd love to have a community that follows Apache-like
>> rules and allows us to create and track UDFs to share... that would be
>> pretty neat. I guess if we were to do something like that, should one of
>> us (I can volunteer) just start a GitHub project and encourage folks to
>> come to the table, or is there a better way via Apache to do something
>> like that?
>>
>> On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <
>> [email protected]> wrote:
>>
>>> You could write a UDF. Or you could do something like this:
>>>
>>> cat data.csv
>>> 05/25/2016
>>> 20160525
>>> May 25th 2016
>>>
>>> 0: jdbc:drill:> select case
>>>   when columns[0] similar to '__/__/____' then to_date(columns[0],'MM/dd/yyyy')
>>>   when columns[0] similar to '________' then to_date(columns[0],'yyyyMMdd')
>>>   else NULL end from `data.csv`;
>>> +-------------+
>>> |   EXPR$0    |
>>> +-------------+
>>> | 2016-05-25  |
>>> | 2016-05-25  |
>>> | null        |
>>> +-------------+
>>> 3 rows selected (0.4 seconds)
>>> 0: jdbc:drill:>
>>>
>>> -Veera
>>>
>>> On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <[email protected]>
>>> wrote:
>>>
>>>> Sounds like a job for a UDF?
>>>>
>>>> You could do the try/catch inside the UDF.
>>>>
>>>>  ----
>>>>  Vince Gonzalez
>>>>  Systems Engineer
>>>>  212.694.3879
>>>>
>>>>  mapr.com
>>>>
>>>> On Wed, May 25, 2016 at 11:05 AM, John Omernik <[email protected]>
>>>> wrote:
>>>>
>>>>> I have some DOBs, and some fields are empty others apparently were filled
>>>>> by trained monkeys, but while most data is accurate, some data is not.
>>>>>
>>>>> As you saw from my other post, I am trying to get the age for those DOBs
>>>>> that are valid...
>>>>>
>>>>> My function works, until I get to a record that is not valid and I get
>>>>> something like this:
>>>>>
>>>>> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
>>>>> must be in the range [1,12]
>>>>>
>>>>>
>>>>> Is there a good "Try -> Except" type solution that will grant me the valid
>>>>> data if things worked, and just return 0 or whatever I specify if there is
>>>>> an error?
>>>>>
>>>>> I could try casting the data, but if it fails won't it kill my query?
>>>>> Basically I want it to keep going if it fails... not sure if Drill has this
>>>>> ability, but thought I would ask.
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Veera Naranammalpuram
>>> Product Specialist - SQL on Hadoop
>>> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
>>> *(Email) [email protected] <[email protected]>*
>>> *(Mobile) 917 683 8116 - can text *
>>> *Timezone: ET (UTC -5:00 / -4:00)*
>>>
>>>
