Good catch on empty string Veera!

Wouldn't it be cheaper to check for an empty string?
case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd') end

I don't think the option to read csv empty columns (or empty string in any text 
reader) as null is in the reader yet. So we can't check with columns[] is null.


--Andries


> On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram 
> <vnaranammalpu...@maprtech.com> wrote:
> 
> Do you have zero length strings in your data? I have seen cases where the
> system option to cast empty strings to NULL doesn't work as advertised. You
> should re-open DRILL-3214.
> 
> When I run into this problem, I usually use a regex to workaround. The
> PROJECT takes a performance hit when you do this for larger data sets but
> it works.
> 
> $cat nulls.psv
> date_col|string_col
> |test
> 2016-10-28|test2
> $ sqlline
> apache drill 1.8.0
> "a little sql for your nosql"
> 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> +-------------+-------------+
> |  date_col   | string_col  |
> +-------------+-------------+
> |             | test        |
> | 2016-10-28  | test2       |
> +-------------+-------------+
> 2 rows selected (0.303 seconds)
> 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> 
> Fragment 0:0
> 
> [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on se-node10.se.lab:31010]
> (state=,code=0)
> 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted from
> `nulls.psv`;
> +---------------------+
> | date_col_converted  |
> +---------------------+
> | null                |
> | 2016-10-28          |
> +---------------------+
> 2 rows selected (0.521 seconds)
> 0: jdbc:drill:> alter system set
> `drill.exec.functions.cast_empty_string_to_null` = true;
> +-------+----------------------------------------------------------+
> |  ok   |                         summary                          |
> +-------+----------------------------------------------------------+
> | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> +-------+----------------------------------------------------------+
> 1 row selected (0.304 seconds)
> 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> 
> Fragment 0:0
> 
> [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on se-node10.se.lab:31010]
> (state=,code=0)
> 0: jdbc:drill:>
> 
> -Veera
> 
> On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz <kfar...@maprtech.com>
> wrote:
> 
>> All,
>> 
>> Question is - why does it work for a parquet column and fails when CSV
>> column is used ?
>> 
>> Drill 1.9.0 commit : a29f1e29
>> 
>> This is a simple project of column from a csv file, works.
>> {noformat}
>> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv` t1
>> limit 5;
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2011-11-04  |
>> | 1986-10-22  |
>> | 1992-09-10  |
>> | 2016-08-07  |
>> | 1986-01-25  |
>> +-------------+
>> 5 rows selected (0.26 seconds)
>> {noformat}
>> 
>> Using TO_DATE function with columns[x] as first input fails, with an
>> IllegalArgumentException
>> {noformat}
>> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd') FROM
>> `typeall_l.csv` t1 limit 5;
>> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
>> 
>> Fragment 0:0
>> 
>> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on centos-01.qa.lab:31010]
>> (state=,code=0)
>> {noformat}
>> 
>> However, interestingly same query over parquet column returns correct
>> results, on same data.
>> 
>> {noformat}
>> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd') FROM
>> typeall_l limit 5;
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2011-01-04  |
>> | 1986-01-22  |
>> | 1992-01-10  |
>> | 2016-01-07  |
>> | 1986-01-25  |
>> +-------------+
>> 5 rows selected (0.286 seconds)
>> {noformat}
>> 
>> When the date string is passed as first input, to_date function returns
>> correct results.
>> {noformat}
>> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-dd')
>> from (values(1));
>> +-------------+
>> |   EXPR$0    |
>> +-------------+
>> | 2011-01-04  |
>> +-------------+
>> 1 row selected (0.235 seconds)
>> {noformat}
>> 
>> Thanks,
>> Khurram
>> 
> 
> 
> 
> -- 
> Veera Naranammalpuram
> Product Specialist - SQL on Hadoop
> *MapR Technologies (www.mapr.com <http://www.mapr.com>)*
> *(Email) vnaranammalpu...@maprtech.com <naranammalpu...@maprtech.com>*
> *(Mobile) 917 683 8116 - can text *
> *Timezone: ET (UTC -5:00 / -4:00)*

Reply via email to