Apache Drill 1.12.0

2018-02-23 Thread Robles, Edgardo
Hi,

I am evaluating Apache Drill and have run into the following issue with CTAS 
against a large table in an RDBMS.

I am using Apache Drill 1.12.0 on an Ubuntu 16.04.3 VM (8 cores, 8 GB of RAM, 
all patches applied) with Oracle Java 1.8.0_161-b12, and the PostgreSQL 
9.4.1212 JDBC driver connecting to Greenplum 4.3.
I enabled the RDBMS storage plugin to connect to Greenplum.


I start Drill using drill-embedded, but it also fails under sqlline.  I have 
tried a few older versions of the JDBC driver and get the same error message.
The view contains 25M rows; the same query works with smaller tables.  Can 
Apache Drill chunk the data while processing?

The environment is set to Parquet storage format.

I ran the following SQL (example):

create table dfs.data.table1(col1, col2, col3, col4) partition by (col4) as
SELECT col1, col2, col3, col4
FROM bdl.schema.view_in_greenplum
order by col4, col3;

This uses 100% CPU until it fails with the following error:

2018-02-23 10:20:00,681 [256fd5fc-efb8-3504-d08a-0fdcb662f9d6:frag:0:0] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
java.lang.OutOfMemoryError: Java heap space
at java.lang.String.toCharArray(String.java:2899) ~[na:1.8.0_161]
at java.util.zip.ZipCoder.getBytes(ZipCoder.java:78) ~[na:1.8.0_161]
at java.util.zip.ZipFile.getEntry(ZipFile.java:316) ~[na:1.8.0_161]
at java.util.jar.JarFile.getEntry(JarFile.java:240) ~[na:1.8.0_161]
at java.util.jar.JarFile.getJarEntry(JarFile.java:223) ~[na:1.8.0_161]
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042) ~[na:1.8.0_161]
at sun.misc.URLClassPath.getResource(URLClassPath.java:239) ~[na:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:365) ~[na:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:362) ~[na:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_161]
at java.net.URLClassLoader.findClass(URLClassLoader.java:361) ~[na:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[na:1.8.0_161]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338) ~[na:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[na:1.8.0_161]
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2122) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:303) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:289) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:266) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:233) ~[postgresql-9.4.1212.jar:9.4.1212]
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.drill.exec.store.jdbc.JdbcRecordReader.setup(JdbcRecordReader.java:177) ~[drill-jdbc-storage-1.12.0.jar:1.12.0]
at org.apache.drill.exec.physical.impl.ScanBatch.getNextReaderIfHas(ScanBatch.java:242) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:166) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134) ~[drill-java-exec-1.12.0.jar:1.12.0]
at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164) ~[drill-java-exec-1.12.0.jar:1.12.0]


This also fails when running on Windows 2012 R2 with Oracle JDK 1.8.0_152 and 
the same PostgreSQL driver.

Thank you for any assistance you can provide.

Edgardo Robles
Dell EMC | Big Data Operations
edgardo.robl...@emc.com




Base64 encoded results for Drill query (>10.000 columns)

2018-02-23 Thread Daniel Müller

Hi there,

I started working with Drill a few weeks ago and I'm still wondering why 
the query results are Base64 encoded...


I found this ticket, which also handles this situation: 
https://issues.apache.org/jira/browse/DRILL-4620


Of course, there's the CONVERT_FROM function to translate the results 
into readable form, but as I'm working on a very dynamic HBase 
table (>10,000 columns, with new columns added every day), I can't call 
this convert function on each of those columns.
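
For reference, the per-column workaround being described looks something like 
this (a minimal sketch; the table name `my_table`, the column family `f`, and 
the qualifier `col1` are hypothetical):

SELECT CONVERT_FROM(row_key, 'UTF8')    AS row_key,
       CONVERT_FROM(t.f.`col1`, 'UTF8') AS col1
FROM hbase.`my_table` t;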


Is there another way to convert all returned columns automatically? 
Thank you!


Best regards
Daniel Müller



Re: Apache Drill 1.12.0

2018-02-23 Thread Khurram Faraaz
How many unique values does col4 have in bdl.schema.view_in_greenplum?
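
One way to check that, querying through the same storage plugin (a trivial 
sketch):

SELECT COUNT(DISTINCT col4)
FROM bdl.schema.view_in_greenplum;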


Thanks,

Khurram


From: Robles, Edgardo 
Sent: Friday, February 23, 2018 8:27:59 AM
To: user@drill.apache.org
Subject: Apache Drill 1.12.0


Schema problems trying to convert JSON to Parquet

2018-02-23 Thread Lee, David
Using Drill's CTAS statements I've run into a schema inconsistency issue and 
I'm not sure how to solve it.

CREATE TABLE name [ (column list) ] AS query;  

If I have a directory called Cities that contains JSON files which look like:

a.json:
{ "city":"San Francisco", "zip":"94105"}
{ "city":"San Jose", "zip":"94088"}

b.json:
{ "city":"Toronto ", "zip": null}
{ "city":"Montreal", "zip" null}

If I create Parquet files out of the Cities directory I will end up with files 
called:

1_0_0.parquet through 1_5_1.parquet
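
The conversion itself is just a CTAS over the directory; a minimal sketch, 
assuming a writable dfs.tmp workspace and a hypothetical /data/Cities path:

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.cities AS
SELECT city, zip
FROM dfs.`/data/Cities`;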

Now I have a problem:

Most of the parquet files have a column type of char for zip.
Some of the parquet files have a column type of int for zip because the zip 
value for a whole group of records was NULL.

This produces schema change errors later when trying to query the parquet 
directory.

Is it possible for Drill to do a better job of learning schemas across all JSON 
files in a directory before creating Parquet?







Re: Schema problems trying to convert JSON to Parquet

2018-02-23 Thread Andries Engelbrecht
This is a challenge when dealing with JSON. You can either force the data type 
in the CTAS statement (likely the better option) or deal with the data type 
change in the parquet table(s) by using CAST, etc. In the case of zip codes you 
need to consider whether it will be 5 digits or the extended 5-4 format to 
decide if the data type should be INT or VARCHAR.

Also look into the TYPEOF function, which you can use with CASE to deal with 
these types of issues.
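
A rough sketch of both suggestions; the dfs.tmp workspace and /data/Cities path 
are placeholders, not part of the original question:

-- Force the type at CTAS time (likely the better option):
CREATE TABLE dfs.tmp.cities AS
SELECT city, CAST(zip AS VARCHAR(10)) AS zip
FROM dfs.`/data/Cities`;

-- Or branch on the runtime type with TYPEOF and CASE:
SELECT city,
       CASE WHEN TYPEOF(zip) = 'NULL' THEN CAST(NULL AS VARCHAR(10))
            ELSE CAST(zip AS VARCHAR(10))
       END AS zip
FROM dfs.`/data/Cities`;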

I prefer to deal with data issues as soon as possible in the pipeline, so the 
tables you create are consistent and clean.

--Andries






RE: Schema problems trying to convert JSON to Parquet

2018-02-23 Thread Lee, David
Unfortunately the JSON source files I'm trying to convert into nested Parquet 
have 4,000+ possible keys with multiple levels of nesting. It would be ideal 
if you could inject the schema definition into a Drill query instead of relying 
on schema learning.

Like:

Contact
    First name
    Last name
Education
    School[]
    Degrees[]
Work Experience
    Company[]
        Title
        Years
Addresses
    Address[]
        Street
        City
        State
        Zip



RE: Schema problems trying to convert JSON to Parquet

2018-02-23 Thread Lee, David
Ideally Drill could be enhanced so you can pass in a schema definition using 
some spec like:

http://json-schema.org/examples.html




Re: Schema problems trying to convert JSON to Parquet

2018-02-23 Thread Paul Rogers
Hi David,
Your b.json file has only nulls; there is no way for Drill to determine what 
type of null is in your file. Drill requires each NULL to be a null of some 
type. Often, Drill guesses nullable int, which is why you saw the problem in 
your query.

If all your fields are strings, there is a workaround: use the "all-text-mode" 
when reading your JSON file. [1] Another workaround is to design the files so 
that the first record always has a non-null value for every field so Drill can 
guess the type correctly.

If any fields contain other types, you can use all text mode, then cast the 
non-text fields from text to the proper type.
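
A sketch of the all-text-mode approach; the option name is per the JSON data 
model docs [1], while the dfs.tmp workspace and /data/Cities path are assumed:

ALTER SESSION SET `store.json.all_text_mode` = true;

-- Every field, including zip, now reads as VARCHAR, so the schema stays
-- consistent across files; cast any genuinely numeric fields back to their
-- proper type in the CTAS select list:
CREATE TABLE dfs.tmp.cities AS
SELECT city, zip
FROM dfs.`/data/Cities`;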
This is a known limitation: not just with Drill, but with any schema-free 
system. See DRILL-6035 [2].

At the present time, Drill supports no kind of schema to help resolve this kind 
of ambiguity. (Though, having one would be the ideal solution to this kind of 
problem.)

Thanks,
- Paul

[1] https://drill.apache.org/docs/json-data-model/
[2] https://issues.apache.org/jira/browse/DRILL-6035


