Re: Schema problems trying to convert JSON to Parquet
Hi David,

Your b.json file has only nulls, so there is no way for Drill to determine what type those nulls are. Drill requires each NULL to be a null of some type. Drill often guesses nullable INT, which is why you saw the problem in your query.

If all your fields are strings, there is a workaround: use "all text mode" when reading your JSON file [1]. Another workaround is to design the files so that the first record always has a non-null value for every field, so Drill can guess the type correctly. If any fields contain other types, you can use all text mode, then CAST the non-text fields from text to the proper type.

This is a known limitation, not just with Drill but with any schema-free system. See DRILL-6035 [2]. At the present time, Drill supports no kind of schema to help resolve this ambiguity (though having one would be the ideal solution to this kind of problem).

Thanks,
- Paul

[1] https://drill.apache.org/docs/json-data-model/
[2] https://issues.apache.org/jira/browse/DRILL-6035
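[For reference, the all-text-mode workaround Paul describes might look roughly like this; the `dfs.tmp` target and the `/data/Cities` path are illustrative, not from the thread:]

```sql
-- Read every JSON field as text, so an all-null column cannot be
-- guessed as nullable INT.
ALTER SESSION SET `store.json.all_text_mode` = true;

-- Then cast any genuinely non-text fields back to their proper types
-- during the CTAS. Here zip is kept as VARCHAR; a numeric field would
-- instead get e.g. CAST(col AS INT).
CREATE TABLE dfs.tmp.`cities_parquet` AS
SELECT city,
       CAST(zip AS VARCHAR) AS zip
FROM dfs.`/data/Cities`;
```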
RE: Schema problems trying to convert JSON to Parquet
Ideally Drill could be enhanced so you could pass in a schema definition using some spec like http://json-schema.org/examples.html.
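[A schema definition along the lines David suggests might look like this for the Cities example from the original message; this is purely illustrative, since Drill does not currently accept such a spec:]

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "city": { "type": "string" },
    "zip":  { "type": ["string", "null"] }
  }
}
```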
RE: Schema problems trying to convert JSON to Parquet
Unfortunately, the JSON source files I'm trying to convert into nested Parquet have 4,000+ possible keys with multiple levels of nesting. It would be ideal if you could inject the schema definition into a Drill query instead of relying on schema learning. Like:

Contact
    First name
    Last name
    Education
        School[]
        Degrees[]
    Work Experience
        Company[]
        Title
        Years
    Addresses
        Address[]
            Street
            City
            State
            Zip
Re: Schema problems trying to convert JSON to Parquet
This is a challenge when dealing with JSON. You can either force the data type in the CTAS statement (likely the better option) or deal with the data type change in the parquet table(s) by using CAST, etc.

In the case of zip codes, you need to consider whether they will be 5 digits or the extended 5-4 digits to decide if the data type should be INT or VARCHAR. Also look into the TYPEOF function, which you can use with CASE to deal with these types of issues.

I prefer to deal with data issues as early as possible in the pipeline, so the tables you create are consistent and clean.

--Andries
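[A minimal sketch of the two approaches Andries mentions, using the Cities data from the original message; the `dfs` paths and table names are illustrative:]

```sql
-- Option 1: force the type at CTAS time so every parquet file
-- agrees on zip's type (VARCHAR also accommodates extended 5-4 zips).
CREATE TABLE dfs.tmp.`cities_parquet` AS
SELECT city,
       CAST(zip AS VARCHAR) AS zip
FROM dfs.`/data/Cities`;

-- Option 2: inspect what type Drill inferred per record, which can
-- then be combined with CASE to normalize mismatched values.
SELECT city, typeof(zip) AS zip_type
FROM dfs.`/data/Cities/b.json`;
```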
Schema problems trying to convert JSON to Parquet
Using Drill's CTAS statement I've run into a schema inconsistency issue, and I'm not sure how to solve it.

CREATE TABLE name [ (column list) ] AS query;

If I have a directory called Cities containing JSON files that look like:

a.json:
{ "city": "San Francisco", "zip": "94105" }
{ "city": "San Jose", "zip": "94088" }

b.json:
{ "city": "Toronto", "zip": null }
{ "city": "Montreal", "zip": null }

If I create a parquet file out of the Cities directory, I end up with files called 1_0_0.parquet through 1_5_1.parquet, and now I have a problem: most of the parquet files have a column type of char for zip, but some of the parquet files have a column type of int for zip because the zip value for that group of records was NULL. This produces schema change errors later when trying to query the parquet directory.

Is it possible for Drill to do a better job learning schemas across all json files in a directory before creating parquet?