benj created DRILL-7389:
---------------------------

             Summary: JSON empty list avoid Parquet creation
                 Key: DRILL-7389
                 URL: https://issues.apache.org/jira/browse/DRILL-7389
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - JSON, Storage - Parquet
    Affects Versions: 1.16.0
            Reporter: benj


With a JSON file with only one row with an empty list as below, it's possible 
to request the file but there is an error when trying to create a Parquet
 File ANIMALS_1.json:
{code:json}
{"animals": 
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_1.json`;
+-------------------------------------------------------------------------------+
|                                animals                                        
|
+-------------------------------------------------------------------------------+
| 
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}
 |
+-------------------------------------------------------------------------------+

CREATE TABLE ....`ANIMALS_1_pqt` AS 
(SELECT * FROM ....`ANIMALS_1.json`);
=>
Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an 
empty group: optional group Horse {}
{code}
 
 But if the json file contains a second line with a non-empty list for "Horse", 
it's possible to request file and create the Parquet
 File ANIMALS_2.json:
{code:json}
{"animals": 
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals": 
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_2.json`;
+-------------------------------------------------------------------------------+
|                                     animals                          
+-------------------------------------------------------------------------------+
| 
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}
 |
| 
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}}
 |
+-------------------------------------------------------------------------------+

CREATE TABLE ....`ANIMALS_2_pqt` AS 
(SELECT * FROM ....`ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 2                         |
+----------+---------------------------+
{code}
 
Many problems appears with this when manipulating multiple JSON with "rare" 
value (and when do not master the generation).

It's very annoying to have no possibility push data in parquet where there is 
missing/null value in JSON. 
The possibility to cast in varchar (DRILL-7375) the data could allow the 
parquet storage 
 
In the simple case of the example discussed here, it's possible to change the 
type of the input file from JSON to CSV and it will work. But it does not 
answer all the problems and it doesn't allow to keep some part in "json"  and 
some other in "text"

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to