benj created DRILL-7389:
---------------------------
Summary: JSON empty list avoid Parquet creation
Key: DRILL-7389
URL: https://issues.apache.org/jira/browse/DRILL-7389
Project: Apache Drill
Issue Type: Improvement
Components: Storage - JSON, Storage - Parquet
Affects Versions: 1.16.0
Reporter: benj
With a JSON file with only one row with an empty list as below, it's possible
to request the file but there is an error when trying to create a Parquet
File ANIMALS_1.json:
{code:json}
{"animals":
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_1.json`;
+-------------------------------------------------------------------------------+
| animals
|
+-------------------------------------------------------------------------------+
|
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}
|
+-------------------------------------------------------------------------------+
CREATE TABLE ....`ANIMALS_1_pqt` AS
(SELECT * FROM ....`ANIMALS_1.json`);
=>
Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an
empty group: optional group Horse {}
{code}
But if the json file contains a second line with a non-empty list for "Horse",
it's possible to request file and create the Parquet
File ANIMALS_2.json:
{code:json}
{"animals":
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals":
{"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_2.json`;
+-------------------------------------------------------------------------------+
| animals
+-------------------------------------------------------------------------------+
|
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}}
|
|
{"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}}
|
+-------------------------------------------------------------------------------+
CREATE TABLE ....`ANIMALS_2_pqt` AS
(SELECT * FROM ....`ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0 | 2 |
+----------+---------------------------+
{code}
Many problems appears with this when manipulating multiple JSON with "rare"
value (and when do not master the generation).
It's very annoying to have no possibility push data in parquet where there is
missing/null value in JSON.
The possibility to cast in varchar (DRILL-7375) the data could allow the
parquet storage
In the simple case of the example discussed here, it's possible to change the
type of the input file from JSON to CSV and it will work. But it does not
answer all the problems and it doesn't allow to keep some part in "json" and
some other in "text"
--
This message was sent by Atlassian Jira
(v8.3.4#803005)