[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM:
--------------------------------------------------------------

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["a", "string"], null, ["another"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to