[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ]
Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:39 AM: -------------------------------------------------------------- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. h4. Late Type Identification As described earlier, Drill will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. > Specify Drill's JSON behavior > ----------------------------- > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.13.0 > Reporter: Paul Rogers > Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)