[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265 ]
Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:04 AM: -------------------------------------------------------------- h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} h4. BIGINT vs. FLOAT8 Types JSON has a single numeric type which can hold integers, floats or decimals. Drill attemps to infer the type of the field from the format of the first number encountered for a field using the following rules: * If the number consists entirely of digits, it is an integer, stored as {{BIGINT}}. * Else, it is a floating point number stored as {{FLOAT8}}. Drill 1.13 adds the following rule: * If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}. That is, the following array is stored as {{FLOAT8}}: {code} [1.2, 3, 5.0] {code} Drill is easily fooled, however, rearrange the values and Drill will raise an error: {code} [3, 1.2, 5.0] {code} In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails when trying to store the float value 1.2. To work around this problem, a user may set the {{`store.json.read_numbers_as_double`}} system/session property to read all numbers as {{FLOAT8}}, even if they appear to be integers. was (Author: paul.rogers): h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} h4. BIGINT vs. FLOAT8 Types JSON has a single numeric type which can hold integers, floats or decimals. Drill attemps to infer the type of the field from the format of the first number encountered for a field using the following rules: * If the number consists entirely of digits, it is an integer, stored as {{BIGINT}}. * Else, it is a floating point number stored as {{FLOAT8}}. Drill 1.13 adds the following rule: * If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}. That is, the following array is stored as {{FLOAT8}}: {code} [1.2, 3, 5.0] {code} Drill is easily fooled, however, rearrange the values and Drill will raise an error: {code} [3, 1.2, 5.0] {code} In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails when trying to store the float value 1.2. > Specify Drill's JSON behavior > ----------------------------- > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.13.0 > Reporter: Paul Rogers > Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)