[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

paul-rogers Mon, 20 Feb 2017 15:59:18 -0800

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/594
  
    Three general rules to keep in mind in the current JSON reader 
implementation:
    
    * Drill can remember the past. (Once a type as been seen for a column, 
Drill will remember that type.)
    * Drill cannot predict the future. (If a type has not been seen for a 
column by the end of a record batch, Drill cannot predict what type will appear 
in some later batch.)
    * Drill can amend the past within a single record batch. (If a batch starts 
with nulls, but later a type is seen, the previous values are automatically 
filled with nulls.)
    
    Actual implementation of the JSON reader, and the value writers that form 
the implementation, is complex. As we read JSON values, we ask a type-specific 
writer to set that value into the value vector. Each writer marks the column as 
non-null, then adds the value. Any values not so set will default to null.
    
    Consider a file with five null "c1" values followed by a string value "foo" 
for that field. The five nulls are ignored. When we see the non-null c1, the 
code creates a VarChar vector and sets the 6th value to the string "foo". Doing 
so automatically marks the previous five column values as null.
    
    Suppose we have a file with a single string value "foo" for column "c1", 
followed by five nulls. In this case, the first value creates and sets the 
VarChar vector as before. Later, at the end of reading the record batch, the 
reader sets the record count for the vectors. This action, on the VarChar 
vector, has the effect of setting the trailing five column values to null.
    
    Since values default to null, we get this behavior, and the previous, for 
free. The result is that if a record batch contains even a single non-null 
value for a field, that column will be fully populated with nulls for all other 
records in the same batch.
    
    This gets us back to the same old problem in Drill: if all we see are 
nulls, Drill needs to know, "null of what type" while in JSON the value is just 
null. The JIRA tickets linked to this ticket all related to that same 
underlying issue.
    
    There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479, 
DRILL-3806 and more.
    
    This fix affects only "all text mode." This means that, regardless of the 
JSON type, create a VarChar column. Doing so provides a very simple fix. Since 
all columns are VarChar, when we see a new column, with a null value, just 
create a VarChar column. (No need to set the column to null.)
    
    That is, we can "predict the future" for nulls because *all* columns are 
VarChar -- so there is not much to predict.
    
    Otherwise, we have to stick with Jacques' design decision in DRILL-1256: 
"Drill's perspective is a non-existent column and a column with no value are 
equivalent." A record batch of all nulls, followed by a record batch with a 
non-null value, will cause a schema change.
    
    Again, Drill needs a "null" type that is compatible with all other types in 
order to support JSON semantics. (And, needs to differentiate between 
value-exists-and-is-null and value-does-not-exist.)
    
    Yet another solution is to have the user tell us their intent. The [JSON 
Schema](http://jsonschema.net) project provides a way to express the expected 
schema so that Drill would know up front the type of each column (and whether 
the column is really nullable.)




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

Reply via email to