Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/594
Three general rules to keep in mind in the current JSON reader
implementation:
* Drill can remember the past. (Once a type as been seen for a column,
Drill will remember that type.)
* Drill cannot predict the future. (If a type has not been seen for a
column by the end of a record batch, Drill cannot predict what type will appear
in some later batch.)
* Drill can amend the past within a single record batch. (If a batch starts
with nulls, but later a type is seen, the previous values are automatically
filled with nulls.)
Actual implementation of the JSON reader, and the value writers that form
the implementation, is complex. As we read JSON values, we ask a type-specific
writer to set that value into the value vector. Each writer marks the column as
non-null, then adds the value. Any values not so set will default to null.
Consider a file with five null "c1" values followed by a string value "foo"
for that field. The five nulls are ignored. When we see the non-null c1, the
code creates a VarChar vector and sets the 6th value to the string "foo". Doing
so automatically marks the previous five column values as null.
Suppose we have a file with a single string value "foo" for column "c1",
followed by five nulls. In this case, the first value creates and sets the
VarChar vector as before. Later, at the end of reading the record batch, the
reader sets the record count for the vectors. This action, on the VarChar
vector, has the effect of setting the trailing five column values to null.
Since values default to null, we get this behavior, and the previous, for
free. The result is that if a record batch contains even a single non-null
value for a field, that column will be fully populated with nulls for all other
records in the same batch.
This gets us back to the same old problem in Drill: if all we see are
nulls, Drill needs to know, "null of what type" while in JSON the value is just
null. The JIRA tickets linked to this ticket all related to that same
underlying issue.
There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479,
DRILL-3806 and more.
This fix affects only "all text mode." This means that, regardless of the
JSON type, create a VarChar column. Doing so provides a very simple fix. Since
all columns are VarChar, when we see a new column, with a null value, just
create a VarChar column. (No need to set the column to null.)
That is, we can "predict the future" for nulls because *all* columns are
VarChar -- so there is not much to predict.
Otherwise, we have to stick with Jacques' design decision in DRILL-1256:
"Drill's perspective is a non-existent column and a column with no value are
equivalent." A record batch of all nulls, followed by a record batch with a
non-null value, will cause a schema change.
Again, Drill needs a "null" type that is compatible with all other types in
order to support JSON semantics. (And, needs to differentiate between
value-exists-and-is-null and value-does-not-exist.)
Yet another solution is to have the user tell us their intent. The [JSON
Schema](http://jsonschema.net) project provides a way to express the expected
schema so that Drill would know up front the type of each column (and whether
the column is really nullable.)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---