Paul Rogers created DRILL-7304:
----------------------------------
Summary: Filter record batch misses schema changes within maps
Key: DRILL-7304
URL: https://issues.apache.org/jira/browse/DRILL-7304
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Paul Rogers
While testing the new row-set based JSON reader, it was found that the Filter
record batch does not properly handle a change of schema within a map. Consider
this test:
{code:java}
@Test
public void adHoc() {
  try {
    client.alterSession(ExecConstants.JSON_ALL_TEXT_MODE, true);
    String sql = "select a from dfs.`jsoninput/drill_3353` where e = true";
    RowSet results = runTest(sql);
    results.print();
    results.clear();
  } finally {
    client.resetSession(ExecConstants.JSON_ALL_TEXT_MODE);
  }
}
{code}
The "drill_3353" directory contains two files. {{a.json}}:
{code:json}
{ a : { b : 1, c : 1 }, e : false }
{ a : { b : 1, c : 1 }, e : false }
{ a : { b : 1, c : 1 }, e : true }
{code}
And {{b.json}}:
{code:json}
{ a : { b : 1, d : 1 }, e : false }
{ a : { b : 1, d : 1 }, e : false }
{ a : { b : 1, d : 1 }, e : true }
{code}
Notice that both files contain the field {{a.b}}, but the first contains
{{a.c}} while the second contains {{a.d}}.
The test is configured to return the schema of each file without any "schema
smoothing." That is, there is a hard schema change between files.
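For reference, the map column {{a}} arrives with a different shape from each file. Roughly (a sketch using the row-set test framework's {{SchemaBuilder}}; the leaf types are an assumption, nullable VARCHAR because of all-text mode):
{code:java}
// Approximate shape of column `a` as read from a.json: a{b, c}
TupleMetadata schemaA = new SchemaBuilder()
    .addMap("a")
      .addNullable("b", MinorType.VARCHAR)
      .addNullable("c", MinorType.VARCHAR)
      .resumeSchema()
    .buildSchema();

// Approximate shape of column `a` as read from b.json: a{b, d}
TupleMetadata schemaB = new SchemaBuilder()
    .addMap("a")
      .addNullable("b", MinorType.VARCHAR)
      .addNullable("d", MinorType.VARCHAR)
      .resumeSchema()
    .buildSchema();
{code}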
The filter record batch fails to notice the schema change, tries to use the
{{b.json}} schema with {{a.json}}, and fails with an exception due to invalid
offset vectors.
The problem is a symptom of this code:
{code:java}
protected boolean setupNewSchema() throws SchemaChangeException {
  ...
  switch (incoming.getSchema().getSelectionVectorMode()) {
    case NONE:
      if (sv2 == null) {
        sv2 = new SelectionVector2(oContext.getAllocator());
      }
      filter = generateSV2Filterer();
      break;
  ...
  if (container.isSchemaChanged()) {
    container.buildSchema(SelectionVectorMode.TWO_BYTE);
    return true;
  }
{code}
That is, if, after calling {{generateSV2Filterer()}}, the schema of the
outgoing container has changed, rebuild the schema. Since we changed a map
structure, the schema should be marked as changed, but it is not.
Digging deeper, the following adds/gets each incoming field to the outgoing
container:
{code:java}
protected Filterer generateSV2Filterer() throws SchemaChangeException {
  ...
  for (final VectorWrapper<?> v : incoming) {
    final TransferPair pair =
        v.getValueVector().makeTransferPair(
            container.addOrGet(v.getField(), callBack));
    transfers.add(pair);
  }
{code}
Now, since the top-level field {{a}} already exists in the container, we'd have
to do a bit of sleuthing to see if its contents changed, but we don't:
{code:java}
public <T extends ValueVector> T addOrGet(final MaterializedField field,
    final SchemaChangeCallBack callBack) {
  final TypedFieldId id =
      getValueVectorId(SchemaPath.getSimplePath(field.getName()));
  final ValueVector vector;
  if (id != null) {
    vector = getValueAccessorById(id.getFieldIds()).getValueVector();
    if (id.getFieldIds().length == 1 &&
        !vector.getField().getType().equals(field.getType())) {
      final ValueVector newVector = TypeHelper.getNewVector(field,
          this.getAllocator(), callBack);
      replace(vector, newVector);
      return (T) newVector;
    }
  ...
{code}
The logic is to check whether we already have a vector with the top-level name.
If so, we check whether the types are equal; if not, we replace the existing
vector (which has {{a\{b, d\}}}) with the new one (which has {{a\{b, c\}}}).
However, when running the code, the {{if}} statement is not triggered, so the
vector is not replaced, and the schema is not marked as changed.
The next question is why {{equals()}} considers the two maps to be the same. It
seems that the Protobuf-generated {{MajorType.equals()}} ignores differences in
child types.
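If that analysis is right, the behavior is easy to see in isolation. A sketch (illustrative only; the member types are assumed, matching the all-text-mode test above):
{code:java}
// Two map fields named "a" with different members, mirroring the two files.
MaterializedField mapBC = MaterializedField.create("a", Types.required(MinorType.MAP));
mapBC.addChild(MaterializedField.create("b", Types.optional(MinorType.VARCHAR)));
mapBC.addChild(MaterializedField.create("c", Types.optional(MinorType.VARCHAR)));

MaterializedField mapBD = MaterializedField.create("a", Types.required(MinorType.MAP));
mapBD.addChild(MaterializedField.create("b", Types.optional(MinorType.VARCHAR)));
mapBD.addChild(MaterializedField.create("d", Types.optional(MinorType.VARCHAR)));

// The MajorType carries no child fields, so the two types compare equal
// even though the map members differ.
boolean sameType = mapBC.getType().equals(mapBD.getType());   // true
{code}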
As it turns out, prior work already ran into this issue in another context, and
a solution is available: {{MaterializedField.isEquivalent()}} already does the
proper checks, including type checks for Union subtypes and member checks for
maps.
Replacing {{MajorType.equals()}} with {{MaterializedField.isEquivalent()}}
fixes the issue.
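For illustration, the change in {{addOrGet()}} would look roughly like this (a sketch of the substitution described above, not the committed patch):
{code:java}
if (id != null) {
  vector = getValueAccessorById(id.getFieldIds()).getValueVector();
  // Compare the full field metadata (including map members and Union
  // subtypes) rather than just the MajorType.
  if (id.getFieldIds().length == 1 &&
      !vector.getField().isEquivalent(field)) {
    final ValueVector newVector = TypeHelper.getNewVector(field,
        this.getAllocator(), callBack);
    replace(vector, newVector);
    return (T) newVector;
  }
  ...
{code}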
Digging deeper, the reason that {{equals()}} says that the two types are
equal is that they are the same object. The input container evolved its map
type as the JSON reader discovered new columns. But, the filter record batch
simply reused the same object. That is, since the two containers share the same
{{MaterializedField}}, changes made by the incoming operator immediately
changed the schema of the filter operator's container. Said another way: there
is no way to detect changes to the contents of maps because the two operators
share metadata.
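The effect can be sketched like this (illustrative variable names, not actual operator code):
{code:java}
// The map field as first discovered by the JSON reader.
MaterializedField incomingMap = MaterializedField.create("a", Types.required(MinorType.MAP));
incomingMap.addChild(MaterializedField.create("b", Types.optional(MinorType.VARCHAR)));

// The filter's outgoing container captures the same object, not a copy.
MaterializedField outgoingMap = incomingMap;

// Later, the reader discovers a new member; both references now describe the new map.
incomingMap.addChild(MaterializedField.create("d", Types.optional(MinorType.VARCHAR)));

// Any comparison of "old" vs. "new" schema now compares the object to itself,
// so no change can ever be detected.
boolean changed = !outgoingMap.isEquivalent(incomingMap);   // always false
{code}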
This problem may be due to the original assumption that {{MaterializedField}}
is immutable. That is true for "scalar" types (Nullable Int, say) but it is
*not* true for complex types.
It appears that this is a design flaw. The best we can do now is hack around it.