[
https://issues.apache.org/jira/browse/DRILL-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186308#comment-16186308
]
Paul Rogers commented on DRILL-5826:
------------------------------------
The problem is subtle. The {{RecordBatchLoader}} class treats two schemas as
the same if their top-level columns are identical. In the case of HBase, this
means that both batches have the top-level schema (VARBINARY, MAP).
In this particular use case, however, the map contents differ:
* MAP{}
* MAP{e0: VARBINARY}
Because the {{RecordBatchLoader}} class treats these as the same, it recognizes
no schema change and continues to use the empty map as the output schema.
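As a minimal sketch of the difference between a top-level comparison and one that also descends into map members, consider the following. The {{Field}} and {{SchemaCompareSketch}} classes are hypothetical stand-ins written for illustration; they are not Drill's actual classes or the {{RecordBatchLoader}} implementation, and the comparison is assumed to be position-sensitive.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a column definition; not a Drill class.
final class Field {
    final String name;
    final String type;          // e.g. "VARBINARY" or "MAP"
    final List<Field> children; // map members; empty for non-map columns

    Field(String name, String type, Field... children) {
        this.name = name;
        this.type = type;
        this.children = Collections.unmodifiableList(Arrays.asList(children));
    }

    // Top-level check: compares name and type only, ignoring map members.
    boolean sameTopLevel(Field other) {
        return name.equals(other.name) && type.equals(other.type);
    }

    // Deep check: also compares map members, recursively and in order.
    boolean sameDeep(Field other) {
        if (!sameTopLevel(other) || children.size() != other.children.size()) {
            return false;
        }
        for (int i = 0; i < children.size(); i++) {
            if (!children.get(i).sameDeep(other.children.get(i))) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical helper; the outcome names mirror the statuses discussed in this issue.
final class SchemaCompareSketch {
    enum Outcome { OK, OK_NEW_SCHEMA }

    // Mirrors the behavior described above: map contents are not inspected.
    static boolean sameShallow(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (!a.get(i).sameTopLevel(b.get(i))) {
                return false;
            }
        }
        return true;
    }

    // Compares every column including nested map members.
    static boolean sameDeep(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            if (!a.get(i).sameDeep(b.get(i))) {
                return false;
            }
        }
        return true;
    }

    // Compares the incoming batch schema with the current output schema.
    static Outcome classify(List<Field> current, List<Field> incoming) {
        return sameDeep(current, incoming) ? Outcome.OK : Outcome.OK_NEW_SCHEMA;
    }
}
```

With the two schemas from this issue, (VARBINARY, MAP{}) and (VARBINARY, MAP{e0: VARBINARY}), {{sameShallow}} reports them as equal while {{classify}} yields {{OK_NEW_SCHEMA}}.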
> UnorderedReceiverBatch fails to detect a schema change
> ------------------------------------------------------
>
> Key: DRILL-5826
> URL: https://issues.apache.org/jira/browse/DRILL-5826
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
>
> Run the following HBase query:
> {code}
> select * from `hbase`.browser_action2 a
> {code}
> Table is defined as:
> {code}
> > create 'browser_action2', 'v', {SPLITS =>
> > ['0','1','2','3','4','5','6','7','8','9']}
> ...
> > scan 'browser_action2'
> ROW COLUMN+CELL
>
> 1 column=v:e0, timestamp=1506560555979,
> value=abc1
> 2 column=v:e0, timestamp=1506560564807,
> value=abc2
> {code}
> Step through the {{UnorderedReceiverBatch}} with a parallelization of 1.
> Observe the following (behavior is random):
> * The first batch has schema (row_key, v), where v is an empty map
> (corresponding to a column family), but no data (zero rows).
> * Because the first batch has columns, it is sent downstream with
> {{OK_NEW_SCHEMA}}.
> * The second batch has schema (row_key, v{e0}), where v is a map with column
> e0 (corresponding to a column family with one column) and one row.
> * The code loads the batch, asking the batch itself if it has a new schema.
> * The batch does not have a new schema so returns false.
> * The {{UnorderedReceiverBatch}} returns {{OK}}, indicating to the downstream
> operator that the second batch has the same schema as the first (which, in
> this case, turns out not to be true).
> Code in question:
> {code}
> final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
> {code}
> In point of fact, each sender has no visibility into the schema of other
> senders, and the order in which batches are received is undefined. Therefore,
> an input batch has no way of knowing whether it has the same schema as the
> previous output batch.
> The obvious, correct logic is to compare the incoming batch schema with the
> current receiver schema, and send {{OK}} or {{OK_NEW_SCHEMA}} based on the
> result of that comparison.