[
https://issues.apache.org/jira/browse/DRILL-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186458#comment-16186458
]
Paul Rogers commented on DRILL-5826:
------------------------------------
[~vitalii], we actually have two problems:
* Returning an empty batch if it is the first one,
* Considering only top-level schema when looking for a schema change.
Your suggestion for resolving the first issue is good. Perhaps we can
generalize just a bit:
* If the first batch has no rows, but does have columns, set the batch aside.
* If a batch has no rows, skip it.
* If at the end, we found no batches with rows, then use the set aside batch as
our only (empty) output batch.
The above will ensure that clients such as Tableau receive a schema even if the
result set is empty. This is necessary to provide correct {{LIMIT 0}} results.
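The three rules above could be sketched roughly as follows. This is only an illustration: {{Batch}} and {{EmptyBatchFilter}} are hypothetical stand-ins, not Drill classes, and the real receiver works on a stream of batches rather than a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a record batch: only the two properties
// the rules need. This is not one of Drill's actual classes.
final class Batch {
    final List<String> columns;
    final int rowCount;

    Batch(List<String> columns, int rowCount) {
        this.columns = columns;
        this.rowCount = rowCount;
    }
}

final class EmptyBatchFilter {
    /** Returns the batches to send downstream, following the three rules. */
    static List<Batch> filter(List<Batch> incoming) {
        List<Batch> output = new ArrayList<>();
        Batch setAside = null;
        boolean first = true;
        for (Batch b : incoming) {
            if (b.rowCount == 0) {
                // Rule 1: remember a first batch that has columns but no rows.
                if (first && !b.columns.isEmpty()) {
                    setAside = b;
                }
                // Rule 2: every empty batch is skipped.
            } else {
                output.add(b);
            }
            first = false;
        }
        // Rule 3: no batch had rows, so the set-aside batch is the only output.
        if (output.isEmpty() && setAside != null) {
            output.add(setAside);
        }
        return output;
    }
}
```

With this shape, an empty result set still yields one schema-only batch, which is what a {{LIMIT 0}} client needs.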
For the second problem, we must fix {{RecordBatchLoader}} to recursively
descend into maps when checking for schema changes.
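As a rough sketch of what a recursive check might look like, assuming a simplified schema model: {{Field}}, {{Outcome}} and {{SchemaCheck}} below are hypothetical names for illustration, not the actual {{RecordBatchLoader}} code, and the comparison ignores column order for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified schema model; Drill's real classes differ.
final class Field {
    final String type;  // e.g. "VARBINARY" or "MAP"
    final Map<String, Field> children = new LinkedHashMap<>();  // used by MAP only

    Field(String type) {
        this.type = type;
    }

    Field child(String name, Field f) {
        children.put(name, f);
        return this;
    }
}

// Hypothetical stand-in for the two outcomes discussed in this issue.
enum Outcome { OK, OK_NEW_SCHEMA }

final class SchemaCheck {
    /** True if both schemas have the same named fields and types at every level. */
    static boolean sameSchema(Map<String, Field> a, Map<String, Field> b) {
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        for (Map.Entry<String, Field> e : a.entrySet()) {
            Field fa = e.getValue();
            Field fb = b.get(e.getKey());
            if (!fa.type.equals(fb.type)) {
                return false;
            }
            // Descend into map members instead of stopping at the top level.
            if (!sameSchema(fa.children, fb.children)) {
                return false;
            }
        }
        return true;
    }

    /** Compares the incoming schema with the receiver's current schema. */
    static Outcome outcomeFor(Map<String, Field> current, Map<String, Field> incoming) {
        return sameSchema(current, incoming) ? Outcome.OK : Outcome.OK_NEW_SCHEMA;
    }
}
```

Under this model, going from (row_key, v) with an empty map v to (row_key, v{e0}) is reported as a new schema, because the difference is found one level down.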
> UnorderedReceiverBatch fails to detect a schema change within a map
> -------------------------------------------------------------------
>
> Key: DRILL-5826
> URL: https://issues.apache.org/jira/browse/DRILL-5826
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
>
> Run the following HBase query:
> {code}
> select * from `hbase`.browser_action2 a
> {code}
> Table is defined as:
> {code}
> > create 'browser_action2', 'v', {SPLITS => ['0','1','2','3','4','5','6','7','8','9']}
> ...
> > scan 'browser_action2'
> ROW COLUMN+CELL
>
> 1 column=v:e0, timestamp=1506560555979, value=abc1
> 2 column=v:e0, timestamp=1506560564807, value=abc2
> {code}
> Step through the {{UnorderedReceiverBatch}} with a parallelization of 1.
> Observe the following (behavior is random):
> * The first batch has schema (row_key, v) where v is an empty map
> (corresponding to a column family), but no data (zero rows.)
> * Because the first batch has columns, it is sent downstream with
> {{OK_NEW_SCHEMA}}.
> * The second batch has schema (row_key, v{e0}), where v is a map with column
> e0 (corresponding to a column family with one column) and one row.
> * The code loads the batch, asking the batch itself if it has a new schema.
> * The batch does not have a new schema so returns false.
> * The {{UnorderedReceiverBatch}} returns {{OK}}, indicating to the downstream
> operator that the second batch has the same schema as the first (which, in
> this case, turns out to not be true.)
> Code in question:
> {code}
> final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
> {code}
> In point of fact, each sender has no visibility into the schema of other
> senders, and the order of receiving batches is undefined. Therefore, an input
> batch has no way of knowing if it has the same schema as the previous output
> batch.
> The obvious, correct logic is to compare the incoming batch schema with the
> current receiver schema, and send {{OK}} or {{OK_NEW_SCHEMA}} based on the
> result of that comparison.