[ 
https://issues.apache.org/jira/browse/DRILL-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186458#comment-16186458
 ] 

Paul Rogers commented on DRILL-5826:
------------------------------------

[~vitalii], we actually have two problems:

* Returning an empty batch if it is the first one,
* Considering only top-level schema when looking for a schema change.

Your suggestion for resolving the first issue is good. Perhaps we can 
generalize just a bit:

* If the first batch has no rows, but does have columns, set the batch aside.
* If any later batch has no rows, skip it.
* If, at the end, we found no batches with rows, then use the set-aside batch as 
our only (empty) output batch.

The above will ensure that clients such as Tableau receive a schema even if the 
result set is empty. This is necessary to provide correct {{LIMIT 0}} results.
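
The three rules above can be sketched roughly as follows. This is only an 
illustration, not Drill code: {{Batch}} and {{EmptyBatchFilter}} are 
hypothetical stand-ins, and a real receiver would process batches as they 
arrive rather than from a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a received batch: column names plus a row count.
final class Batch {
    final List<String> columns;
    final int rowCount;

    Batch(List<String> columns, int rowCount) {
        this.columns = columns;
        this.rowCount = rowCount;
    }
}

// Hypothetical helper; not part of Drill.
final class EmptyBatchFilter {
    static List<Batch> filter(List<Batch> incoming) {
        List<Batch> out = new ArrayList<>();
        Batch setAside = null;
        boolean first = true;
        for (Batch b : incoming) {
            if (b.rowCount == 0) {
                // Rule 1: remember an empty first batch that carries columns.
                if (first && !b.columns.isEmpty()) {
                    setAside = b;
                }
                // Rule 2: empty batches are never passed along directly.
                first = false;
                continue;
            }
            first = false;
            out.add(b);
        }
        // Rule 3: nothing had rows, so emit the set-aside batch for its schema.
        if (out.isEmpty() && setAside != null) {
            out.add(setAside);
        }
        return out;
    }
}
```

With this shape, a result set made up only of empty batches still yields one 
batch that carries a schema.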

For the second problem, we must fix {{RecordBatchLoader}} to recursively 
descend into maps when checking for schema changes.
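
A minimal sketch of such a recursive check, to show the idea only. It does 
not use the actual {{RecordBatchLoader}} API: {{Field}} and {{SchemaDiff}} 
are hypothetical, the type names are placeholders, and the comparison assumes 
that column order matters.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical column description: a name, a type, and child fields for maps.
final class Field {
    final String name;
    final String type;           // placeholder type name, e.g. "MAP"
    final List<Field> children;  // empty for non-map columns and empty maps

    Field(String name, String type, Field... children) {
        this.name = name;
        this.type = type;
        this.children = Arrays.asList(children);
    }
}

// Hypothetical helper; not part of Drill.
final class SchemaDiff {
    // True only if both schemas match at every nesting level.
    static boolean sameSchema(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equals(y.name) || !x.type.equals(y.type)) {
                return false;
            }
            // Descend into map members instead of stopping at the top level.
            if (!sameSchema(x.children, y.children)) {
                return false;
            }
        }
        return true;
    }

    // Compare the incoming schema with the receiver's current schema.
    static String outcome(List<Field> current, List<Field> incoming) {
        return sameSchema(current, incoming) ? "OK" : "OK_NEW_SCHEMA";
    }
}
```

For the case in this issue, (row_key, v) with an empty map v followed by 
(row_key, v{e0}) differs only inside the map, which a top-level comparison 
misses and the recursive one reports as {{OK_NEW_SCHEMA}}.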

> UnorderedReceiverBatch fails to detect a schema change within a map
> -------------------------------------------------------------------
>
>                 Key: DRILL-5826
>                 URL: https://issues.apache.org/jira/browse/DRILL-5826
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>
> Run the following HBase query using:
> {code}
> select * from `hbase`.browser_action2 a
> {code}
> Table is defined as:
> {code}
> > create 'browser_action2', 'v', {SPLITS => 
> > ['0','1','2','3','4','5','6','7','8','9']}
> ...
> > scan 'browser_action2'
> ROW                                   COLUMN+CELL                             
>                                                                   
>  1                                    column=v:e0, timestamp=1506560555979, 
> value=abc1                                                          
>  2                                    column=v:e0, timestamp=1506560564807, 
> value=abc2
> {code}
> Step through the {{UnorderedReceiverBatch}} with a parallelization of 1. 
> Observe the following (behavior is random):
> * The first batch has schema (row_key, v) where v is an empty map 
> (corresponding to a column family), but no data (zero rows.)
> * Because the first batch has columns, it is sent downstream with 
> {{OK_NEW_SCHEMA}}.
> * The second batch has schema (row_key, v{e0}), where v is a map with column 
> e0 (corresponding to a column family with one column) and one row.
> * The code loads the batch, asking the batch itself if it has a new schema.
> * The batch does not have a new schema so returns false.
> * The {{UnorderedReceiverBatch}} returns {{OK}}, indicating to the downstream 
> operator that the second batch has the same schema as the first (which, in 
> this case, turns out to not be true.)
> Code in question:
> {code}
>       final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
> {code}
> In point of fact, each sender has no visibility into the schema of other 
> senders, and the order of receiving batches is undefined. Therefore, an input 
> batch has no way of knowing if it has the same schema as the previous output 
> batch.
> The obvious, correct logic is to compare the incoming batch schema with the 
> current receiver schema, and send {{OK}} or {{OK_NEW_SCHEMA}} based on the 
> result of that comparison.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
