[ 
https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034946#comment-16034946
 ] 

Paul Rogers commented on DRILL-5546:
------------------------------------

In general, I agree with the proposal. The only suggestion might be to change 
the emphasis.

In looking carefully at the readers, we see that an empty result set (empty 
batch) is a natural outcome of reading. Some files just happen to be empty. If 
filters are pushed down, then some files just happen to have no matching rows.

Readers produce two distinct kinds of empty result sets:

* *Empty result set*: The reader found no data, but was able to find a schema. 
(Example: Parquet with a filter push-down or a JDBC query that returns no 
results.)
* *Null result set*: The reader found no data *and* no schema. (Example: empty 
CSV or JSON file.)
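
To make the distinction concrete, here is a minimal sketch in Java. The names {{ReaderResult}} and {{ResultKind}} are hypothetical and exist only for illustration; they are not Drill classes. The sketch assumes a schema can be modeled as a list of column names, with {{null}} standing for "no schema found".

```java
import java.util.List;

// Hypothetical types for illustration only; these are not Drill classes.
enum ResultKind { NULL_RESULT, EMPTY_RESULT, DATA }

final class ReaderResult {
    final List<String> schema;   // column names; null means no schema was found
    final int rowCount;

    ReaderResult(List<String> schema, int rowCount) {
        this.schema = schema;
        this.rowCount = rowCount;
    }

    ResultKind kind() {
        if (schema == null) {
            // No data and no schema, e.g. an empty CSV or JSON file.
            return ResultKind.NULL_RESULT;
        }
        // A schema with zero rows, e.g. a pushed-down filter that matched nothing.
        return rowCount == 0 ? ResultKind.EMPTY_RESULT : ResultKind.DATA;
    }
}
```

The point of the sketch is only that the two cases are distinguishable by whether a schema is present, not by the row count alone.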


Note that filters also can produce an empty result set (if no rows match).

The Drill iterator protocol should be able to handle both kinds. It is perhaps 
a bit naive to expect that every operator has both a schema and a data set.

All operators should be able to identify, and handle, both null and empty 
result sets.

For the scanner, if one reader returns a null result set, just skip it and move 
to the next reader until a schema is found. If no reader has a non-null result 
set, then that branch of the query has no data (and no schema). That result 
should bubble up, with each operator handling the case depending on semantics. 
For example, a filter ignores the null result set. A UNION ALL skips that 
result set when assembling the result. A join handles the case depending on the 
side of the join and INNER/OUTER semantics, and so on.
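
A rough sketch of the scanner and UNION ALL cases described above, in Java. {{ScanInput}} and {{ScanSketch}} are hypothetical names, not Drill classes, and the sketch again assumes a {{null}} schema models a null result set. Joins are left out because their handling depends on the join side and type.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the output of one reader or one input branch.
final class ScanInput {
    final List<String> schema;   // null models a null result set
    final int rowCount;

    ScanInput(List<String> schema, int rowCount) {
        this.schema = schema;
        this.rowCount = rowCount;
    }
}

final class ScanSketch {
    // Scanner: skip null result sets and return the first schema any reader found.
    // An empty Optional means this branch of the query has no data and no schema.
    static Optional<List<String>> firstSchema(List<ScanInput> readers) {
        for (ScanInput reader : readers) {
            if (reader.schema != null) {
                return Optional.of(reader.schema);
            }
        }
        return Optional.empty();
    }

    // UNION ALL: a null result set contributes nothing to the combined result.
    static int unionAllRowCount(List<ScanInput> inputs) {
        int total = 0;
        for (ScanInput input : inputs) {
            if (input.schema == null) {
                continue;   // skip the null result set
            }
            total += input.rowCount;
        }
        return total;
    }
}
```

If {{firstSchema}} returns an empty {{Optional}}, that is the "no data and no schema" result that would bubble up to the parent operator.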

To support the schema "fast track", operators should return an empty batch, 
with just a schema, on the first call to {{next()}}. So, the scanner should 
return an empty batch (with schema) if a reader produces one. That is, it 
should skip null batches but pass an empty batch downstream.
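
One possible reading of this rule, sketched in Java. {{SketchBatch}} and {{FastSchemaScan}} are hypothetical names, not Drill classes, and returning {{null}} from {{next()}} is just a stand-in for "no more data". The sketch also makes an assumption that goes beyond the text above: if the first non-null batch already holds rows, it emits a schema-only batch first and holds the data for the following call.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical batch type: a schema plus a row count.
final class SketchBatch {
    final List<String> schema;   // null models a null batch (no schema)
    final int rowCount;

    SketchBatch(List<String> schema, int rowCount) {
        this.schema = schema;
        this.rowCount = rowCount;
    }
}

final class FastSchemaScan {
    private final Deque<SketchBatch> pending;
    private boolean first = true;

    FastSchemaScan(List<SketchBatch> readerOutput) {
        this.pending = new ArrayDeque<>(readerOutput);
    }

    // Returns the next batch, or null when the scan is exhausted.
    SketchBatch next() {
        // Null batches carry neither schema nor data, so drop them.
        while (!pending.isEmpty() && pending.peek().schema == null) {
            pending.poll();
        }
        if (pending.isEmpty()) {
            return null;
        }
        if (first) {
            first = false;
            SketchBatch head = pending.peek();
            if (head.rowCount > 0) {
                // Assumption: send the schema ahead of the data and
                // keep the data batch queued for the next call.
                return new SketchBatch(head.schema, 0);
            }
        }
        // An empty batch from a reader is passed through unchanged.
        return pending.poll();
    }
}
```

With this shape, downstream operators always see a schema on the first batch whenever any reader found one, and never see a null batch from the scan.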

Again, each operator should, on the first (preferably empty) batch, assemble 
its output schema according to the rules for that operator.

Do we have a spec and/or JIRA that describes the design behind the "fast 
schema" feature added shortly after 1.0? We should consult that to ensure the 
empty batch handling here is consistent with that design.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failures caused by 
> empty batches. This JIRA is opened as an umbrella for all those related JIRAs 
> (such as DRILL-4686, DRILL-4734, DRILL-4476, DRILL-4255, etc.).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
