I think we should start with a much simpler discussion:

If I query an empty file, what should we return?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Sep 18, 2015 at 3:26 PM, Daniel Barclay <[email protected]>
wrote:

> What sequence of RecordBatch.IterOutcome <
> https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106
> > <
> https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41>
> values should ScanBatch's next() return for a reader (file/etc.) that has
> zero rows of data, and what does that sequence depend on (e.g., whether
> there's still a non-empty schema even though there are no rows, whether
> there are other files in the scan)?  [See other questions at bottom.]
>
>
> I'm trying to resolve this question to fix DRILL-2288 <
> https://issues.apache.org/jira/browse/DRILL-2288>. Its initial symptom
> was that INFORMATION_SCHEMA queries that return zero rows because of
> pushed-down filtering yielded results that have zero columns instead of the
> expected columns.  An additional symptom was that "SELECT A, B, *" from an
> empty JSON file yielded zero columns instead of the expected columns A and
> B (with zero rows).
>
> The immediate cause of the problem (the missing schema information) was
> how ScanBatch.next() handled readers that returned no rows:
>
> If a reader has no rows at all, then the first call to its next() method
> (from ScanBatch.next()) returns zero, indicating that there are no more
> rows (and, in this case, that there were never any rows at all).  Also,
> ScanBatch.next()'s call to the reader's mutator's isNewSchema() returns
> true, indicating that the reader has a schema that ScanBatch has not yet
> processed (e.g., notified its caller about).
>
> Because of the way ScanBatch.next()'s code checked those conditions, when
> the last reader had no rows at all, ScanBatch.next() returned
> IterOutcome.NONE.
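>
> (A minimal, hypothetical sketch of that old pattern, with simplified names
> rather than the literal ScanBatch code:)
>
>     // Simplified, hypothetical sketch of the old ScanBatch.next() pattern:
>     int recordCount = currentReader.next();     // 0 => this reader produced no rows
>     boolean newSchema = mutator.isNewSchema();  // true => schema not yet reported to the caller
>     if (recordCount == 0 && noMoreReaders) {
>       // Old behavior: report end-of-data immediately, even though newSchema
>       // may be true and OK_NEW_SCHEMA was never returned for this ScanBatch.
>       return IterOutcome.NONE;
>     }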
>
> However, when that /last/ reader was also the /only/ reader, that return of
> IterOutcome.NONE for a no-rows reader meant that ScanBatch.next() never
> returned IterOutcome.OK_NEW_SCHEMA at all for that ScanBatch.
>
> That immediate return of NONE in turn meant that the downstream operator
> _never received a return value of OK_NEW_SCHEMA to trigger its schema
> processing_.  (For example, in the DRILL-2288 JSON case, the project
> operator never constructed its own schema containing columns A and B plus
> whatever columns (none) came from the empty JSON file; in DRILL-2288's
> other case, the caller never propagated the statically known columns from
> the INFORMATION_SCHEMA table.)
>
> That returning of NONE without ever returning OK_NEW_SCHEMA also violates
> the (apparent) intended call/return protocol (sequence of IterOutcome
> values) for RecordBatch.next(). (See the draft Javadoc comments currently
> at RecordBatch.IterOutcome <
> https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106
> >.)
>
>
> Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA
> before returning NONE, instead of immediately returning NONE, in at least
> _some_ cases of readers/files with zero rows.  (It must both notify the
> downstream caller that there is a schema /and/ give the caller a chance to
> read the schema (which is allowed after OK_NEW_SCHEMA is returned but not
> after NONE).)
>
> However, it is not clear exactly what that set of cases is.  (It does not
> seem to be _all_ zero-row cases--returning OK_NEW_SCHEMA before returning
> NONE in all zero-row cases causes lots of errors about schema changes.)
>
> At a higher level, the question is how zero-row files/etc. should interact
> with sibling files/etc. (i.e., when they do and don't cause a schema
> change).  Note that some kinds of files/sources still have a schema even
> when they have zero rows of data (e.g., Parquet files, right?), while other
> kinds of files/sources can't define (imply) any schema unless they have at
> least one row (e.g., JSON files).
>
>
> In my in-progress fix <
> https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659> for
> DRILL-2288, I have currently changed ScanBatch.next() so that, when the
> last reader has zero rows and next() would have returned NONE, next() now
> checks whether it has already returned OK_NEW_SCHEMA (for any earlier
> files/readers); if it has not, it now returns OK_NEW_SCHEMA, and it still
> returns NONE if it has.  (Note that, currently, that is regardless of
> whether the reader has no schema (as from an empty JSON file) or has a
> schema.)
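>
> (Again as a simplified, hypothetical sketch, not the literal code in that
> branch:)
>
>     // Sketch of the in-progress change (hypothetical names):
>     if (recordCount == 0 && noMoreReaders) {
>       if (!returnedOkNewSchema) {
>         // Make sure the downstream operator sees OK_NEW_SCHEMA (and can
>         // read the schema) at least once before it gets NONE.
>         returnedOkNewSchema = true;
>         return IterOutcome.OK_NEW_SCHEMA;
>       }
>       return IterOutcome.NONE;
>     }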
>
> That change fixed the DRILL-2288 symptoms (apparently by giving
> downstream/calling operators notification that they didn't get before).
>
> The change initially caused problems in UnionAllRecordBatch, because its
> code checked for NONE vs. OK_NEW_SCHEMA to try to detect empty inputs
> rather than checking directly. UnionAllRecordBatch has been fixed (in the
> in-progress fix for DRILL-2288).
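>
> (Roughly, the direct check is of this form; a hypothetical sketch, not the
> actual UnionAllRecordBatch code:)
>
>     // Decide "empty input" from the batch itself (e.g., its record count),
>     // not from whether next() happened to return NONE vs. OK_NEW_SCHEMA:
>     boolean inputIsEmpty = (incoming.getRecordCount() == 0);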
>
> However, that change still causes other schema-change problems.  The
> additional returns of OK_NEW_SCHEMA are causing some code to perceive
> unprocessable schema changes.  It is not yet clear whether the code should
> be checking the number of rows too, or OK_NEW_SCHEMA shouldn't be returned
> in as many subcases of the no-rows last-reader/file case.
>
>
> So, some open and potential questions seem to be:
>
> 1. Is it the case that a) any batch's next() should return OK_NEW_SCHEMA
> before it returns NONE, and callers/downstream batches should be able to
> count on getting OK_NEW_SCHEMA (e.g., to trigger setting up their
> downstream schemas), or that b) empty files can cause next() to return NONE
> without ever returning OK_NEW_SCHEMA, and therefore all downstream batch
> classes must handle getting NONE before they have set up their schemas?
> 2. For a file/source kind that has a schema even when there are no rows,
> should getting an empty file constitute a schema change?  (On one hand
> there are no actual /rows/ (following the new schema) conflicting with any
> previous schema (and maybe rows), but on the other hand there is a
> non-empty /schema/ that can conflict when that's enough to matter.)
> 3. For a file/source kind that implies a schema only when there are rows
> (e.g., JSON), when should or shouldn't that be considered a schema change?
> If ScanBatch reads non-empty JSON file A, reads empty JSON file B, and
> reads non-empty JSON file C implying the same schema as A did, should that
> be considered a schema change or not?  (When reading
> no-/empty-schema B, should ScanBatch keep the schema from A and check
> against that when it gets to C, effectively ignoring the existence of B
> completely?)
> 4. In ScanBatch.next(), when the last reader had no rows at all, when
> should next() return OK_NEW_SCHEMA? always? /iff/ the reader has a
> non-empty schema?  just enough to never return NONE before returning
> OK_NEW_SCHEMA (which means it acts differently for otherwise-identical
> empty files, depending on what happened with previous readers)?  as in that
> last case except only if the reader has a non-empty schema?
>
> Thanks,
> Daniel
>
> --
> Daniel Barclay
> MapR Technologies
>
>
