Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA

Daniel Barclay Mon, 21 Sep 2015 10:36:16 -0700

Jacques Nadeau wrote:

I think we should start with a much simpler discussion:


If I query an empty file, what should we return?


One thing to clarify is the difference between empty files (e.g., a
zero-byte .json file) and other zero-row files (e.g., a non-empty file
that is a Parquet file that represents no rows but still carries a
schema).

(We might also need to distinguish between types of empty files (e.g.,
.json files, where without any row objects we know nothing about the
schema, vs. .csv files, where we can know that there's one logical
column of type VARCHAR ARRAY, even though we don't know the length).)


Here's one possible view of the JSON-style case of empty files:

An empty file implies no rows of data and implies no columns in the
schema.  (I don't mean that it implies that there are no columns; I
just mean that it does not imply any columns.)  It also most probably
does not imply the absence of any columns either--so it never
conflicts with another schema.

That non-implication of columns (i.e., the schema) means that:
- taking that file by itself as a table yields no rows and an empty
  schema, and
- taking that files as part of taking a subtree of files as a table
  means that the empty file never causes a conflict with the schema
  from other files in that subtree.

(The only thing that the empty file would imply would be the type (file
name extension) of other files when an ancestor directory is taken as a
table.  (That's assuming we don't allow mixing, say, JSON and CSV files
in the same subtree.))


Daniel


--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Sep 18, 2015 at 3:26 PM, Daniel Barclay <dbarc...@maprtech.com>
wrote:

What sequence of RecordBatch.IterOutcome<
https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106

https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41>
values should ScanBatch's next() return for a reader (file/etc.) that has
zero rows of data, and what does that sequence depend on (e.g., whether
there's still a non-empty schema even though there are no rows, whether
there other files in the scan)?  [See other questions at bottom.]


I'm trying to resolve this question to fix DRILL-2288 <
https://issues.apache.org/jira/browse/DRILL-2288>. Its initial symptom
was that INFORMATION_SCHEMA queries that return zero rows because of
pushed-down filtering yielded results that have zero columns instead of the
expected columns.  An additional symptom was that "SELECT A, B, *" from an
empty JSON file yielded zero columns instead of the expected columns A and
B (with zero rows).

The immediate cause of the problem (the missing schema information) was
how ScanBatch.next() handled readers that returned no rows:

If a reader has no rows at all, then the first call to its next() method
(from ScanBatch.next()) returns zero (indicating that there are no more
rows, and, in this case, no rows at all), and ScanBatch.next()'s call to
the reader's mutator's isNewSchema() returns true, indicating that the
reader has a schema that ScanBatch has not yet processed (e.g., notified
its caller about).

The way ScanBatch.next()'s code checked those conditions, when the last
reader had no rows at all, ScanBatch.next() returned IterOutcome.NONE.

However, when that /last /reader was the /only /reader, that returning of
IterOutcome.NONE for a no-rows reader by ScanBatch.next() meant that next()
never returned IterOutcome.OK_NEW_SCHEMA for that ScanBatch.

That immediate return of NONE in turn meant that the downstream operator
_never received a return value of __OK_NEW_SCHEMA__to trigger its schema
processing_.  (For example, in the DRILL-2288 JSON case, the project
operator never constructed its own schema containing columns A and B plus
whatever columns (none) came from the empty JSON file; in DRILL-2288 's
other case, the caller never propagated the statically known columns from
the INFORMATION_SCHEMA table.)

That returning of NONE without ever returning OK_NEW_SCHEMA also violates
the (apparent) intended call/return protocol (sequence of IterOutcome
values) for RecordBatch.next(). (See the draft Javadoc comments currently
at RecordBatch.IterOutcome <
https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106

.)



Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA
before returning NONE, instead of immediately returning NONE, for
readers/files with zero rows for at least _some_ cases.  (It must both
notify the downstream caller that there is a schema /and/ give the caller a
chance to read the schema (which is allowed after OK_NEW_SCHEMA is returned
but not after NONE).)

However, it is not clear exactly what that set of cases is.  (It does not
seem to be _all_ zero-row cases--returning OK_NEW_SCHEMA before returning
NONE in all zero-row cases causes lots of errors about schema changes.)

At a higher level, the question is how zero-row files/etc. should interact
with sibling files/etc. (i.e., when they do and don't cause a schema
change).  Note that some kinds of files/sources still have a schema even
when they have zero rows of data (e.g., Parquet files, right?), while other
kinds of files/source can't define (imply) any schema unless they have at
least one row (e.g., JSON files).


In my in-progress fix <
https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659>for
DRILL-2288, I have currently changed ScanBatch.next()so that when the last
reader has zero rows and next()would have returned NONE, next() now checks
whether it has returned OK_NEW_SCHEMA yet (per any earlier files/readers),
and, if so, now returns OK_NEW_SCHEMA, still returning NONE if not.  (Note
that, currently, that is regardless of whether the reader has no schema (as
from an empty JSON file) or has a schema.)

That change fixed the DRILL-2288 symptoms (apparently by giving
downstream/calling operators notification that they didn't get before).

The change initially caused problems in UnionAllRecordBatch, because its
code checked for NONE vs. OK_NEW_SCHEMA to try to detect empty inputs
rather than checking directly. UnionAllRecordBatch has been fixed (in the
in-progress fix for DRILL-2288).

However, that change still causes other schema-change problems.  The
additional returns of OK_NEW_SCHEMA are causing some code to perceive
unprocessable schema changes.  It is not yet clear whether the code should
be checking the number of rows too, or OK_NEW_SCHEMA shouldn't be returned
in as many subcases of the no-rows last-reader/file case.


So, some open and potential questions seem to be:

1. Is it the case that a) any batch's next() should return OK_NEW_SCHEMA
before it returns NONE, and callers/downstream batches should be able to
count on getting OK_NEW_SCHEMA (e.g., to trigger setting up their
downstream schemas), or that b) empty files can cause next() to return NONE
without ever returning OK_NEW_SCHEMA , and therefore all downstream batch
classes must handle getting NONE before they have set up their schemas?
2. For a file/source kind that has a schema even when there are no rows,
should getting an empty file constitute a schema change?  (On one hand
there are no actual /rows/ (following the new schema) conflicting with any
previous schema (and maybe rows), but on the other hand there is a
non-empty /schema /that can conflict when that's enough to matter.)
3. For a file/source kind that implies a schema only when there are rows
(e.g., JSON), when should or shouldn't that be considered a schema change?
If ScanBatch reads non-empty JSON file A, reads empty JSON file B, and
reads non-empty JSON file C implying the same schema as A did, should that
be considered to not be schema change or not?  (When reading
no-/empty-schema B, should ScanBatch the keep the schema from A and check
against that when it gets to C, effectively ignoring the existence of B
completely?)
4. In ScanBatch.next(), when the last reader had no rows at all, when
should next() return OK_NEW_SCHEMA? always? /iff/ the reader has a
non-empty schema?  just enough to never return NONE before returning
OK_NEW_SCHEMA (which means it acts differently for otherwise-identical
empty files, depending on what happened with previous readers)?  as in that
last case except only if the reader has a non-empty schema?

Thanks,
Daniel

--
Daniel Barclay
MapR Technologies



--
Daniel Barclay
MapR Technologies

Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA

Reply via email to