Hi Chris,
Great suggestion. As it turns out, however, when we've looked at this in the
past, it simply shifts the problem from one place to another. I'll walk through
the issues because it is helpful for us all to understand what we're up against.
Consider a single file with 100K null values, followed by 150K Varchar values.
Drill works in batches and will send a series of, say, 5 batches to the client
(each with 50K records). The client will start by seeing 2 batches of nulls.
The app using JDBC or ODBC needs to know the type in order to allocate memory
for the data. But, what type?
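To make that concrete, here is roughly what every JDBC client does (a minimal
sketch; the connection URL, file path, and column name are made up):

    import java.sql.*;

    public class SchemaUpFront {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT a FROM dfs.`/tmp/nulls-then-varchars.csv`")) {
                // JDBC hands the app the schema before the first row is
                // fetched. At this point Drill has seen only nulls.
                int type = rs.getMetaData().getColumnType(1);
                while (rs.next()) {
                    // The app allocates buffers based on `type`; it has no
                    // way to change its mind when a later batch turns out
                    // to be Varchar.
                    String value = rs.getString(1);
                }
            }
        }
    }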
Later, when Drill sees the first Varchar, it can change the type from, say,
batch 3 onwards. But, JDBC and ODBC generally require the schema be known up
front, so they would have to predict the future to know that a Varchar will
eventually appear.
Drill could buffer the result set, waiting for the schema change. Once it
occurs, Drill could rewrite the prior data to the newly discovered type. But,
since Drill has no idea when (or if) a schema change will occur, and the data
could be unlimited in size, Drill would have to buffer the entire result set.
That hurts latency: in the general case, the entire result set would be
spilled to disk, then read back, before the first records could be sent to the
client.
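In pseudo-Java, the buffering approach would have to look something like the
following. All of the types and helpers here are hypothetical stand-ins meant
only to show the control flow; none of this is real Drill code:

    import java.util.*;

    // Hypothetical sketch only; not Drill's actual pipeline.
    class BufferUntilTyped {
        interface Batch { boolean isAllNulls(); String type(); }
        interface Upstream { Batch next(); }  // returns null at end of stream

        static void run(Upstream upstream) {
            List<Batch> buffered = new ArrayList<>();
            Batch b;
            while ((b = upstream.next()) != null && b.isAllNulls()) {
                buffered.add(spillIfTooLarge(b));  // unbounded: may spill to disk
            }
            // Only now do we know the real type -- if the data ever told us.
            String resolved = (b == null) ? "NULL (never resolved)" : b.type();
            for (Batch old : buffered) {
                send(rewrite(old, resolved));  // re-read spill, rewrite nulls
            }
            if (b != null) {
                send(b);
            }
        }

        // Stubs, just to keep the sketch self-contained.
        static Batch spillIfTooLarge(Batch b) { return b; }
        static Batch rewrite(Batch b, String type) { return b; }
        static void send(Batch b) { }
    }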
JDBC, at least, allows the idea of multiple result sets (used for stored
procedures). Drill could send a first result set with NULL as the type, then a
second when the type changed to Varchar. This pushes the schema change up to
the app using JDBC, but most apps are not set up to handle multiple result
sets. (Indeed, what would the app do if it requires tabular data with
consistent types?)
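For reference, the standard JDBC idiom for consuming multiple result sets
looks like this; few analytics tools bother to implement it. (Continuing from
the earlier sketch; process() stands in for the app's own row handling.)

    boolean isResultSet = stmt.execute(
        "SELECT a FROM dfs.`/tmp/nulls-then-varchars.csv`");
    while (true) {
        if (isResultSet) {
            try (ResultSet rs = stmt.getResultSet()) {
                // Each result set carries its own schema: first NULL,
                // then Varchar in the scheme described above.
                process(rs);
            }
        }
        isResultSet = stmt.getMoreResults();
        if (!isResultSet && stmt.getUpdateCount() == -1) {
            break;  // no more results of any kind
        }
    }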
Within Drill, suppose that the query includes a sort, and that memory limits
require spilling. The first two batches, with just nulls, will be spilled to
disk in one format. The third and later batches have a different schema. So,
the code that handles spilling must merge the two types. Not a big problem,
but the fix must be applied in multiple places, in different ways. It is very
difficult to test all the resulting combinations and permutations.
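The reconciliation itself is simple enough; the trouble is that something like
the following has to be replicated, in slightly different forms, in the sort's
spill/merge code, in the shuffle receivers described next, and so on. (A
hypothetical sketch, not Drill's actual type-promotion rules:)

    // Hypothetical sketch of merging the column type of two spilled runs.
    enum ColType { UNTYPED_NULL, VARCHAR, INT /* ... */ }

    static ColType merge(ColType a, ColType b) {
        if (a == ColType.UNTYPED_NULL) return b;  // all-null run adopts the other type
        if (b == ColType.UNTYPED_NULL) return a;
        if (a == b) return a;
        throw new IllegalStateException("Incompatible types: " + a + ", " + b);
    }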
Suppose that the file in question is a CSV file that can be split, and that we
have, say, five concurrent readers, each reading 50K records. Two readers read
the nulls and shuffle them to other nodes. Three readers read the Varchars and
shuffle that data to other nodes. The receiving nodes see, effectively, a
random arrival of data: some see Varchar first, followed by nulls; some the
other way around. Again, we could write code to merge these types in each
receiver.
But, if we scale up the file, we may find that some receivers see many batches
of nulls before the first Varchar arrives, which pushes the problem up the
DAG and, eventually, to the client, as described above.
We seem to be forced to the conclusion that the only solution is to predict
the future (so that we know to use Varchar in the reader, even when we see
only nulls in the file). The only way to do that (without access to time
travel) is for metadata to tell us: "data for this column, when it eventually
appears, will be Varchar."
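Entirely hypothetical, but the reader-side use of such a hint could be as
simple as the sketch below; the hard part is deciding where the metadata lives
and how users provide it:

    import java.util.Map;

    // Hypothetical: column name -> declared type, supplied by metadata.
    class SchemaHints {
        private final Map<String, String> hints;

        SchemaHints(Map<String, String> hints) { this.hints = hints; }

        // Called when a reader has seen only nulls for a column.
        String typeForAllNullColumn(String column) {
            // Fall back to today's guess, nullable INT, if no hint exists.
            return hints.getOrDefault(column, "INT");
        }
    }

    // new SchemaHints(Map.of("x", "VARCHAR")).typeForAllNullColumn("x")
    //   --> "VARCHAR", even before any Varchar data appears.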
Does this make sense? Are we overlooking an alternative solution?
Thanks,
- Paul
On Tuesday, August 21, 2018, 3:45:55 PM PDT, Chris Cunningham
<[email protected]> wrote:
Hi. Mostly off topic, but reading about this issue has finally prompted a
response.
On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers <[email protected]>
wrote:
<snip>
> If we provide schema hints ("field x, when it appears, will be a Double"),
> then Drill need not fight with the problem that the data will be Double in
> some files, Nullable Int in others. The schema hint removes the ambiguity.
>
</snip>
How hard would it be to have Drill 'fix' the nullable Int problem by just
noting that a column is nullable, with no other type information assumed,
until it encounters an actual type, and then requiring that type from then
on? If an entire file has only nulls, then it doesn't define the field at all;
only files that have the type define the field. Then, when files are combined
later, would an error be thrown only IF there are conflicting non-null types?
Thanks,
Chris