Hi Chris,
Great suggestion. As it turns out, however, when we've looked at this in the
past, it simply shifts the problem from one place to another. I'll walk through
the issues because it is helpful for us all to understand what we're up against.
Consider a single file with 100K null values, followed by 150K Varchar values.
Drill works in batches and will send a series of, say, 5 batches to the client
(each with 50K records). The client will start by seeing 2 batches of nulls.
The app using JDBC or ODBC needs to know the type in order to allocate memory
for the data. But, what type?
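To make that concrete, here is roughly what every JDBC client does (a minimal
sketch; the connection URL, file path, and column name are made up):

    import java.sql.*;

    public class SchemaUpFront {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT a FROM dfs.`/tmp/nulls-then-varchars.csv`")) {
                // JDBC hands the app the schema before the first row is
                // fetched. At this point Drill has seen only nulls.
                int type = rs.getMetaData().getColumnType(1);
                while (rs.next()) {
                    // The app allocates buffers based on `type`; it has no
                    // way to change its mind when a later batch turns out
                    // to be Varchar.
                    String value = rs.getString(1);
                }
            }
        }
    }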
Later, when Drill sees the first Varchar, it can change the type from, say,
batch 3 onwards. But, JDBC and ODBC generally require the schema be known up
front, so they would have to predict the future to know that a Varchar will
eventually appear.
Drill could buffer the result set, waiting for the schema change. Once it
occurs, Drill could rewrite the prior data to the newly discovered type. But,
since Drill has no idea when (or if) a schema change will occur, and the data
could be unlimited in size, Drill would have to buffer the entire result set.
That hurts latency: in the general case, the entire result set would be
spilled to disk, then read back, before the first records could be sent to the
client.
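In pseudo-Java, the buffering approach would have to look something like the
following. All of the types and helpers here are hypothetical stand-ins meant
only to show the control flow; none of this is real Drill code:

    import java.util.*;

    // Hypothetical sketch only; not Drill's actual pipeline.
    class BufferUntilTyped {
        interface Batch { boolean isAllNulls(); String type(); }
        interface Upstream { Batch next(); }  // returns null at end of stream

        static void run(Upstream upstream) {
            List<Batch> buffered = new ArrayList<>();
            Batch b;
            while ((b = upstream.next()) != null && b.isAllNulls()) {
                buffered.add(spillIfTooLarge(b));  // unbounded: may spill to disk
            }
            // Only now do we know the real type -- if the data ever told us.
            String resolved = (b == null) ? "NULL (never resolved)" : b.type();
            for (Batch old : buffered) {
                send(rewrite(old, resolved));  // re-read spill, rewrite nulls
            }
            if (b != null) {
                send(b);
            }
        }

        // Stubs, just to keep the sketch self-contained.
        static Batch spillIfTooLarge(Batch b) { return b; }
        static Batch rewrite(Batch b, String type) { return b; }
        static void send(Batch b) { }
    }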
JDBC, at least, allows the idea of multiple result sets (used for stored
procedures). Drill could send a first result set with NULL as the type, then a
second when the type changed to Varchar. This pushes the schema change up to
the app using JDBC, but most apps are not set up to handle multiple result
sets. (Indeed, what would the app do if it requires tabular data with
consistent types?)
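For reference, the standard JDBC idiom for consuming multiple result sets
looks like this; few analytics tools bother to implement it. (Continuing from
the earlier sketch; process() stands in for the app's own row handling.)

    boolean isResultSet = stmt.execute(
        "SELECT a FROM dfs.`/tmp/nulls-then-varchars.csv`");
    while (true) {
        if (isResultSet) {
            try (ResultSet rs = stmt.getResultSet()) {
                // Each result set carries its own schema: first NULL,
                // then Varchar in the scheme described above.
                process(rs);
            }
        }
        isResultSet = stmt.getMoreResults();
        if (!isResultSet && stmt.getUpdateCount() == -1) {
            break;  // no more results of any kind
        }
    }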
Within Drill, suppose that the query includes a sort, and that memory limits
require spilling. The first two batches, with just nulls, will be spilled to
disk in one format. The third and later batches have a different schema. So,
the code that handles spilling must merge the two types. Not a big problem,
but the fix must be applied in multiple places, in different ways. It is very
difficult to test all the resulting combinations and permutations.
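The reconciliation itself is simple enough; the trouble is that something like
the following has to be replicated, in slightly different forms, in the sort's
spill/merge code, in the shuffle receivers described next, and so on. (A
hypothetical sketch, not Drill's actual type-promotion rules:)

    // Hypothetical sketch of merging the column type of two spilled runs.
    enum ColType { UNTYPED_NULL, VARCHAR, INT /* ... */ }

    static ColType merge(ColType a, ColType b) {
        if (a == ColType.UNTYPED_NULL) return b;  // all-null run adopts the other type
        if (b == ColType.UNTYPED_NULL) return a;
        if (a == b) return a;
        throw new IllegalStateException("Incompatible types: " + a + ", " + b);
    }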
Suppose that the file in question is a CSV file that can be split, and that we
have, say, five concurrent readers, each reading 50K records. Two readers read
the nulls and shuffle them to other nodes. Three readers read the Varchars and
shuffle that data to other nodes. The receiving nodes see, effectively, a
random arrival of data: some see Varchar first, followed by nulls; some the
other way around. Again, we could write code to merge these types in each
receiver.
But, if we scale up the file, we may find that some receivers see many batches
of nulls before the first Varchar arrives, which pushes the problem up the
DAG and, eventually, to the client, as described above.
We seem to be forced to the conclusion that the only solution is to predict
the future (so that we know to use Varchar in the reader, even when we see
only nulls in the file). The only way to do that (without access to time
travel) is for metadata to tell us: "data for this column, when it eventually
appears, will be Varchar."
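Entirely hypothetical, but the reader-side use of such a hint could be as
simple as the sketch below; the hard part is deciding where the metadata lives
and how users provide it:

    import java.util.Map;

    // Hypothetical: column name -> declared type, supplied by metadata.
    class SchemaHints {
        private final Map<String, String> hints;

        SchemaHints(Map<String, String> hints) { this.hints = hints; }

        // Called when a reader has seen only nulls for a column.
        String typeForAllNullColumn(String column) {
            // Fall back to today's guess, nullable INT, if no hint exists.
            return hints.getOrDefault(column, "INT");
        }
    }

    // new SchemaHints(Map.of("x", "VARCHAR")).typeForAllNullColumn("x")
    //   --> "VARCHAR", even before any Varchar data appears.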
Does this make sense? Are we overlooking an alternative solution?
Thanks,
- Paul
On Tuesday, August 21, 2018, 3:45:55 PM PDT, Chris Cunningham
<[email protected]> wrote:
Hi. Mostly off topic, but reading about this issue has finally prompted a
response.
On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers <[email protected]>
wrote:
<snip>
> If we provide schema hints ("field x, when it appears, will be a Double"),
> then Drill need not fight with the problem that the data will be Double in
> some files, Nullable Int in others. The schema hint removes the ambiguity.
>
</snip>
How hard would it be to have Drill 'fix' the nullable Int problem by just
noting that a column is nullable, with no other type information assumed,
until it encounters an actual type, and then requiring that type from then
on? If an entire file has only nulls, then it doesn't define the field at all;
only files that have the type define the field. Then, when files are combined
later, would an error be thrown only IF there are conflicting non-null types?
Thanks,
Chris