Hi Chris,

Great suggestion. As it turns out, however, when we've looked at this in the past, it simply shifts the problem from one place to another. I'll walk through the issues, because it is helpful for us all to understand what we're up against.
Consider a single file with 100K nulls, followed by 150K Varchars. Drill works in batches and will send the client a series of, say, 5 batches (each with 50K records). The client will start by seeing two batches of nulls. The app using JDBC or ODBC needs to know the column type in order to allocate memory for the data. But what type? Later, when Drill sees the first Varchar, it can change the type from, say, batch 3 onwards. But JDBC and ODBC generally require that the schema be known up front, so they would have to predict the future to know that a Varchar will eventually appear.

Drill could buffer the result set while waiting for the schema change. Once it occurs, Drill could rewrite the prior data to the newly discovered type. But since Drill has no idea when (or if) a schema change will occur, and the data could be unlimited in size, Drill would have to buffer the entire result set. That hurts latency: in the general case, the entire result set would be spilled to disk, then read back, before the first records are sent to the client.

JDBC, at least, allows the idea of multiple result sets (used for stored procedures). Drill could send a first result set with NULL as the type, then a second once the type changes to Varchar. This pushes the schema change to the app using JDBC, but most apps are not set up to handle multiple result sets. (Indeed, what would the app do if it requires tabular data with consistent types?)

Within Drill, suppose that the query includes a sort, and that memory limits require spilling. The first two batches, with just nulls, will be spilled to disk in one format. The third and later batches have a different schema. So the code that handles spilling must merge the two types. Not a big problem, but the fix must be applied in multiple places, in different ways, and it is very difficult to test all the resulting combinations and permutations.
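To make the dilemma concrete, here is a toy sketch (illustrative Python only; the function names and batch representation are invented for this email, not Drill or JDBC APIs). It contrasts a streaming client, which must declare the type from the first batch, with a client that buffers until a non-null value appears:

```python
# Toy simulation of the "nulls first, Varchars later" problem.
# Each batch is just a list of values; None stands for SQL NULL.

def naive_type(batches):
    """Declare the column type from the first batch only --
    what a streaming JDBC/ODBC client is forced to do."""
    first = batches[0]
    return "VARCHAR" if any(v is not None for v in first) else "NULL"

def buffered_type(batches):
    """Buffer batches until a non-null value reveals the true type.
    Correct, but the rows held before the type is known can grow
    without bound."""
    buffered = []
    for batch in batches:
        buffered.extend(batch)
        if any(v is not None for v in batch):
            return "VARCHAR", len(buffered)  # rows buffered before type known
    return "NULL", len(buffered)

# Two batches of nulls, then a batch of varchars:
batches = [[None] * 3, [None] * 3, ["a", "b", "c"]]

print(naive_type(batches))     # "NULL" -- wrong once varchars arrive
print(buffered_type(batches))  # ("VARCHAR", 9) -- right, but all rows buffered
```

The streaming client guesses wrong; the buffering client is correct but has held every row seen so far before it could emit the first one, which is exactly the latency problem described above.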
Suppose that the file in question is a CSV file that can be split, and that we have, say, five concurrent readers, each reading 50K records. Two readers read the nulls and shuffle them to other nodes. Three readers read the Varchars and shuffle their data to other nodes. Each receiving node sees, effectively, a random arrival of data: some start with Varchar followed by null, some the other way around. Again, we could write code to merge these types in each receiver. But if we scale up the file, we may find that some receivers see many batches of nulls before they see a Varchar, which pushes the problem up the DAG and eventually to the client, as described above.

We seem to be forced to the conclusion that the only solution is to predict the future (so we know to use Varchar in the reader, even when we see only nulls in the file). The only way to do that (without access to time travel) is for metadata to tell us: "data for this column, when it eventually appears, will be Varchar."

Does this make sense? Are we overlooking an alternative solution?

Thanks,
- Paul

On Tuesday, August 21, 2018, 3:45:55 PM PDT, Chris Cunningham <cunningham...@gmail.com> wrote:

Hi. Mostly off topic, but reading about this issue has finally prompted a response.

On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:
<snip>
> If we provide schema hints ("field x, when it appears, will be a Double"),
> then Drill need not fight with the problem that the data will be Double in
> some files, Nullable Int in others. The schema hint removes the ambiguity.
</snip>

How hard would it be to have Drill 'fix' the null Int problem by just noting that the column is nullable, with no other type information assumed until it encounters an actual type, and then requiring that type from then on?
If an entire file has only nulls, then it doesn't define the field at all; only files that have the type define the field. Then, when they are combined later, would the error be thrown only IF there are conflicting non-null types?

Thanks,
Chris
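The contrast between per-reader type guessing and Paul's metadata-hint proposal can be shown in a toy sketch (illustrative Python only; `SCHEMA_HINT` and `reader_type` are invented names for this thread, not Drill code):

```python
# Toy simulation: several receivers see batches of column "x" in
# random order. Without a hint, each batch forces a local guess
# (Drill's historical guess for an all-null column is nullable INT);
# with a hint, every reader agrees on the type up front.

SCHEMA_HINT = {"x": "VARCHAR"}  # "column x, when it appears, is VARCHAR"

def reader_type(batch, hints):
    """Pick the type a reader would assign to column "x" for one batch."""
    if "x" in hints:
        return hints["x"]  # metadata decides; no guessing needed
    # No hint: an all-null batch forces a placeholder type.
    return "VARCHAR" if any(v is not None for v in batch) else "NULLABLE INT"

# Random arrival order: varchar-first, all-null, and null-then-varchar.
arrivals = [["a", None], [None, None], [None, "b"]]

without_hint = {reader_type(b, {}) for b in arrivals}
with_hint = {reader_type(b, SCHEMA_HINT) for b in arrivals}

print(without_hint)  # conflicting schemas across receivers
print(with_hint)     # one consistent schema everywhere
```

With no hint, the all-null batch yields a different type than its neighbors, so downstream operators must reconcile conflicting schemas; with the hint, every batch is typed identically from the start, which is the ambiguity-removal argument made earlier in the thread.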