Hi Paul,
Thanks for the feedback!  I am completely in favor of doing schema
discovery and schema hinting.  But even on this list we have previously
discussed other use cases, such as IoT devices, where schema-on-read is
needed (I think it was in the context of the 'death of schema-on-read'
email thread).  As I mentioned in my prior email, JSON document databases
don't have a pre-defined schema, and even if one does schema discovery, it
will have to be continuously updated, given that these DBs are used in
operational applications where data streams in at a fast rate.

I think we should try for a complementary approach: wherever schema
discovery or hinting is feasible, Drill would use it.  For other
scenarios, can we make a best effort and not fail the query?

Note that I don't want to backtrack and revise the data types of the rows
already sent to the client.  In fact, today, if you have 2 files with
different schemas and the columns are projected as below, the query will
return data to the client in separate batches.  This is common among
Drill users doing data exploration (with a LIMIT clause).
(file 1: {a: 10, b: 20.5}   file 2: {a: "cat", b: "dog"})

0: jdbc:drill:zk=local> select a, b from dfs.`/tmp/table2`;
+------+-------+
|  a   |   b   |
+------+-------+
| 10   | 20.5  |
| cat  | dog   |
+------+-------+

You mention 'Drill can't predict the future', which is true, and I am
saying we don't need to predict the future.  If all operators did what the
Scan readers do, which is emit a new record batch when they encounter a new
schema, then conceptually it would get us much farther along.
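
To make that concrete, here is a rough sketch of the pattern I have in
mind.  The types and names below are stand-ins for illustration only, not
Drill's actual RecordBatch / operator APIs:

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative stand-in for a real batch: a schema plus its rows.
class Batch {
  final String schema;
  final List<Object[]> rows = new ArrayList<>();
  Batch(String schema) { this.schema = schema; }
}

// A non-blocking operator in the style of the Scan readers: when the
// incoming schema differs from the batch under construction, emit the
// in-progress batch and start a new one under the new schema.
class PassThrough {
  private Batch pending;

  // Returns a completed batch when a schema change forces one out;
  // otherwise null (the row was appended to the pending batch).
  Batch consume(String incomingSchema, Object[] row) {
    Batch completed = null;
    if (pending != null && !Objects.equals(pending.schema, incomingSchema)) {
      completed = pending;   // flush rows accumulated under the old schema
      pending = null;
    }
    if (pending == null) {
      pending = new Batch(incomingSchema);
    }
    pending.rows.add(row);
    return completed;
  }
}

Downstream operators would then see the equivalent of an OK_NEW_SCHEMA
outcome on the next batch, which is already how the Scan readers behave
today.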

The point is: let's assume the client side is able to handle 2 different
schemas; how can Drill internally handle that in the execution plan?  For
the non-blocking operators it means that as soon as the schema changes, the
operator emits the previous record batch and starts a new output batch.
For the blocking operators, there are more things to take care of, and I
created DRILL-6829 <https://issues.apache.org/jira/browse/DRILL-6829> to
capture that.
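
For the blocking case, here is a sketch of just one design option (not
necessarily what DRILL-6829 will settle on): a sort could buffer rows per
schema, sort each group independently, and emit one sorted run per schema
rather than failing the query.  Again, the types below are stand-ins, not
Drill's actual ExternalSort:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a schema-aware blocking sort: rows are grouped by the schema
// they arrived under, each group is sorted on its own, and the operator
// emits one sorted run per schema.
class SchemaAwareSort<T> {
  private final Map<String, List<T>> runsBySchema = new LinkedHashMap<>();
  private final Comparator<T> comparator;

  SchemaAwareSort(Comparator<T> comparator) { this.comparator = comparator; }

  void buffer(String schema, T row) {
    runsBySchema.computeIfAbsent(schema, s -> new ArrayList<>()).add(row);
  }

  // Called once the upstream is exhausted.  Rows are totally ordered
  // within each schema's run; ordering *across* runs would need a type
  // resolution rule, which is exactly the open question.
  List<List<T>> sortedRuns() {
    List<List<T>> runs = new ArrayList<>();
    for (List<T> rows : runsBySchema.values()) {
      rows.sort(comparator);
      runs.add(rows);
    }
    return runs;
  }
}

Whether and how to define a total order across runs with different types
(an implicit cast, a user-supplied resolution rule, etc.) is the kind of
question DRILL-6829 is meant to capture.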

Aman

On Mon, Nov 5, 2018 at 8:50 PM Paul Rogers <[email protected]>
wrote:

> Hi Aman,
>
> Thanks much for the write-up. My two cents, FWIW.
>
> As the history of this list has shown, I've fought with the schema change
> issue multiple times: in sort, in JSON, in the row set loader framework,
> and in writing the "Data Engineering" chapter in the Learning Drill book.
>
> What I have come to realize is that there is no general solution to the
> schema change problem. Yes, there are clever things to do in special cases.
> But the general problem is unsolvable.
>
> Look at the open PR for the projection framework. There is an
> implementation of a "schema smoother." It tries really hard, but it
> highlights the inherent limitations of such an effort.
>
> The key reason is that, to do a good job, rows processed now must know the
> types of rows seen 100 million rows from now. Since Drill does not have a
> time machine, that is not possible.
>
> The easiest way to visualize this is with a single fragment that reads two
> files. File A has 100K rows with column C as a Varchar. File B has 100K
> rows with column C as an Int. There is no sort, so all rows are returned
> directly to the client as, say, four 50K batches.
>
> The client will encounter a schema with C as Varchar. Later, it will see C
> as an Int. But, since the client already told the JDBC consumer that the
> type is Varchar, the JDBC client is stuck. It could convert the Int to
> Varchar behind the scenes.
>
> Now, run the query again. The order in which Drill reads files is random.
> The second time, the client sees C as an Int. Now, JDBC must convert the
> later Varchar columns to Int. That works if the Varchar values are numbers,
> but not if the Ints should have been Varchar.
>
> The general problem as I put it in the book, is that "Drill can't predict
> the future" but that is precisely what is needed for a general solution.
>
> However, if the user sets a policy (treat column C as a DECIMAL, even if
> you read it as an Int or Varchar), then time travel is not necessary.
>
> My humble suggestion is to focus on the schema effort: give the user a way
> to define the resolution to the issue that is right for their data. See how
> that works out for users. Then, with that extra information, go back and
> see what other features might be useful.
>
> The proposed schema support (at least as hints, preferably as a schema
> file, full-blown with a metastore) is a much better, easier-to-understand,
> easier-to-explain solution that is familiar to anyone coming from a DB
> background.
>
>
> My suggestion: to understand the challenges and limitations, think through
> many different scenarios: look at the history of this list for some, see
> the notes in the Result Set Loader wiki and code for more. Work out how
> they could be resolved. You may see something I've missed, or you may
> realize that the problem is just not solvable in general without an
> up-front schema.
>
> More comments in the JIRA ticket.
>
> Thanks,
> - Paul
>
>
>
>     On Monday, November 5, 2018, 6:47:48 PM PST, Aman Sinha <
> [email protected]> wrote:
>
>  Hi all,
> While we continue to enhance the schema provision and metastore aspects in
> Drill, we should also explore what it means to be truly schema-less, so
> that we can better handle {semi, un}structured data and data sitting in DBs
> that store JSON documents (e.g. Mongo, MapR-DB).
>
> The blocking operators are the main hurdle to this goal. I wrote some
> thoughts on supporting schema change in the Sort operator in DRILL-6829
> <https://issues.apache.org/jira/browse/DRILL-6829>. Would welcome any
> feedback and thoughts on how to go about it going forward.
>
> Thanks,
> Aman
>
