Hi Aman,

Thanks much for the write-up. My two cents, FWIW.

As the history of this list has shown, I've fought with the schema change issue 
multiple times: in sort, in JSON, in the row set loader framework, and in 
writing the "Data Engineering" chapter in the Learning Drill book.

What I have come to realize is that there is no general solution to the schema 
change problem. Yes, there are clever things to do in special cases. But the 
general problem is unsolvable.

Look at the open PR for the projection framework. There is an implementation of 
a "schema smoother." It tries really hard, but it highlights the inherent 
limitations of such an effort.

The key reason is that, to do a good job, rows processed now must know the 
types of rows seen 100 million rows from now. Since Drill does not have a time 
machine, that is not possible.

The easiest way to visualize this is with a single fragment that reads two 
files. File A has 100K rows with column C as a Varchar. File B has 100K rows 
with column C as an Int. There is no sort, so all rows are returned directly to 
the client as, say, four 50K batches.

The client will encounter a schema with C as Varchar. Later, it will see C as Int. 
But, since the client already told the JDBC consumer that the type is Varchar, 
the JDBC client is stuck. It could convert the Int to Varchar behind the scenes.

Now, run the query again. The order in which Drill reads files is random. 
The second time, the client sees C as an Int. Now, JDBC must convert the later 
Varchar values to Int. That works if the Varchar values are numbers, but not if 
the Ints should have been Varchar.
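
To make the two-file scenario concrete, here is a minimal sketch in Python. It 
is only a toy model, not Drill or JDBC code: the names coerce and run_query are 
hypothetical, each file is shrunk to two values, and the "client" is assumed to 
lock the column type from the first value it sees.

```python
# Toy model (not Drill code): the client locks in the type of column C from
# the first file it reads and must coerce every later value to that type.

def coerce(value, locked_type):
    """Convert one value to the type the client already announced."""
    if locked_type is str:
        return str(value)   # Int -> Varchar always succeeds
    return int(value)       # Varchar -> Int raises ValueError on non-numeric text


def run_query(files):
    """Return all values of column C, typed by whichever file is read first."""
    locked_type = type(files[0][0])
    return [coerce(v, locked_type) for rows in files for v in rows]


file_a = ["10", "abc"]  # column C as Varchar
file_b = [1, 2]         # column C as Int

# Varchar seen first: the later Ints are quietly turned into strings.
print(run_query([file_a, file_b]))

# Int seen first: "abc" cannot be converted, so the same query now fails.
try:
    run_query([file_b, file_a])
except ValueError as err:
    print("conversion failed:", err)
```

The same data gives a different outcome depending only on read order, which is 
the part no amount of cleverness inside the reader can fix.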

The general problem, as I put it in the book, is that "Drill can't predict the 
future," but that is precisely what is needed for a general solution.

However, if the user sets a policy (treat column C as a DECIMAL, even if you 
read it as an Int or Varchar), then time travel is not necessary.
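
As a rough illustration of why a policy removes the need for time travel, here 
is a small Python sketch. The policy dictionary and the functions apply_policy 
and read_with_policy are hypothetical names invented for this example; they do 
not correspond to any Drill API or to the proposed schema syntax.

```python
from decimal import Decimal

# Hypothetical user-supplied policy: column name -> type to use for every row,
# no matter what type the reader happens to find in a given file.
policy = {"C": Decimal}


def apply_policy(column, value):
    """Convert a value to the declared type, or pass it through if undeclared."""
    target = policy.get(column)
    if target is None:
        return value
    return target(str(value))  # works for ints and for numeric strings


def read_with_policy(files):
    """Return all values of column C, converted according to the policy."""
    return [apply_policy("C", v) for rows in files for v in rows]


file_a = ["10", "2.5"]  # column C read as Varchar
file_b = [1, 2]         # column C read as Int

first = read_with_policy([file_a, file_b])
second = read_with_policy([file_b, file_a])

# Every value is a Decimal in both runs, regardless of which file came first.
print(first)
print(second)
```

A non-numeric Varchar would still fail to convert in this sketch, but it would 
fail the same way on every run, independent of read order.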

My humble suggestion is to focus on the schema effort: give the user a way to 
define the resolution to the issue that is right for their data. See how that 
works out for users. Then, with that extra information, go back and see what 
other features might be useful.

The proposed schema support (at least as hints, preferably as a schema file, 
or full-blown with a metastore) is a much better, easier to understand, easier 
to explain solution that is familiar to anyone coming from a DB background.


My suggestion: to understand the challenges and limitations, think through many 
different scenarios: look at the history of this list for some, see the notes 
in the Result Set Loader wiki and code for more. Work out how they could be 
resolved. You may see something I've missed, or you may realize that the 
problem is just not solvable in general without an up-front schema.

More comments in the JIRA ticket.

Thanks,
- Paul

 

    On Monday, November 5, 2018, 6:47:48 PM PST, Aman Sinha 
<amansi...@gmail.com> wrote:  
 
 Hi all,
While we continue to enhance the schema provision and metastore aspects in
Drill, we also should explore what it means to be truly schema-less such
that we can better handle {semi, un}structured data, data sitting in DBs
that store JSON documents (e.g Mongo, MapR-DB).

The blocking operators are the main hurdles in this goal. I wrote some
thoughts on supporting Schema change in a Sort operator in DRILL-6829
<https://issues.apache.org/jira/browse/DRILL-6829> .  Would welcome any
feedback and see how to go about it going forward.

Thanks,
Aman
  
