[ 
https://issues.apache.org/jira/browse/DRILL-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677572#comment-16677572
 ] 

Paul Rogers edited comment on DRILL-6829 at 11/7/18 2:48 AM:
-------------------------------------------------------------

[~amansinha100], thanks for the explanation. A couple of observations. First, 
Drill is a relational engine, and its clients are often JDBC or ODBC. Such 
clients cannot handle a schema change. (Of course, the Drill client is more 
flexible, so it certainly can handle schema changes.)

Second, the union type has never really worked. There is no support for it in 
JDBC or ODBC. So, it would be a "Drill-client-only" solution. That may or may 
not be bad depending on Drill's target user base.

There is now overwhelming evidence that, for non-Mongo data sources, there is 
no way to achieve a reliable schema incrementally when data is delivered in 
random order.

So, maybe divide the problem into two parts: the schema mechanism for those 
users that use xDBC, and something clever like what is suggested here for those 
users of Mongo that use the Drill client and can absorb varying schemas. (Other 
DBs have this same property, including MapR DB JSON IIRC.)

My experience is with the uses and users of xDBC and similar interfaces. I 
don't know of any users of the raw Drill client, but I suppose they could 
exist...

In any event, rather than debate the topic to death, just go ahead and work out 
what happens when there are many files, scanned on many nodes, in random order, 
with each supported kind of schema change. It is very hard for any relational 
engine to make sense of the data when the schema changes randomly across runs 
(because of the random scan order). Work through those cases in detail and 
you'll go into this with your eyes wide open about what can actually be done 
in practice.
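
To make the exercise concrete, here is a minimal sketch of the kind of case 
worth working through. It is a toy model and not Drill code: the file names, 
column names, and the "first schema wins" client are all assumptions made up 
for illustration. It shows only that the schema a downstream operator (or an 
xDBC client) sees first depends entirely on scan order.

```python
from itertools import permutations

# Hypothetical per-file schemas: column name -> type observed in that file.
# None of these names come from Drill; they only illustrate the problem.
FILES = {
    "a.json": {"id": "BIGINT", "amount": "BIGINT"},
    "b.json": {"id": "BIGINT", "amount": "FLOAT8"},
    "c.json": {"id": "BIGINT", "amount": "VARCHAR", "note": "VARCHAR"},
}


def scan(order):
    """Return the sequence of schemas seen downstream for one scan order.

    A new entry is appended each time the incoming schema differs from the
    previous one, i.e. each time a schema change event would be raised.
    """
    seen = []
    current = None
    for name in order:
        schema = FILES[name]
        if schema != current:
            seen.append(schema)
            current = schema
    return seen


def first_schema(order):
    """The schema a 'first batch wins' client would bind to."""
    return scan(order)[0]


# Try every possible scan order and record the first schema each one yields.
outcomes = {
    order: tuple(sorted(first_schema(order).items()))
    for order in permutations(FILES)
}
distinct_first = set(outcomes.values())

for order, schema in outcomes.items():
    print(" -> ".join(order), "binds to", dict(schema))
```

With three files that disagree on one column, the same query can bind to 
three different result schemas depending on which file happens to be read 
first. Repeating this by hand for each supported kind of schema change, and 
for a blocking operator such as a sort, is the exercise suggested above.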

It may also be worth pointing out: even MongoDB users will appreciate a schema 
if they have wild and crazy data types, but must deliver consistent schema 
results to JDBC or ODBC. So, even if the proposal here can be made to work for 
the Drill client, there is even more value in making it work for Tableau (and 
similar) users.


> Handle schema change in ExternalSort
> ------------------------------------
>
>                 Key: DRILL-6829
>                 URL: https://issues.apache.org/jira/browse/DRILL-6829
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Aman Sinha
>            Priority: Major
>
> While we continue to enhance the schema provision and metastore aspects in 
> Drill, we also should explore what it means to be truly schema-less such that 
> we can better handle {semi, un}structured data, data sitting in DBs that 
> store JSON documents (e.g., Mongo, MapR-DB). 
>  
> The blocking operators are the main hurdles in this goal (other operators 
> also need to be smarter about this but the problem is harder for the blocking 
> operators).   This Jira is specifically about ExternalSort. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
