Re: [DISCUSSION] Does schema-free really need

2018-08-22 Thread Paul Rogers
Join the club. We all forget that from time to time - until a perfectly simple 
change blows up in some huge QA test...

- Paul

Sent from my iPhone

> On Aug 22, 2018, at 8:33 AM, Chris Cunningham  wrote:
> 
> Hmm.  Right. Somehow I forgot about the very large result sets - stupid of
> me.
> 
> On Tue, Aug 21, 2018 at 4:43 PM Paul Rogers 
> wrote:
> 
>> Hi Chris,
>> 
> 
> 
>> Later, when Drill sees the first Varchar, it can change the type from,
>> say, batch 3 onwards. But, JDBC and ODBC generally require the schema be
>> known up front, so they would have to predict the future to know that a
>> Varchar will eventually appear.
> 
> +
> 
>> Within Drill, suppose that the query includes a sort, and that memory
>> limits require spilling. The first two batches with just null will be
>> spilled to disk in one format. Third and later batches have a different
>> schema. So, the code that handles spilling must merge the two types. Not a
>> big problem, but the fix must be applied in multiple places in different
>> ways. Very difficult to test all the resulting combinations and
>> permutations.
>> 
> 
> IF Drill is doing a sort, it is likely that Drill would know the types
> before any rows were returned to JDBC/ODBC, so in this case delaying
> telling the client what types the columns are could work, right?
> Of course, many queries will avoid sorts, so this isn't really an answer.
> Maybe a partial one.
> 
> Does this make sense? Are we overlooking an alternative solution?
>> 
> 
> It does make sense.
> 
> And I can't see a reasonable alternative.  The closest I come is telling
> the client that it is a BLOB of unknown size (or making it roughly 'big').
> Then either NULL or a pointer to the data is returned - but this just
> pushes the actual determination of the type onto the user - with much less
> context to help figure out what it really is than Drill would have.
> I.e., not really good.
> 
> IF JDBC/ODBC were enhanced with a new feature that allowed the server to say
> in essence 'Hey, I know I told you what the columns were, but really, this
> column has changed to this type' in the middle of the results, that would
> be nice.
> 
> ODBC does allow the user to unbind columns and rebind them - so it is
> conceivably possible that Drill could raise a warning, the client could note
> that the warning says something like 'Column 17 changed to varchar(2000)',
> and then the client unbinds column 17 and rebinds it to a new buffer that
> would hold the actual content.
> (
> https://docs.microsoft.com/en-us/sql/odbc/reference/syntax/sqlbindcol-function?view=sql-server-2017
> )
> 
> Of course, this would require users to custom-code ODBC access to Drill to
> take advantage of this - which I suspect would be pretty uncommon.  I also
> see no reference to this ability in JDBC.
> 
> Thanks,
>> - Paul
>> 
> 
> Thank you for the quick, detailed response!
> -Chris



Re: [DISCUSSION] Does schema-free really need

2018-08-22 Thread Chris Cunningham
Hmm.  Right. Somehow I forgot about the very large result sets - stupid of
me.

On Tue, Aug 21, 2018 at 4:43 PM Paul Rogers 
wrote:

> Hi Chris,
>


> Later, when Drill sees the first Varchar, it can change the type from,
> say, batch 3 onwards. But, JDBC and ODBC generally require the schema be
> known up front, so they would have to predict the future to know that a
> Varchar will eventually appear.

+

> Within Drill, suppose that the query includes a sort, and that memory
> limits require spilling. The first two batches with just null will be
> spilled to disk in one format. Third and later batches have a different
> schema. So, the code that handles spilling must merge the two types. Not a
> big problem, but the fix must be applied in multiple places in different
> ways. Very difficult to test all the resulting combinations and
> permutations.
>

IF Drill is doing a sort, it is likely that Drill would know the types
before any rows were returned to JDBC/ODBC, so in this case delaying
telling the client what types the columns are could work, right?
Of course, many queries will avoid sorts, so this isn't really an answer.
Maybe a partial one.

Does this make sense? Are we overlooking an alternative solution?
>

It does make sense.

And I can't see a reasonable alternative.  The closest I come is telling
the client that it is a BLOB of unknown size (or making it roughly 'big').
Then either NULL or a pointer to the data is returned - but this just
pushes the actual determination of the type onto the user - with much less
context to help figure out what it really is than Drill would have.
I.e., not really good.

IF JDBC/ODBC were enhanced with a new feature that allowed the server to say
in essence 'Hey, I know I told you what the columns were, but really, this
column has changed to this type' in the middle of the results, that would
be nice.

ODBC does allow the user to unbind columns and rebind them - so it is
conceivably possible that Drill could raise a warning, the client could note
that the warning says something like 'Column 17 changed to varchar(2000)',
and then the client unbinds column 17 and rebinds it to a new buffer that
would hold the actual content.
(
https://docs.microsoft.com/en-us/sql/odbc/reference/syntax/sqlbindcol-function?view=sql-server-2017
)

Of course, this would require users to custom-code ODBC access to Drill to
take advantage of this - which I suspect would be pretty uncommon.  I also
see no reference to this ability in JDBC.

Thanks,
> - Paul
>

Thank you for the quick, detailed response!
-Chris


Re: [DISCUSSION] Does schema-free really need

2018-08-21 Thread Paul Rogers
Hi Chris,

Great suggestion. As it turns out, however, when we've looked at this in the 
past, it simply shifts the problem from one place to another. I'll walk through 
the issues because it is helpful for us all to understand what we're up against.

Consider a single file with 100K nulls, followed by 150K Varchars. Drill works 
in batches and will send a series of, say, 5 batches to the client (each with 
50K records). The client will start by seeing 2 batches of nulls. The app using 
JDBC or ODBC needs to know the type to know how to allocate memory for the 
data. But, what type?

Later, when Drill sees the first Varchar, it can change the type from, say, 
batch 3 onwards. But, JDBC and ODBC generally require the schema be known up 
front, so they would have to predict the future to know that a Varchar will 
eventually appear.
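The batch-by-batch dilemma can be sketched in a few lines of Python (a toy model only; the type names and batch shapes are illustrative, not Drill's actual internals):

```python
# Hypothetical sketch: each batch must be given a concrete type when it is
# produced, before any later data has been seen.
def declare_batch_types(batches):
    """Assign a column type to each outgoing batch as it streams.

    An all-null batch defaults to a nullable-int guess; once a real value
    appears, later batches switch type mid-stream."""
    declared = []
    current_type = "NULLABLE_INT"          # default guess for all-null data
    for batch in batches:
        non_null = [v for v in batch if v is not None]
        if non_null and current_type == "NULLABLE_INT":
            current_type = "VARCHAR"       # schema change at this batch!
        declared.append(current_type)
    return declared

# Two batches of nulls, then three of strings: the client was already told
# NULLABLE_INT for the first two batches before the Varchars arrived.
batches = [[None] * 3, [None] * 3, ["a", "b"], ["c"], ["d"]]
print(declare_batch_types(batches))
# → ['NULLABLE_INT', 'NULLABLE_INT', 'VARCHAR', 'VARCHAR', 'VARCHAR']
```

The client's problem is visible in the output: the first two batches were declared with a type that turns out to be wrong.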

Drill could buffer the result set, waiting for the schema change. Once it 
occurs, Drill could rewrite the prior data to the newly discovered type. But, 
since Drill has no idea when (or if) a schema change will occur, and data could 
be unlimited in size, Drill would have to buffer the entire result set, 
impacting latency since, in the general case, the entire result set would be 
spilled to disk, then read back, before the first records are sent to the 
client.

JDBC, at least, allows the idea of multiple result sets (used for stored 
procedures.) Drill could send a first result set with NULL as the type, then a 
second when the type changed to Varchar. This then pushes the schema change to 
the app using JDBC, but most are not set up to handle multiple result sets. 
(Indeed, what would the app do if it requires tabular data with consistent 
types?)
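The multiple-result-set idea could be modeled roughly like this (a hypothetical sketch; Drill has no such operator, and the schema labels are invented):

```python
def split_into_result_sets(batches_with_schema):
    """Group consecutive batches into 'result sets', opening a new one
    whenever the schema changes - mimicking JDBC's multiple-result-set
    mechanism as a way to surface a mid-query schema change."""
    result_sets = []
    for schema, batch in batches_with_schema:
        if not result_sets or result_sets[-1][0] != schema:
            result_sets.append((schema, []))   # schema changed: new result set
        result_sets[-1][1].extend(batch)
    return result_sets

stream = [("NULL", [None, None]), ("NULL", [None]),
          ("VARCHAR", ["a"]), ("VARCHAR", ["b", "c"])]
for schema, rows in split_into_result_sets(stream):
    print(schema, rows)
# → NULL [None, None, None]
# → VARCHAR ['a', 'b', 'c']
```

As the message notes, this only pushes the problem to the client app, which now has to stitch two result sets back into one table.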

Within Drill, suppose that the query includes a sort, and that memory limits 
require spilling. The first two batches with just null will be spilled to disk 
in one format. Third and later batches have a different schema. So, the code 
that handles spilling must merge the two types. Not a big problem, but the fix 
must be applied in multiple places in different ways. Very difficult to test 
all the resulting combinations and permutations.

Suppose that the file in question is a CSV file that can be split and we have, 
say, five concurrent readers, each reading 50K records.  Two readers read the 
nulls and shuffle these to other nodes. Three readers read the Varchars and 
shuffle the data to other nodes. The receiving nodes see, effectively, a 
random arrival of data: some start with Varchar, followed by null, some the 
other way around. Again, we could write code to merge these types in each 
receiver.

But, if we scale up the file, we may find that some receivers see many batches 
of nulls before they see a Varchar, which pushes the problem up the DAG, 
eventually to the client, as described above.

We seem to be forced into the conclusion that the only solution is to predict 
the future (so we know to use the Varchar in the reader, even when we see nulls 
from the file). The only way to do that (without access to time travel) is for 
metadata to tell us that, "data for this column, when it eventually appears, 
will be Varchar."
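The metadata-driven alternative, where a hint fixes the column type before any data is read, might look like this (a hypothetical sketch; the hint format is invented):

```python
def read_with_hint(raw_values, hint=None):
    """Reader that uses a metadata hint ('this column, when it appears,
    will be VARCHAR') to pick the final type at record 1, even when all
    early values are null - so no mid-stream schema change can occur."""
    col_type = hint or "NULLABLE_INT"     # without a hint: guess, and hope
    out = []
    for v in raw_values:
        if v is None:
            out.append(None)
        else:
            out.append(str(v) if col_type == "VARCHAR" else int(v))
    return col_type, out

# With the hint, the nulls at the start are typed correctly from record 1.
print(read_with_hint([None, None, "fred"], hint="VARCHAR"))
# → ('VARCHAR', [None, None, 'fred'])
```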

Does this make sense? Are we overlooking an alternative solution?

Thanks,
- Paul

 

On Tuesday, August 21, 2018, 3:45:55 PM PDT, Chris Cunningham 
 wrote:  
 
 Hi.  Mostly off topic, but reading about this issue has finally prompted a
response.

On Wed, Aug 15, 2018 at 5:46 PM Paul Rogers 
wrote:


> If we provide schema hints ("field x, when it appears, will be a Double"),
> then Drill need not fight with the problem that the data will be Double in
> some files, Nullable Int in others. The schema hint removes the ambiguity.
>


How hard would it be to have Drill 'fix' the nullable Int problem by just
noting that a column is nullable, with no other type information assumed, until
it encounters an actual type, and then requiring that type from then on?  If an
entire file has only nulls, then it doesn't define the field at all - only
files that have a real type define the field; then, when they are combined
later, an error is thrown only IF there are conflicting non-null types.
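The rule proposed above, no type until real data appears, then lock it and error only on a genuine conflict, can be sketched as follows (a toy model, not Drill code; Python type names stand in for SQL types):

```python
class LazyTypedColumn:
    """A column with no type until a non-null value appears; after that the
    type is locked, and a conflicting non-null type is an error."""
    def __init__(self):
        self.type = None                  # undefined until real data arrives

    def observe(self, value):
        if value is None:
            return                        # nulls never define (or break) a type
        seen = type(value).__name__
        if self.type is None:
            self.type = seen              # first real value defines the type
        elif self.type != seen:
            raise TypeError(f"column changed from {self.type} to {seen}")

col = LazyTypedColumn()
for v in [None, None, "abc"]:             # an all-null prefix defines nothing
    col.observe(v)
print(col.type)
# → str
```

As the thread goes on to explain, the catch is not the rule itself but that the type chosen for the early all-null batches has already been shipped downstream (and to JDBC/ODBC) before the defining value shows up.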

Thanks,
Chris
  

Re: [DISCUSSION] Does schema-free really need

2018-08-16 Thread Paul Rogers
Thanks for the background Jinfeng, your explanation brings us back to the topic 
Arina raised: the state and direction of the Drill project.

For several years now, Drill has said, essentially, "dabble with raw data all 
you like early in the project, but for production, ETL your data into Parquet." 
One view is that we double down on this idea. Parquet has many advantages: it 
carries its own schema. As Jinfeng noted, it is the product of an ETL process 
that cleans up and normalizes data, removing the variations that creep in 
during schema evolution.

Further, Drill already has Ted's mechanism to pre-scan the data: Drill does it 
to capture the Parquet footer and directory metadata. There is a 
not-yet-committed project to gather stats on Parquet files. Parquet allows 
Drill to be schema-free (yes, a marketing term, really schema-on-read) with a 
good, solid schema as defined in Parquet. Of course, even Parquet is subject to 
ambiguities due to newly added columns, but ETL should clean up such issues.

Extend that to other types: "schema-free" could mean, "Drill does not do 
schemas; each data source must provide its own CONSISTENT schema." Parquet does 
this today as does Hive. Allow external hints for CSV and JSON. Require that 
CSV and JSON have clean, extended relational schemas to avoid messy ambiguities 
and schema changes.

An argument can be made that many fine ETL tools exist: NiFi, Spark, MR, Hive, 
StreamSets, ... Drill need not try to solve this problem. Further, those other 
products generally allow the user to use code to handle tough cases; Drill only 
has SQL, and often SQL is simply not sufficiently expressive. Here is a 
favorite example:

{fields: [ {id: 1, name: "customer-name", type: "string", value: "fred"}, {id: 
2, name: "balance", type: "float", value: "123.45"}] }

It is unlikely that Drill can add enough UDFs to parse the above, and to do so 
efficiently. But, parsing the above in code (in Spark, say) is easy.
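For comparison, the code-based parsing really is only a few lines; a sketch in Python (standing in for Spark or any procedural language, with an invented cast table):

```python
import json

def pivot_fields(record):
    """Turn the name/type/value 'fields' array from the example into a
    flat, typed row - trivial in code, awkward in pure SQL."""
    casts = {"string": str, "float": float, "int": int}
    row = {}
    for f in record["fields"]:
        row[f["name"]] = casts[f["type"]](f["value"])
    return row

doc = '''{"fields": [
  {"id": 1, "name": "customer-name", "type": "string", "value": "fred"},
  {"id": 2, "name": "balance", "type": "float", "value": "123.45"}]}'''
print(pivot_fields(json.loads(doc)))
# → {'customer-name': 'fred', 'balance': 123.45}
```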

So, perhaps Drill becomes a Parquet-focused query engine, requiring unambiguous 
schema, defined by the data source itself (which includes, for Parquet, Drill's 
metadata and stats files.) Drill is "schema-free", but the data is required to 
provide a clean, unambiguous schema.

The only problem is that this big data niche is already occupied by an 
entrenched leader called Impala. Impala is faster than Drill (even if it is 
hard to build and is very hard to tune.) Drill can't win by saying, "we're 
almost as fast, have fewer bugs, and are easier to build." Spark didn't win by 
being almost as good as MR, it won because it was far better on many 
dimensions. Marketing 101 says that a product can never win a head-to-head 
battle with an entrenched leader.

So, what niche can Drill fill that is not already occupied? Spark is great for 
complex data transforms of the kind shown above, but is not good at all for 
interactive SQL. (Spark is not a persistent engine, it is a batch system like 
MR. As noted in an earlier message, Spark shuffles data in stages, not in 
batches like Drill and Impala.) But, Spark has a huge community; maybe someone 
will solve these issues. So, Drill as a better Spark is not appealing.

One idea that has come up from time-to-time is Drill as an open source 
Splunk-like tool. Splunk can ingest zillions of file formats using adapters, 
akin to Drill's storage and format plugins. Ingesting arbitrary files requires 
some schema cleansing on read. Drill's answer could be to add ad-hoc metadata, 
allow data-cleaning plugins, and add UDFs. That is, double down on the idea 
that Drill does read multiple formats; solve the remaining issues to do so well.

In short, the big question is, "what does Drill want to do now that it's grown 
up?" Compete with Impala (Parquet only)? Compete with Spark (better code-based 
query engine)? Compete with Splunk (query any file format)? Something else?

Whatever we do, to Weijie's point, we should do it in a way that is stable: 
today's approach to handling messy schemas can't ever work completely because 
it requires that Drill predict the future: a reader must decide on record 1 how 
to handle a field that won't actually appear until file (or block) 100. Do we 
need that? How do we maintain code (union vectors, list vectors, schema change) 
that never worked and probably never can? What is the better solution?

Thanks,
- Paul

 

On Wednesday, August 15, 2018, 11:15:10 PM PDT, Jinfeng Ni 
 wrote:  
 
The use case Weijie described seems to fall into the category of
traditional data warehouse, i.e., schemas are predefined by users and data
strictly conforms to the schema. Certainly this is one important use case, and
I agree that the schema-on-read logic in Drill's run-time is indeed a
disadvantage for such a use case, compared with other SQL query engines like
Impala/Presto.

The question we want to ask is whether that's the only use case Drill wants
to target. We probably want to hear more cases from the Drill community, before
we can decide what's the best strategy going forward.

Re: [DISCUSSION] Does schema-free really need

2018-08-16 Thread weijie tong
I think there's no truly schema-free data. For any one ad-hoc query over one
file, the schema is already defined. The schema is just discovered by Drill
rather than defined explicitly by the user.

On Thu, Aug 16, 2018 at 2:29 PM Jinfeng Ni  wrote:

> btw:  In one project that I'm currently working (an application related to
> IOT), I'm leveraging Drill's schema-on-read ability, without requiring user
> to predefine table DDL.
>
>
> On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni  wrote:
>
> > The use case Weijie described seems to fall into the category of
> > traditional data warehouse, i.e, schemas are predefined by users, data
> > strictly conforms to schema. Certainly this is one important uses, and I
> > agreed that the schema-on-read logic in Drill run-time indeed  is a
> > disadvantage for such use case, compared with other SQL query engine like
> > Impala/Presto.
> >
> > The question we want to ask is whether that's the only use case Drill
> > wants to target. We probably want to hear more cases from Drill
> community,
> > before we can decide what's the best strategy going forward.
> >
> > In examples Paul listed, why would two sets of data have different
> schema?
> > In many cases, that's because application generating the data is changed;
> > either adding/deleting one field, or modifying one existing field.  ETL
> is
> > a typical approach to clean up such data with different schema.  Drill's
> > argument, couple of years ago when the project was started, was that ETL
> is
> > too time-consuming.  it would provide great value if a query engine could
> > query directly against such datasets.
> >
> > I feel Paul's suggestion of letting user provide schema, or Drill
> > scan/probe and learn the schema seems to fall in the middle of spectrum;
> > ETL is one extreme, and Drill's current schema-on-read is the other
> > extreme.  Personally, I would prefer letting Drill scan/probe the schema,
> > as it might not be easy for user to provide schema in the case of nested
> > data (will they have to provide type information for any nested field?).
> >
> > To Weijie's comment about complexity of code of dealing schema, in theory
> > we should refactor/rewrite majority run-time operator, separating the
> logic
> > of handling schema and handling regular data flow.  That would clean up
> the
> > current mess.
> >
> > ps1:  IMHO, schema-less is purely PR word. The more appropriate word for
> > Drill would be schema-on-read.
> > 2:  I would not call it a battle between non-relational data and
> > relational engine. The extended relational model has type of
> > array/composite types, similar to what Drill has.
> >
> >
> >
> >
> >
> > On Wed, Aug 15, 2018 at 7:27 PM, weijie tong 
> > wrote:
> >
> >> @Paul I really appreciate the statement ` Effort can go into new
> features
> >> rather than fighting an unwinnable battle to use non-relational data in
> a
> >> relational engine.` .
> >>
> >> At AntFinancial( known as Alipay  an Alibaba related company ) we now
> use
> >> Drill to support most of our analysis work. Our business and data is
> >> complex enough. Our strategy is to let users design their schema first,
> >> then dump in their data , query their data later. This work flow runs
> >> fluently.  But by deep inside into the Drill's code internal and see the
> >> JIRA bugs, we will see most of the non-intuitive codes to solve the
> schema
> >> change but really no help to most of the actual use case. I think this
> >> also
> >> make the storage plugin interface not so intuitive to implement.
> >>
> >> We are sacrificing most of our work to pay for little income. Users
> really
> >> don't care about defining a schema first, but pay attention whether
> their
> >> query is fast enough. By probing the data to guess the schema and cache
> >> them , to me ,is a compromise strategy but still not clean enough. So I
> >> hope we move the mess schema solving logic out of Drill to let the code
> >> cleaner by defining the schema firstly with DDL statements. If we agree
> on
> >> this, the work should be a sub work of DRILL-6552.
> >>
> >> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers 
> >> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > I like the "schema auto-detect" idea.
> >> >
> >> > As we discussed in a prior thread, caching of schema is a nice-add on
> >> once
> >> > we have defined the schema-on-read mechanism. Maybe we first get it to
> >> work
> >> > with a user-provided schema. Then, as an enhancement, we offer to
> infer
> >> the
> >> > schema by scanning data.
> >> >
> >> > There are some ambiguities that schema inference can't resolve: in {x:
> >> > "1002"} {x: 1003}, should x be an Int or a Varchar?
> >> >
> >> > Still if Drill could provide a guess at the schema, and the user could
> >> > refine it, we'd have a very elegant solution.
> >> >
> >> >
> >> > Thanks,
> >> > - Paul
> >> >
> >> >
> >> >
> >> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> >> > ted.dunn...@gmail.com> wrote:
> >> >
> >> >  This is 

Re: [DISCUSSION] Does schema-free really need

2018-08-16 Thread Jinfeng Ni
btw:  In one project that I'm currently working (an application related to
IOT), I'm leveraging Drill's schema-on-read ability, without requiring user
to predefine table DDL.


On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni  wrote:

> The use case Weijie described seems to fall into the category of
> traditional data warehouse, i.e, schemas are predefined by users, data
> strictly conforms to schema. Certainly this is one important uses, and I
> agreed that the schema-on-read logic in Drill run-time indeed  is a
> disadvantage for such use case, compared with other SQL query engine like
> Impala/Presto.
>
> The question we want to ask is whether that's the only use case Drill
> wants to target. We probably want to hear more cases from Drill community,
> before we can decide what's the best strategy going forward.
>
> In examples Paul listed, why would two sets of data have different schema?
> In many cases, that's because application generating the data is changed;
> either adding/deleting one field, or modifying one existing field.  ETL is
> a typical approach to clean up such data with different schema.  Drill's
> argument, couple of years ago when the project was started, was that ETL is
> too time-consuming.  it would provide great value if a query engine could
> query directly against such datasets.
>
> I feel Paul's suggestion of letting user provide schema, or Drill
> scan/probe and learn the schema seems to fall in the middle of spectrum;
> ETL is one extreme, and Drill's current schema-on-read is the other
> extreme.  Personally, I would prefer letting Drill scan/probe the schema,
> as it might not be easy for user to provide schema in the case of nested
> data (will they have to provide type information for any nested field?).
>
> To Weijie's comment about complexity of code of dealing schema, in theory
> we should refactor/rewrite majority run-time operator, separating the logic
> of handling schema and handling regular data flow.  That would clean up the
> current mess.
>
> ps1:  IMHO, schema-less is purely PR word. The more appropriate word for
> Drill would be schema-on-read.
> 2:  I would not call it a battle between non-relational data and
> relational engine. The extended relational model has type of
> array/composite types, similar to what Drill has.
>
>
>
>
>
> On Wed, Aug 15, 2018 at 7:27 PM, weijie tong 
> wrote:
>
>> @Paul I really appreciate the statement ` Effort can go into new features
>> rather than fighting an unwinnable battle to use non-relational data in a
>> relational engine.` .
>>
>> At AntFinancial( known as Alipay  an Alibaba related company ) we now use
>> Drill to support most of our analysis work. Our business and data is
>> complex enough. Our strategy is to let users design their schema first,
>> then dump in their data , query their data later. This work flow runs
>> fluently.  But by deep inside into the Drill's code internal and see the
>> JIRA bugs, we will see most of the non-intuitive codes to solve the schema
>> change but really no help to most of the actual use case. I think this
>> also
>> make the storage plugin interface not so intuitive to implement.
>>
>> We are sacrificing most of our work to pay for little income. Users really
>> don't care about defining a schema first, but pay attention whether their
>> query is fast enough. By probing the data to guess the schema and cache
>> them , to me ,is a compromise strategy but still not clean enough. So I
>> hope we move the mess schema solving logic out of Drill to let the code
>> cleaner by defining the schema firstly with DDL statements. If we agree on
>> this, the work should be a sub work of DRILL-6552.
>>
>> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers 
>> wrote:
>>
>> > Hi Ted,
>> >
>> > I like the "schema auto-detect" idea.
>> >
>> > As we discussed in a prior thread, caching of schema is a nice-add on
>> once
>> > we have defined the schema-on-read mechanism. Maybe we first get it to
>> work
>> > with a user-provided schema. Then, as an enhancement, we offer to infer
>> the
>> > schema by scanning data.
>> >
>> > There are some ambiguities that schema inference can't resolve: in {x:
>> > "1002"} {x: 1003}, should x be an Int or a Varchar?
>> >
>> > Still if Drill could provide a guess at the schema, and the user could
>> > refine it, we'd have a very elegant solution.
>> >
>> >
>> > Thanks,
>> > - Paul
>> >
>> >
>> >
>> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
>> > ted.dunn...@gmail.com> wrote:
>> >
>> >  This is a bold statement.
>> >
>> > And there are variants of it that could give users nearly the same
>> > experience that we have now. For instance, if we cache discovered
>> schemas
>> > for old files and discover the schema for any new file that we see (and
>> > cache it) before actually running a query. That gives us pretty much the
>> > flexibility of schema on read without as much of the burden.
>> >
>> >
>> >
>> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong 
>> 

Re: [DISCUSSION] Does schema-free really need

2018-08-16 Thread Jinfeng Ni
The use case Weijie described seems to fall into the category of
traditional data warehouse, i.e., schemas are predefined by users and data
strictly conforms to the schema. Certainly this is one important use case, and
I agree that the schema-on-read logic in Drill's run-time is indeed a
disadvantage for such a use case, compared with other SQL query engines like
Impala/Presto.

The question we want to ask is whether that's the only use case Drill wants
to target. We probably want to hear more cases from Drill community, before
we can decide what's the best strategy going forward.

In the examples Paul listed, why would two sets of data have different schemas?
In many cases, it's because the application generating the data has changed:
either adding/deleting one field, or modifying an existing field.  ETL is
a typical approach to clean up such data with differing schemas.  Drill's
argument, a couple of years ago when the project was started, was that ETL is
too time-consuming; it would provide great value if a query engine could
query directly against such datasets.

I feel Paul's suggestion of letting the user provide a schema, or having Drill
scan/probe and learn the schema, falls in the middle of the spectrum;
ETL is one extreme, and Drill's current schema-on-read is the other
extreme.  Personally, I would prefer letting Drill scan/probe the schema,
as it might not be easy for users to provide a schema in the case of nested
data (will they have to provide type information for every nested field?).

To Weijie's comment about the complexity of the code dealing with schema, in
theory we should refactor/rewrite the majority of run-time operators, separating
the logic of handling schema from the logic of handling the regular data flow.
That would clean up the current mess.

ps1:  IMHO, schema-less is purely a PR word. The more appropriate term for
Drill would be schema-on-read.
2:  I would not call it a battle between non-relational data and a
relational engine. The extended relational model has array/composite
types, similar to what Drill has.





On Wed, Aug 15, 2018 at 7:27 PM, weijie tong 
wrote:

> @Paul I really appreciate the statement ` Effort can go into new features
> rather than fighting an unwinnable battle to use non-relational data in a
> relational engine.` .
>
> At AntFinancial( known as Alipay  an Alibaba related company ) we now use
> Drill to support most of our analysis work. Our business and data is
> complex enough. Our strategy is to let users design their schema first,
> then dump in their data , query their data later. This work flow runs
> fluently.  But by deep inside into the Drill's code internal and see the
> JIRA bugs, we will see most of the non-intuitive codes to solve the schema
> change but really no help to most of the actual use case. I think this also
> make the storage plugin interface not so intuitive to implement.
>
> We are sacrificing most of our work to pay for little income. Users really
> don't care about defining a schema first, but pay attention whether their
> query is fast enough. By probing the data to guess the schema and cache
> them , to me ,is a compromise strategy but still not clean enough. So I
> hope we move the mess schema solving logic out of Drill to let the code
> cleaner by defining the schema firstly with DDL statements. If we agree on
> this, the work should be a sub work of DRILL-6552.
>
> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers 
> wrote:
>
> > Hi Ted,
> >
> > I like the "schema auto-detect" idea.
> >
> > As we discussed in a prior thread, caching of schema is a nice-add on
> once
> > we have defined the schema-on-read mechanism. Maybe we first get it to
> work
> > with a user-provided schema. Then, as an enhancement, we offer to infer
> the
> > schema by scanning data.
> >
> > There are some ambiguities that schema inference can't resolve: in {x:
> > "1002"} {x: 1003}, should x be an Int or a Varchar?
> >
> > Still if Drill could provide a guess at the schema, and the user could
> > refine it, we'd have a very elegant solution.
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  This is a bold statement.
> >
> > And there are variants of it that could give users nearly the same
> > experience that we have now. For instance, if we cache discovered schemas
> > for old files and discover the schema for any new file that we see (and
> > cache it) before actually running a query. That gives us pretty much the
> > flexibility of schema on read without as much of the burden.
> >
> >
> >
> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong 
> > wrote:
> >
> > > Hi all:
> > >  Hope the statement does not seem too rash to you.
> > >  Drill claims to be a schema-free distributed SQL engine. It has put in
> > > a lot of work to make the execution engine support JSON-file-like
> > > storage formats. This makes it easier to introduce bugs and leaves the
> > > code logic ugly. I
> > > wonder whether we should still insist on this, since we are 

Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread weijie tong
@Paul I really appreciate the statement ` Effort can go into new features
rather than fighting an unwinnable battle to use non-relational data in a
relational engine.` .

At AntFinancial (known as Alipay, an Alibaba-related company) we now use
Drill to support most of our analysis work. Our business and data are
complex enough. Our strategy is to let users design their schema first,
then dump in their data, and query it later. This workflow runs
smoothly.  But digging deep into Drill's code internals and looking at the
JIRA bugs, we see a lot of non-intuitive code written to handle schema
change that is really no help to most of the actual use cases. I think this
also makes the storage plugin interface not so intuitive to implement.

We are sacrificing much of our work for little return. Users really
don't care about defining a schema first, but they do pay attention to whether
their query is fast enough. Probing the data to guess the schema and caching
it is, to me, a compromise strategy, but still not clean enough. So I
hope we move the messy schema-solving logic out of Drill and make the code
cleaner by defining the schema first with DDL statements. If we agree on
this, the work should be a sub-task of DRILL-6552.



Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Paul Rogers
Hi Ted,

I like the "schema auto-detect" idea.

As we discussed in a prior thread, caching of schemas is a nice add-on once we 
have defined the schema-on-read mechanism. Maybe we first get it to work with a 
user-provided schema. Then, as an enhancement, we offer to infer the schema by 
scanning data.

There are some ambiguities that schema inference can't resolve: in {x: "1002"} 
{x: 1003}, should x be an Int or a Varchar?
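As a rough sketch (not Drill's actual inference code), a naive record-scanning inferencer hits exactly this conflict:

```python
import json

def infer_type(value):
    """Map a JSON value to a coarse SQL-ish type name."""
    if isinstance(value, bool):   # bool first: in Python, bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "Int"
    if isinstance(value, float):
        return "Double"
    if isinstance(value, str):
        return "Varchar"
    return "Unknown"

def infer_schema(lines):
    """Scan newline-delimited JSON and collect every type observed per column."""
    schema = {}
    for line in lines:
        for col, val in json.loads(line).items():
            schema.setdefault(col, set()).add(infer_type(val))
    return schema

records = ['{"x": "1002"}', '{"x": 1003}']
print({c: sorted(t) for c, t in infer_schema(records).items()})
# {'x': ['Int', 'Varchar']} -- two candidate types; only a user hint can decide
```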

Still if Drill could provide a guess at the schema, and the user could refine 
it, we'd have a very elegant solution.


Thanks,
- Paul

 


Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Charles Givre
I like where Ted is going with this. I also like Paul's idea of giving 
Drill hints about the schema.
— C




Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Paul Rogers
Hi Weijie,

Thanks for raising this topic. I think you've got a great suggestion.

My two cents: there is no harm in reading all manner of ugly data. But, rather 
than try to process the mess throughout Drill (as we do today with schema 
changes, just-in-time code generation, union vectors and the rest), simply 
require that the user provide disambiguation rules prior to running the query. 
Require that the reader convert input data into a clean relational format.

That is, rather than try to be schema-less, just be schema-on-read, which, as 
you point out, is what most users probably want anyway.


JSON is my favorite example of the current issues. JSON is a general-purpose 
data format; SQL is relational. IMHO, if JDBC and ODBC are Drill's primary 
APIs, data must fit those APIs, and that means data must be relational. So, the 
goal when using JSON is to map the JSON structure into a relational structure.

Since Drill now supports arrays of nested tuples via implicit joins, 
"relational format" here means the extended relational format. Perhaps a 
requirement is that arrays MUST be flattened to simple tuples prior to 
returning results to JDBC or ODBC. 


If we provide schema hints ("field x, when it appears, will be a Double"), then 
Drill need not fight with the problem that the data will be Double in some 
files, Nullable Int in others. The schema hint removes the ambiguity.

The result is that Drill can read any type of data. But, as you say,  the Drill 
internals are simpler, cleaner and faster. There is no ambiguity about types 
that Drill kinda-sorta supports but that ODBC/JDBC don't support. Effort can go 
into new features rather than fighting an unwinnable battle to use 
non-relational data in a relational engine.


In short, as part of the metadata API work, perhaps define how metadata can 
simplify Drill internals. Ensure that users can create simple metadata hint 
files for ad-hoc use, maybe as an extension to view files. Then, push the 
problem of messy schemas into a conversion layer in the reader and out of the 
rest of the Drill execution engine.

The reason I keep jumping into these metadata discussions is that I encountered 
the ambiguity problems first hand in the "result set loader" work. There are 
ambiguities in JSON that simply cannot be resolved until Drill can predict the 
future (by having access to metadata.)

Consider the query "SELECT a, x FROM foo.json". Column x is missing from the 
first 1000 records. In the 1001st record, it shows up and is a Double. 
Previously, Drill would guess Nullable Int on the first column, then blow up 
when x appears as a Double. A revision was to postpone picking a type as late 
as possible so that, if column x does show up in the first batch, ambiguity is 
avoided. But, this trick does not work if the column shows up in the second or 
later batch. We need a column in the first batch, and we will guess Nullable 
Int. Depending on the query, this will result in a schema change elsewhere in 
the DAG or in the client.
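The timing problem can be sketched like this (the batch contents and the "Nullable Int" guess are illustrative assumptions, not Drill's actual vectors):

```python
def guess_batch_type(batch, col):
    """Pick a type for `col` from a single batch of records."""
    for rec in batch:
        if col in rec:
            return "Double" if isinstance(rec[col], float) else "Int"
    return "Nullable Int"   # the forced guess when the column never appears

batches = [
    [{"a": n} for n in range(1000)],   # x absent for the whole first batch
    [{"a": 1000, "x": 3.14}],          # x finally appears -- as a Double
]
types = [guess_batch_type(b, "x") for b in batches]
print(types)   # ['Nullable Int', 'Double'] -- a schema change mid-query
```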

Similarly, in the new version, JSON can handle data of the form {x: 10.1} {x: 
10} because it can convert an Int to a Double. Sadly, {x: 10} {x: 10.1} still 
fails because Drill can't convert a Double to an Int. Sigh...
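The asymmetry follows from a first-value-fixes-the-type rule with widening-only conversion, sketched here under assumed semantics:

```python
# Assumed rule: the first value fixes the column type; later values must
# convert to it. Int widens to Double; Double never narrows to Int.
CONVERTIBLE = {("Int", "Double")}   # (from, to) pairs that are allowed

def type_of(v):
    return "Double" if isinstance(v, float) else "Int"

def load_column(values):
    col_type = type_of(values[0])
    for v in values[1:]:
        t = type_of(v)
        if t != col_type and (t, col_type) not in CONVERTIBLE:
            raise TypeError(f"cannot convert {t} to {col_type}")
    return col_type

print(load_column([10.1, 10]))   # Double -- the later Int converts up
try:
    load_column([10, 10.1])      # first value fixes Int; Double cannot narrow
except TypeError as e:
    print(e)                     # cannot convert Double to Int
```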


To implement these tricks, the revised reader framework accepts a schema and 
automagically does the work of converting data from the input type to the 
defined type. Once this code is into master (we are still about three PRs 
away), it can be combined with the metadata system to achieve the "apply schema 
on read" idea discussed above. We then would not need the horrible hacks like 
those just discussed.


Thanks,
- Paul

 


Re: [DISCUSSION] Does schema-free really need

2018-08-15 Thread Ted Dunning
This is a bold statement.

And there are variants of it that could give users nearly the same
experience that we have now. For instance, if we cache discovered schemas
for old files and discover the schema for any new file that we see (and
cache it) before actually running a query. That gives us pretty much the
flexibility of schema on read without as much of the burden.



On Wed, Aug 15, 2018 at 5:02 PM weijie tong  wrote:

> Hi all:
>   I hope this statement does not seem too rash to you.
>   Drill claims to be a schema-free distributed SQL engine. It takes a lot
> of work to make the execution engine support JSON-like storage formats,
> which makes bugs easier to introduce and the code logic ugly. I wonder
> whether we should still insist on this, since we are designing the
> metadata system with DRILL-6552.
>   Traditionally, people are used to designing their table schemas before
> firing a SQL query, and I don't think schema-freedom saves them much time.
> Systems like Spark are popular, and not because they lack schema
> declarations. I think we should be brave enough to decide whether to keep
> a feature that seems not so important, but a burden.
> Thanks.
>