"Death of Schema-on-Read"

2018-04-01 Thread Paul Rogers
...is the name of a provocative blog post [1].
Quote: "Once found, diverse data sets are very hard to integrate, since the 
data typically contains no documentation on the semantics of its attributes. 
... The rule of thumb is that data scientists spend 70% of their time finding, 
interpreting, and cleaning data, and only 30% actually analyzing it. Schema on 
read offers no help in these tasks, because data gives up none of its secrets 
until actually read, and even when read has no documentation beyond attribute 
names, which may be inscrutable, vacuous, or even misleading."
This quote relates to a discussion Salim & I have been having: that Drill 
struggles to extract a usable schema directly from anything but the cleanest of 
data sets, leading to unwanted and unexpected schema change exceptions due to 
inherent ambiguities in how to interpret the data. (E.g. in JSON, if we see 
nothing but nulls, what type is the null?)
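A made-up example of the ambiguity: suppose a reader sees only
{a: null} {a: null} {a: null}
and a later file finally reveals {a: "widget"}. The reader must pick some type 
for a long before the VarChar value shows up, and whatever it guesses may 
conflict with what comes next.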
A possible answer is further down in the post: "At Comcast, for instance, Kafka 
topics are associated with Apache Avro schemas that include non-trivial 
documentation on every attribute and use common subschemas to capture commonly 
used data... 'Schema on read' using Avro files thus includes rich documentation 
and common structures and naming conventions."
Food for thought.
Thanks,
- Paul
[1] 
https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328






Re: "Death of Schema-on-Read"

2018-04-02 Thread Aman Sinha
It is certainly a huge advantage to have data type information embedded in
the data, such as that provided by the Avro format.  In the past, XML also had
schemas and DTDs.
One may argue, though, that XML died under the weight of the extra structure
added to it, and that people just gravitated towards JSON.
In that respect, Avro provides a good middle ground.  A similar approach is
taken by the MapR-DB JSON database, which carries data type information for
the fields of a JSON document.

That said, we still have to (a) deal with JSON data, which is one of the most
prevalent formats in the big data space, and (b) handle schema changes even
with Avro-like formats.
Comcast's viewpoint suggests a one-size-fits-all approach, but there are
counter-points to that, for instance as mentioned here [1].  It would be very
useful to have a survey of other users/companies that are dealing with schema
evolution issues to get a better understanding of whether Comcast's experience
is a broader trend.

From Drill's perspective, we have in the past discussed the need for 2
modes:
 - A fixed schema mode which operates in a manner similar to the RDBMSs.
This is needed not just to resolve ambiguities but also for performance.
Why treat a column as nullable when the data is non-nullable?
 - A variable schema mode, which is what Drill does today...but this part needs
to be enhanced to be *declarative* such that ambiguities are removed.  A
user may choose not to create any declaration, in which case Drill would
default to a certain documented set of rules that do type conversions.


[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/


-Aman


On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers 
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
>
>
>
>
>


Re: "Death of Schema-on-Read"

2018-04-02 Thread Paul Rogers
quot;
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
>
>
>
>
>
  

Re: "Death of Schema-on-Read"

2018-04-02 Thread Ted Dunning
On Mon, Apr 2, 2018 at 10:54 AM, Aman Sinha  wrote:

> ...
> Although, one may argue that XML died because of the weight of the extra
> structure added to it and people just gravitated towards JSON.
>

My argument would be that it died because it couldn't distinguish well
between an element and a list of elements of length 1.

JSON avoids that kind of problem.


> In that respect,  Avro provides a good middle ground.   A similar approach
> is taken by MapR-DB  JSON database which has data type information for the
> fields of a JSON document.
>

True that.

But another middle-ground representation is JSON with a side file describing
type information derived when the file was previously read.

> That said, we still have to (a) deal with JSON data which is one of the
> most prevalent format in big data space and (b) still have to handle schema
> changes even with Avro-like formats.
>

This is a big deal.

To some degree, a lot of this can be handled by two simple mechanisms:

1) record what we learn when scanning a file.  That is, if a column is null
(or missing) until the final record, when it turns out to be a float, remember
that. This allows subsequent queries to look further ahead when deciding what
is happening in a query.

2) allow queries to be restarted when it is discovered that type
assumptions are untenable. Currently, "schema change" is what we call the
situation where we can't really recover from mistaken assumptions that were
derived incrementally as we scanned the data. If we had (1), then the
information obtained from the reading done up to the point where the schema
change was noted could be preserved. That means we could restart the query
with knowledge of the data types that might later cause a schema change
exception. In many cases, that would allow us to avoid that exception entirely
on the second pass through the data.

In most cases, restarts would not be necessary. I know this because schema
change exceptions are currently pretty rare and they would be even more
rare if we learned about file schemas from experience. Even when a new file
is seen for the first time, schema change wouldn't happen. As such, the
amortized cost of restarts would be very low. On the other hand, the
advantage of such a mechanism would be that more queries would succeed and
users would be happier.
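
To make (1) concrete, the cached information could be as small as a per-file
record written out after a scan. A rough sketch (the file name and layout here
are invented, just to illustrate):

{ "file": "logs/2018-03-28.json",
  "columns": { "x": "FLOAT8", "b": "VARCHAR (nullable)" } }

A later query over the same file could consult this record before choosing
vector types, and a restarted query could consult whatever the failed attempt
had written.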


> ...
> From Drill's perspective, we have in the past discussed the need for 2
> modes:
>  - A fixed schema mode which operates in a manner similar to the RDBMSs.
> This is needed not just to resolve ambiguities but also for performance.
> Why treat a column as nullable when data is non-nullable ?
>  - A variable schema mode which is what it does today...but this part needs
> to be enhanced to be *'declarative' such that ambiguities are removed.*   A
> user may choose not to create any declaration, in which case Drill would
> default to certain documented set of rules that do type conversions.
>

The restart suggestion above avoids the need for modes but also allows the
performance of the fixed schema mode in most cases.


Re: "Death of Schema-on-Read"

2018-04-03 Thread Parth Chandra
This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and dictionary containing the
semantics) or b) spent time "finding, interpreting, and cleaning data" to
discover the data schema and dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers 
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
>
>
>
>
>


Re: "Death of Schema-on-Read"

2018-04-03 Thread Paul Rogers
Here is another way to think about it. Today, Drill is "schema-forbidden": even 
if I know the schema, I can't communicate that to Drill; Drill must figure it 
out on its own, making the same mistakes every time on ambiguous schemas.

Contrast this with Hive, which is "schema-required": I must tell Hive the 
schema even in cases where Hive could easily figure it out on its own. 

Perhaps Drill can occupy a middle ground: "schema-optional": Drill will figure 
out the schema as best it can, but will accept suggestions (hints) which the 
user can provide when it is the most efficient path to get work done.

Once hints are supported, then a system such as Ted's can be built on top:
retry the query, using a bit of machine learning to infer the schema. Or, get
the schema from Hive. Or from Comcast's Avro files. Or whatever.

The point is, if the user knows the schema, and is willing to resolve the 
ambiguities for us, what value do we provide by refusing to accept those hints?

On the other hand, since the schema is optional, Drill can continue to be used
for Parth's schema exploration use case.

Still, after doing a bit of exploration, the user needs to move on to getting
work done based on that exploration. This seems to be the case at Comcast:
they've moved past exploration into production. But Drill has limited means to
use the results of exploration to resolve schema ambiguities on future queries.
(Views are a partial answer, but have gaps.)

Ted makes a good point: Drill works most of the time already. The suggestion is
that users might prefer that Drill work not just most of the time, but rather
all of the time, so users can reliably get their work done with no surprises,
even with less-than-perfect schemas. If providing a few schema hints is the
cost to pay for that reliability, shouldn't the user be in a position to choose
to make that tradeoff?

Thanks,
- Paul

 

On Tuesday, April 3, 2018, 2:32:05 AM PDT, Parth Chandra 
 wrote:  
 
 This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and dictionary containing the
semantics) or b) spent time "finding, interpreting, and cleaning data" to
discover the data schema and dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers 
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
>
>
>
>
>
  

Re: "Death of Schema-on-Read"

2018-04-03 Thread Ted Dunning
I don't see why you say that Drill is schema-forbidden.

The Parquet reader, for instance, makes strong use of the implied schema to
facilitate reading of typed data.

Likewise, the vectorized internal format is strongly typed and, as such,
uses schema information.

Views are another way to communicate schema information.

It is true that you can't, say, view comments on fields from the command
line. But I don't understand saying "schema-forbidden".


On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
wrote:

> Here is another way to think about it. Today, Drill is "schema-forbidden":
> even if I know the schema, I can't communicate that to Drill; Drill must
> figure it out on its own, making the same mistakes every time on ambiguous
> schemas.
>


Re: "Death of Schema-on-Read"

2018-04-03 Thread Paul Rogers
Subtle point. I can provide schema with Parquet, as you note. (Actually, for 
Parquet, Drill is schema-required: I can't not provide a schema due to the 
nature of Parquet...)

But, I can't provide a schema for JSON, CSV, etc. The point is, Drill forbids 
the user from providing a schema; only the file format itself can provide the 
schema (or not, in the case of JSON). This is the very heart of the problem.

The root cause of our schema change exception is that vectors are, indeed, 
strongly typed. But, file columns are not. Here is my favorite:

{x: 10} {x: 10.1}

Blam! The query fails because the vector is chosen as BigInt, then we discover
it really should have been Float8. (If the answer is "go back and rebuild the
vector with the new type," consider the case where 100K records separate the
two values above, so that the first batch is long gone by the time we see the
offending record.) If only I could tell Drill to use Float8 (or Decimal) up
front...
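
(If memory serves, the closest thing today is a pair of blunt, session-wide
JSON options, which trade type fidelity away rather than declare it:

ALTER SESSION SET `store.json.read_numbers_as_double` = true;
ALTER SESSION SET `store.json.all_text_mode` = true;

The first reads every JSON number as Float8, which would paper over the example
above; the second reads every scalar as VarChar. Neither lets me say anything
about one particular column.)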

Views won't help here because the failure occurs before a view can kick in. 
However, presumably, I could write a view to handle a different classic case:

myDir /
|- File 1: {a: 10, b: "foo"}
|- File 2: {a: 20}

With query: SELECT a, b FROM myDir

For File 2, Drill will guess that b is a Nullable Int, but it is really 
VarChar. I think I could write clever SQL that says:

If b is of type Nullable Int, return NULL cast to nullable VarChar, else return 
b
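
Something like the sketch below, though I'd have to check how typeof() reports
the all-nulls case, so treat this as untested pseudo-SQL rather than a working
view (dfs.tmp standing in for any writable workspace):

CREATE VIEW dfs.tmp.myDirView AS
SELECT a,
       CASE WHEN typeof(b) IN ('INT', 'NULL') THEN CAST(NULL AS VARCHAR)
            ELSE CAST(b AS VARCHAR) END AS b
FROM dfs.`myDir`;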

The irony is that I must write procedural code to declare a static attribute
of the data. Yet SQL is otherwise declarative: I state what I want, not how to
implement it.

Life would be so much easier if I could just say, "trust me, when you read 
column b, it is a VarChar."

Thanks,
- Paul

 

On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning 
 wrote:  
 
 I don't see why you say that Drill is schema-forbidden.

The Parquet reader, for instance, makes strong use of the implied schema to
facilitate reading of typed data.

Likewise, the vectorized internal format is strongly typed and, as such,
uses schema information.

Views are another way to communicate schema information.

It is true that you can't, say, view comments on fields from the command
line. But I don't understand saying "schema-forbidden".


On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
wrote:

> Here is another way to think about it. Today, Drill is "schema-forbidden":
> even if I know the schema, I can't communicate that to Drill; Drill must
> figure it out on its own, making the same mistakes every time on ambiguous
> schemas.
>
  

Re: "Death of Schema-on-Read"

2018-04-03 Thread Ted Dunning
Well, the restart strategy still works for your examples. And you only pay
once. From then on, you look at the cached type information and use an
upper-bound data type as you read the data. Since it works to read the values
in the right order, it is obviously possible to push down typing information
even into the JSON reader.



On Tue, Apr 3, 2018, 21:42 Paul Rogers  wrote:

> Subtle point. I can provide schema with Parquet, as you note. (Actually,
> for Parquet, Drill is schema-required: I can't not provide a schema due to
> the nature of Parquet...)
>
> But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> forbids the user from providing a schema; only the file format itself can
> provide the schema (or not, in the case of JSON). This is the very heart of
> the problem.
>
> The root cause of our schema change exception is that vectors are, indeed,
> strongly typed. But, file columns are not. Here is my favorite:
>
> {x: 10} {x: 10.1}
>
> Blam! Query fails because the vector is chosen as BigInt, then we discover
> it really should have been Float8. (If the answer is: go back and rebuild
> the vector with the new type, consider the case that 100K records separate
> the two above so that the first batch is long gone by the time we see the
> offending record. If only I could tell Drill to use Float8 (or Decimal) up
> front...
>
> Views won't help here because the failure occurs before a view can kick
> in. However, presumably, I could write a view to handle a different classic
> case:
>
> myDir /
> |- File 1: {a: 10, b: "foo"}
> |- File 2: {a: 20}
>
> With query: SELECT a, b FROM myDir
>
> For File 2, Drill will guess that b is a Nullable Int, but it is really
> VarChar. I think I could write clever SQL that says:
>
> If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> return b
>
> The irony is that I must to write procedural code to declare a static
> attribute of the data. Yet SQL is otherwise declarative: I state what I
> want, not how to implement it.
>
> Life would be so much easier if I could just say, "trust me, when you read
> column b, it is a VarChar."
>
> Thanks,
> - Paul
>
>
>
> On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  I don't see why you say that Drill is schema-forbidden.
>
> The Parquet reader, for instance, makes strong use of the implied schema to
> facilitate reading of typed data.
>
> Likewise, the vectorized internal format is strongly typed and, as such,
> uses schema information.
>
> Views are another way to communicate schema information.
>
> It is true that you can't, say, view comments on fields from the command
> line. But I don't understand saying "schema-forbidden".
>
>
> On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
> wrote:
>
> > Here is another way to think about it. Today, Drill is
> "schema-forbidden":
> > even if I know the schema, I can't communicate that to Drill; Drill must
> > figure it out on its own, making the same mistakes every time on
> ambiguous
> > schemas.
> >
>


Re: "Death of Schema-on-Read"

2018-04-04 Thread Jinfeng Ni
I feel it's probably premature to call it the "death of schema-on-read" just
based on one application case. For one product I have been working on
recently, one use case is an IoT-related application where data is sent
from a variety of small devices (sensors, cameras, etc.). It would be a hard
requirement to pre-define a schema upfront for each device before writing data
into the system. Further, the value of data is likely to decrease
significantly over time; data from within hours/days is way more important than
that from weeks/months ago. It's unimaginable to wait for weeks to run a data
cleaning/preparation job before users could query such data. In other words,
for applications with flexibility and time-sensitivity requirements,
'schema-on-read' provides a huge benefit compared with the traditional
ETL-then-query approach.

Drill's schema-on-read is actually trying to solve a rather hard problem,
in that we deal with not only relational types but also nested types. In
that sense, Drill is walking in uncharted territory where not many
others are doing similar things.  Dealing with undocumented/unstructured
data is a big challenge. Although Drill's solution is not perfect, IMHO
it's still a big step toward solving such a problem.

With that said, I agree with the points people raised earlier. In addition to
"schema-on-read", Drill has to do a better job of handling the traditional
cases where the schema is known beforehand, by introducing a meta-store/catalog
or by allowing users to declare a schema upfront (I probably would not call
Drill "schema-forbidden"). The restart strategy also seems interesting for
handling failures caused by a missing schema or schema change.




On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning  wrote:

> Well, the restart strategy still works for your examples. And you only pay
> once. From them you look at the cached type information and used an upper
> bound data type as you read the data. Since it works to read the values in
> the right order, it is obviously possible to push down typing information
> even into the json reader.
>
>
>
> On Tue, Apr 3, 2018, 21:42 Paul Rogers  wrote:
>
> > Subtle point. I can provide schema with Parquet, as you note. (Actually,
> > for Parquet, Drill is schema-required: I can't not provide a schema due
> to
> > the nature of Parquet...)
> >
> > But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> > forbids the user from providing a schema; only the file format itself can
> > provide the schema (or not, in the case of JSON). This is the very heart
> of
> > the problem.
> >
> > The root cause of our schema change exception is that vectors are,
> indeed,
> > strongly typed. But, file columns are not. Here is my favorite:
> >
> > {x: 10} {x: 10.1}
> >
> > Blam! Query fails because the vector is chosen as BigInt, then we
> discover
> > it really should have been Float8. (If the answer is: go back and rebuild
> > the vector with the new type, consider the case that 100K records
> separate
> > the two above so that the first batch is long gone by the time we see the
> > offending record. If only I could tell Drill to use Float8 (or Decimal)
> up
> > front...
> >
> > Views won't help here because the failure occurs before a view can kick
> > in. However, presumably, I could write a view to handle a different
> classic
> > case:
> >
> > myDir /
> > |- File 1: {a: 10, b: "foo"}
> > |- File 2: {a: 20}
> >
> > With query: SELECT a, b FROM myDir
> >
> > For File 2, Drill will guess that b is a Nullable Int, but it is really
> > VarChar. I think I could write clever SQL that says:
> >
> > If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> > return b
> >
> > The irony is that I must to write procedural code to declare a static
> > attribute of the data. Yet SQL is otherwise declarative: I state what I
> > want, not how to implement it.
> >
> > Life would be so much easier if I could just say, "trust me, when you
> read
> > column b, it is a VarChar."
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  I don't see why you say that Drill is schema-forbidden.
> >
> > The Parquet reader, for instance, makes strong use of the implied schema
> to
> > facilitate reading of typed data.
> >
> > Likewise, the vectorized internal format is strongly typed and, as such,
> > uses schema information.
> >
> > Views are another way to communicate schema information.
> >
> > It is true that you can't, say, view comments on fields from the command
> > line. But I don't understand saying "schema-forbidden".
> >
> >
> > On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers 
> > wrote:
> >
> > > Here is another way to think about it. Today, Drill is
> > "schema-forbidden":
> > > even if I know the schema, I can't communicate that to Drill; Drill
> must
> > > figure it out on its own, making the same mistakes every time on
> > ambiguous
> > > schemas.
> > >
> >
>


Re: "Death of Schema-on-Read"

2018-04-05 Thread Joel Pfaff
Hello,

A lot of versioning problems arise when trying to share data through Kafka
between multiple applications with different lifecycles and maintainers,
since by default a single message in Kafka is just a blob.
One way to solve that is to agree on a single serialization format that is
friendly to record-per-record storage (like Avro), and, in order not to have
to serialize the schema in use with every message, to just reference an
entry in the Avro Schema Registry (this flow is described here:
https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
).
On top of the schema registry, specific client libs allow validating the
message structure prior to injection into Kafka.
So while Comcast mentions the use of an Avro schema to describe its
feeds, it does not directly mention the use of Avro files (to describe
the schema).

Coming back to Drill, I think it makes a nice effort to provide
similar features on top of loosely typed datasets. It could probably try to
do better in some cases (handling unknown types as `Unknown` is probably
better than `Nullable Int`), but its ability to dynamically merge data with
different (but still compatible) schemas is really nice.

When using untyped file formats (JSON, CSV), Drill does its best, and while
it is not perfect, it is already pretty good.
When relying on typed formats like Parquet/ORC/Avro, a lot of problems are
solved because each file describes its columns (names/types), even allowing
for complex structures.
But the usage of CSV/JSON is still problematic. I like the idea of having
an optional way to describe the expected types somewhere (either in a
central meta-store, or in a structured file next to the dataset).
That would make the usage of CTAS much safer/easier (sometimes we have to
use Spark to generate the Parquet files because of schema/type problems).

Independently of the meta-store, it is a bit annoying that Drill would
need to `discover` the columns and types at every scan through trial and
error, and cannot benefit from previous queries.
Extending the `Analyze Table` command so that meta-data could be generated
from a JSON/CSV file/folder could improve this situation without introducing
a costly/painful ETL process.
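Hypothetically, something along these lines (the syntax is made up here, with
the schema part being the extension):

ANALYZE TABLE dfs.`logs/json/2018` COMPUTE STATISTICS AND SCHEMA;

where the discovered column names/types would be persisted next to the stats so
that later scans could start from them instead of rediscovering everything.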

Regards, Joel


On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni  wrote:

> I feel it's probably premature to cal it "death of schema-on-read" just
> based on one application case. For one product I have been working on
> recently, one use case is for IOT related application where data is sent
> from a variety of small devices (sensors, camera, etc). It would be a hard
> requirement to pre-define schema upfront for each device, before write data
> into the system. Further, the value of data is likely to decrease
> significantly over time; data within hours/days is way more important than
> that of weeks/months ago. It's unimaginable to wait for weeks to run data
> clean/preparation job, before user could query such data. In other words,
> for application with requirements of  flexibility and time-sensitivity,
> 'schema-on-read' provides a huge benefit, compared with traditional
> ETL-then-query approach.
>
> Drill's schema-on-read is actually trying to solve a rather hard problem,
> in that we deal with not only relational type, but also nested type. In
> that sense, Drill is walking in an uncharted territory where not many
> others are doing similar things.  Dealing with undocumented/unstructured
> data is a big challenge. Although Drill's solution is not perfect, IMHO,
> it's still a big step towards such a problem.
>
> With that said, I agreed with points people raised earlier. In addition to
> "schema-on-read", Drill has to do a better to handle the traditional cases
> where schema is known beforehand, by introducing a meta-store /catalog, or
> by allowing users to declare schema upfront ( I probably will not call
> Drill "schema-forbidden"). The restart strategy seems to be also
> interesting to handle failure caused by missing schema / schema change.
>
>
>
>
> On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning 
> wrote:
>
> > Well, the restart strategy still works for your examples. And you only
> pay
> > once. From them you look at the cached type information and used an upper
> > bound data type as you read the data. Since it works to read the values
> in
> > the right order, it is obviously possible to push down typing information
> > even into the json reader.
> >
> >
> >
> > On Tue, Apr 3, 2018, 21:42 Paul Rogers 
> wrote:
> >
> > > Subtle point. I can provide schema with Parquet, as you note.
> (Actually,
> > > for Parquet, Drill is schema-required: I can't not provide a schema due
> > t

Re: "Death of Schema-on-Read"

2018-04-05 Thread Aman Sinha
All good discussions in this thread.  It clearly shows that Drill's
schema-on-read is not only a nice-to-have but, for applications like IoT, a
must-have.
For other types of data that change slowly, and where the user is willing to
run offline commands to discover the schema (as opposed to doing it while
querying), we should consider sampling the files with different sampling
percentages in order to improve the overall user experience.  This would be
similar to collecting statistics through sampling.  In fact, the two things
(schema discovery and stats) can be done in a single pass over the data.

-Aman

On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff  wrote:

> Hello,
>
> A lot of versioning problems arise when trying to share data through kafka
> between multiple applications with different lifecycles and maintainers,
> since by default, a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format,
> friendly with a record per record storage (like avro) and in order to not
> have to serialize the schema in use for every message, just reference an
> entry in the Avro Schema Registry (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-
> schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs allow to validate the
> message structure prior to the injection in kafka.
> So while comcast mentions the usage of an Avro Schema to describe its
> feeds, it does not mention directly the usage of avro files (to describe
> the schema).
>
> Coming back to Drill, I think it tries to make a nice effort to provide
> similar features on top of loosely typed datasets. It could probably try to
> do better in some cases (handling unknown types as `Unknown` is probably
> better than `Nullable Int`), but its ability to dynamically merge data with
> different (but still compatible) schemas is really nice.
>
> When using untyped file formats (JSON, CSV), Drill does its best, and while
> it is not perfect, but it is already pretty good.
> When relying on types formats like Parquet /ORC / Avro, lot of problems are
> solved because each file describes its columns (name / types), allowing
> even for complex structures.
> But the usage of CSV/JSON still is problematic. I like the idea of having
> an optional way to describe the expected types somewhere (either in a
> central meta-store, or in a structured file next to the dataset).
> That would make the usage of CTAS much safer/easier (sometimes, we have to
> use Spark to generate the Parquet files because of schema/type problems).
>
> Independently from the meta-store, it is a bit annoying that Drill would
> need to `discover` the columns and types at every scan through trial and
> error, and cannot benefit from the previous queries.
> Extending the `Analyze Table` command so that meta-data could be generated
> from JSON/CSV file/folder could improve this situation without introducing
> a costly/painful ETL process.
>
> Regards, Joel
>
>
> On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni  wrote:
>
> > I feel it's probably premature to cal it "death of schema-on-read" just
> > based on one application case. For one product I have been working on
> > recently, one use case is for IOT related application where data is sent
> > from a variety of small devices (sensors, camera, etc). It would be a
> hard
> > requirement to pre-define schema upfront for each device, before write
> data
> > into the system. Further, the value of data is likely to decrease
> > significantly over time; data within hours/days is way more important
> than
> > that of weeks/months ago. It's unimaginable to wait for weeks to run data
> > clean/preparation job, before user could query such data. In other words,
> > for application with requirements of  flexibility and time-sensitivity,
> > 'schema-on-read' provides a huge benefit, compared with traditional
> > ETL-then-query approach.
> >
> > Drill's schema-on-read is actually trying to solve a rather hard problem,
> > in that we deal with not only relational type, but also nested type. In
> > that sense, Drill is walking in an uncharted territory where not many
> > others are doing similar things.  Dealing with undocumented/unstructured
> > data is a big challenge. Although Drill's solution is not perfect, IMHO,
> > it's still a big step towards such a problem.
> >
> > With that said, I agreed with points people raised earlier. In addition
> to
> > "schema-on-read", Drill has to do a better to handle the traditional
> cases
> > where schema is known beforehand, 

Re: "Death of Schema-on-Read"

2018-04-05 Thread Charles Givre
I’ll weigh in here.  IMHO, Drill’s schema discovery features are excellent, but 
what would be nice is if Drill could:
1.  Accept hints or directives for the schema
2.  Remember these hints if the file doesn’t change. 
3.  Allow these hints to be applied to a directory of files.  

Therefore, I do think it would be useful for Drill to have some sort of 
metastore which would enable Drill to remember previously defined schemata so 
that you don’t have to define them over and over again.  For the logfile AKA 
regex file reader that Paul and I are working on, it does this via the format 
config, but it would be nice for this capability to exist for delimited and 
other file-based data types. 
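
For the curious, this is roughly the kind of thing that config carries (field
names approximate, since the plugin is still in flux):

"logs": {
  "type": "logRegex",
  "regex": "(\\d+) (\\w+) (.*)",
  "extension": "log",
  "schema": [
    { "fieldName": "pid", "fieldType": "INT" },
    { "fieldName": "level", "fieldType": "VARCHAR" },
    { "fieldName": "message", "fieldType": "VARCHAR" }
  ]
}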
-C


> On Apr 5, 2018, at 12:02, Aman Sinha  wrote:
> 
> All good discussions in this thread.  It clearly shows that Drill's
> schema-on-read is not only a nice-to-have but for applications like IOT, it
> is a must-have.
> For other types of data that is slowly changing,  in order to improve
> overall user experience where the user is willing to run offline commands
> to discover schema (as opposed to doing it while querying),
> we should consider doing sampling of the files with different sampling
> percentages.  This would be similar to collecting statistics through
> sampling.  In fact,
> the two things (schema discovery and stats) can be done in a single pass
> over the data.
> 
> -Aman
> 
> On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff  wrote:
> 
>> Hello,
>> 
>> A lot of versioning problems arise when trying to share data through kafka
>> between multiple applications with different lifecycles and maintainers,
>> since by default, a single message in Kafka is just a blob.
>> One way to solve that is to agree on a single serialization format,
>> friendly with a record per record storage (like avro) and in order to not
>> have to serialize the schema in use for every message, just reference an
>> entry in the Avro Schema Registry (this flow is described here:
>> https://medium.com/@stephane.maarek/introduction-to-
>> schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
>> ).
>> On top of the schema registry, specific client libs allow to validate the
>> message structure prior to the injection in kafka.
>> So while comcast mentions the usage of an Avro Schema to describe its
>> feeds, it does not mention directly the usage of avro files (to describe
>> the schema).
>> 
>> Coming back to Drill, I think it tries to make a nice effort to provide
>> similar features on top of loosely typed datasets. It could probably try to
>> do better in some cases (handling unknown types as `Unknown` is probably
>> better than `Nullable Int`), but its ability to dynamically merge data with
>> different (but still compatible) schemas is really nice.
>> 
>> When using untyped file formats (JSON, CSV), Drill does its best, and while
>> it is not perfect, but it is already pretty good.
>> When relying on types formats like Parquet /ORC / Avro, lot of problems are
>> solved because each file describes its columns (name / types), allowing
>> even for complex structures.
>> But the usage of CSV/JSON still is problematic. I like the idea of having
>> an optional way to describe the expected types somewhere (either in a
>> central meta-store, or in a structured file next to the dataset).
>> That would make the usage of CTAS much safer/easier (sometimes, we have to
>> use Spark to generate the Parquet files because of schema/type problems).
>> 
>> Independently from the meta-store, it is a bit annoying that Drill would
>> need to `discover` the columns and types at every scan through trial and
>> error, and cannot benefit from the previous queries.
>> Extending the `Analyze Table` command so that meta-data could be generated
>> from JSON/CSV file/folder could improve this situation without introducing
>> a costly/painful ETL process.
>> 
>> Regards, Joel
>> 
>> 
>> On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni  wrote:
>> 
>>> I feel it's probably premature to cal it "death of schema-on-read" just
>>> based on one application case. For one product I have been working on
>>> recently, one use case is for IOT related application where data is sent
>>> from a variety of small devices (sensors, camera, etc). It would be a
>> hard
>>> requirement to pre-define schema upfront for each device, before write
>> data
>>> into the system. Further, the value of data is likely to decrease
>>> significantly over time; data within hours/days is way more important
>> than
>>> that of weeks/months ago. It's unima

Re: "Death of Schema-on-Read"

2018-04-05 Thread Ted Dunning
On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff  wrote:

> Hello,
>
> A lot of versioning problems arise when trying to share data through kafka
> between multiple applications with different lifecycles and maintainers,
> since by default, a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format,
> friendly with a record per record storage (like avro) and in order to not
> have to serialize the schema in use for every message, just reference an
> entry in the Avro Schema Registry (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-
> schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs allow to validate the
> message structure prior to the injection in kafka.
> So while comcast mentions the usage of an Avro Schema to describe its
> feeds, it does not mention directly the usage of avro files (to describe
> the schema).
>

This is all good except for the assumption of a single schema for all time.
You can mutate schemas in Avro (or JSON) in a future-proof manner, but it
is important to recognize the simple truth that the data in a stream will
not necessarily be uniform (and is even unlikely to be uniform).




>
>  But the usage of CSV/JSON still is problematic. I like the idea of
> having
> an optional way to describe the expected types somewhere (either in a
> central meta-store, or in a structured file next to the dataset).
>

Central meta-stores are seriously bad problems and are the single biggest
nightmare in trying to upgrade Hive users. Let's avoid that if possible.

Writing meta-data next to the file is also problematic if it needs to be
written by the process doing a query (the directory may not be writable).

Having a convention for redirecting the meta-data cache to a parallel
directory might solve the problem of non-writable local locations.

In the worst case that Drill can't have any place to persist what it has
learned but wants to do a restart, there needs to be SOME place to cache
meta-data or else restarts will get no further than the original failed
query.


Re: "Death of Schema-on-Read"

2018-04-05 Thread Ted Dunning
On Thu, Apr 5, 2018 at 9:02 AM, Aman Sinha  wrote:

> All good discussions in this thread.  It clearly shows that Drill's
> schema-on-read is not only a nice-to-have but for applications like IOT, it
> is a must-have.
>

Absolutely. It must be possible to have schema on read even if that isn't
necessarily the most common path.


> For other types of data that is slowly changing,  in order to improve
> overall user experience where the user is willing to run offline commands
> to discover schema (as opposed to doing it while querying),
> we should consider doing sampling of the files with different sampling
> percentages.


I am not even convinced that sampling is necessary. Just have [select *
from foo.file] store the types and statistics that it discovers. If you
want to discover the stats before most people do a query, just do that on
each new file in a directory.



> This would be similar to collecting statistics through
> sampling.


I don't see why there needs to be anything special to make this happen. Any
full file scan for new files should suffice.



> In fact,
> the two things (schema discovery and stats) can be done in a single pass
> over the data.
>

Indeed.


Re: "Death of Schema-on-Read"

2018-04-05 Thread Ted Dunning
On Thu, Apr 5, 2018 at 12:12 PM, Charles Givre  wrote:

> I’ll weigh in here.  IMHO, Drill’s schema discovery features are
> excellent, but what would be nice is if Drill could:
> 1.  Accept hints or directives for the schema
> 2.  Remember these hints if the file doesn’t change.
> 3.  Allow these hints to be applied to a directory of files.
>

This might be better phrased as "allow the hints for files to be combined
to form hints for an entire directory".


> Therefore, I do think it would be useful for Drill to have some sort of
> metastore which would enable Drill to remember previously defined schemata
> so that you don’t have to do it over and over again.


Metastore is good.

*Centralized* metastore is transcendentally bad.


Re: "Death of Schema-on-Read"

2018-04-05 Thread Paul Rogers
N/CSV file/folder could improve this situation without introducing
> a costly/painful ETL process.
>
> Regards, Joel
>
>
> On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni  wrote:
>
> > I feel it's probably premature to cal it "death of schema-on-read" just
> > based on one application case. For one product I have been working on
> > recently, one use case is for IOT related application where data is sent
> > from a variety of small devices (sensors, camera, etc). It would be a
> hard
> > requirement to pre-define schema upfront for each device, before write
> data
> > into the system. Further, the value of data is likely to decrease
> > significantly over time; data within hours/days is way more important
> than
> > that of weeks/months ago. It's unimaginable to wait for weeks to run data
> > clean/preparation job, before user could query such data. In other words,
> > for application with requirements of  flexibility and time-sensitivity,
> > 'schema-on-read' provides a huge benefit, compared with traditional
> > ETL-then-query approach.
> >
> > Drill's schema-on-read is actually trying to solve a rather hard problem,
> > in that we deal with not only relational type, but also nested type. In
> > that sense, Drill is walking in an uncharted territory where not many
> > others are doing similar things.  Dealing with undocumented/unstructured
> > data is a big challenge. Although Drill's solution is not perfect, IMHO,
> > it's still a big step towards such a problem.
> >
> > With that said, I agreed with points people raised earlier. In addition
> to
> > "schema-on-read", Drill has to do a better to handle the traditional
> cases
> > where schema is known beforehand, by introducing a meta-store /catalog,
> or
> > by allowing users to declare schema upfront ( I probably will not call
> > Drill "schema-forbidden"). The restart strategy seems to be also
> > interesting to handle failure caused by missing schema / schema change.
> >
> >
> >
> >
> > On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning 
> > wrote:
> >
> > > Well, the restart strategy still works for your examples. And you only
> > pay
> > > once. From them you look at the cached type information and used an
> upper
> > > bound data type as you read the data. Since it works to read the values
> > in
> > > the right order, it is obviously possible to push down typing
> information
> > > even into the json reader.
> > >
> > >
> > >
> > > On Tue, Apr 3, 2018, 21:42 Paul Rogers 
> > wrote:
> > >
> > > > Subtle point. I can provide schema with Parquet, as you note.
> > (Actually,
> > > > for Parquet, Drill is schema-required: I can't not provide a schema
> due
> > > to
> > > > the nature of Parquet...)
> > > >
> > > > But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> > > > forbids the user from providing a schema; only the file format itself
> > can
> > > > provide the schema (or not, in the case of JSON). This is the very
> > heart
> > > of
> > > > the problem.
> > > >
> > > > The root cause of our schema change exception is that vectors are,
> > > indeed,
> > > > strongly typed. But, file columns are not. Here is my favorite:
> > > >
> > > > {x: 10} {x: 10.1}
> > > >
> > > > Blam! Query fails because the vector is chosen as BigInt, then we
> > > discover
> > > > it really should have been Float8. (If the answer is: go back and
> > rebuild
> > > > the vector with the new type, consider the case that 100K records
> > > separate
> > > > the two above so that the first batch is long gone by the time we see
> > the
> > > > offending record. If only I could tell Drill to use Float8 (or
> Decimal)
> > > up
> > > > front...
> > > >
> > > > Views won't help here because the failure occurs before a view can
> kick
> > > > in. However, presumably, I could write a view to handle a different
> > > classic
> > > > case:
> > > >
> > > > myDir /
> > > > |- File 1: {a: 10, b: "foo"}
> > > > |- File 2: {a: 20}
> > > >
> > > > With query: SELECT a, b FROM myDir
> > > >
> > > > For File 2, Drill will guess that b is a Nullable Int, but it is
&

Re: "Death of Schema-on-Read"

2018-04-05 Thread Paul Rogers
Great discussion. Really appreciate the insight from the Drill users!

To Ted's points: the simplest possible solution is to allow a table function to 
express types. Just making stuff up:

SELECT a FROM schema(myTable, (a: INT))

Or, a SQL extension:

SELECT a FROM myTable(a: INT)

Or, really ugly, a session option:

ALTER SESSION SET schema.myTable="a: INT"

All these are ephemeral and not compatible with, say, Tableau.

Building on Ted's suggestion of using the (distributed) file system we can toss 
out a few half-baked ideas. Maybe use a directory to represent a name space, 
with files representing tables. If I have "weblogs" as my directory, I might 
have a file called "jsonlog" to describe the (messy) format of my 
JSON-formatted log files. And "csvlog" to describe my CSV-format logs. 
Different directories represent different SQL databases (schemas), different 
files represent tables within the schema.


The table files can store column hints. But they could do more. Maybe define
the partitioning scheme (by year, month, day, say) so that it can be mapped to
a column. Wouldn't it be great if Drill could figure out the partitioning
itself if we gave it a date range?

The file could also define the format plugin to use, and its options, to avoid 
the need to define this format separate from the data, and to reduce the need 
for table functions.

Today, Drill matches files to format plugins using only extensions. The table 
file could provide a regex for those old-style files (such as real web logs) 
that don't use suffixes. Or, to differentiate between "sales.csv" and 
"returns.csv" in the same data directory.


While we're at it, the file might as well contain a standard view to apply to 
the table to define computed columns, do data conversions and so on.

If Drill does automatic scans (to detect schema, to gather stats), maybe store 
that alongside the table file: "csvlogs.drill" for the Drill-generated info.


Voila! A nice schema definition with no formal metastore. Because the info is
in files, it is easy to version using git, etc. (especially if the directory
can be mounted using NFS as a normal directory). Atomic updates can be done via
the rename trick (which, sadly, does not work on S3...)


Or, maybe store all information in ZK in JSON as we do for plugin 
configurations. (Hard to version and modify though...)


Lots of ways to skin this cat once we agree that hints are, in fact, useful 
additions to Drill's automatic schema detection.


Thanks,
- Paul

 

On Thursday, April 5, 2018, 3:22:07 PM PDT, Ted Dunning 
 wrote:  
 
 On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff  wrote:

> Hello,
>
> A lot of versioning problems arise when trying to share data through kafka
> between multiple applications with different lifecycles and maintainers,
> since by default, a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format,
> friendly with a record per record storage (like avro) and in order to not
> have to serialize the schema in use for every message, just reference an
> entry in the Avro Schema Registry (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-
> schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs allow to validate the
> message structure prior to the injection in kafka.
> So while comcast mentions the usage of an Avro Schema to describe its
> feeds, it does not mention directly the usage of avro files (to describe
> the schema).
>

This is all good except for the assumption of a single schema for all time.
You can mutate schemas in Avro (or JSON) in a future-proof manner, but it
is important to recognize the simple truth that the data in a stream will
not necessarily be uniform (and is even unlikely to be uniform)




>
>  But the usage of CSV/JSON still is problematic. I like the idea of
> having
> an optional way to describe the expected types somewhere (either in a
> central meta-store, or in a structured file next to the dataset).
>

Central meta-stores are seriously bad problems and are the single biggest
nightmare in trying to upgrade Hive users. Let's avoid that if possible.

Writing meta-data next to the file is also problematic if it needs to be
written by the processing doing a query (the directory may not be writable).

Having a convention for redirecting the meta-data cache to a parallel
directory might solve the problem of non-writable local locations.

In the worst case that Drill can't have any place to persist what it has
learned but wants to do a restart, there needs to be SOME place to cache
meta-data or else restarts will get no further than the original failed
query.
  

Re: "Death of Schema-on-Read"

2018-04-05 Thread Hanumath Rao Maduri
Hello,

Thank you Paul for starting this discussion.
However, I was not clear on the latest point as to how providing hints differs
from creating a view (a mechanism which already exists in Drill).
I do think that creating a view can be cumbersome (in terms of syntax).
Hints are ephemeral, and hence they can be used for quick validation
of the schema for a query execution. But if the user absolutely knows the
schema, then I think creating a view and using it might be a better option.
Can you please share your thoughts on this?

Thank you Ted for your valuable suggestions. Regarding your comment that
"metastore is good but centralized is bad", can you please share your view on
what design issues a centralized metastore can cause? I know that it can be a
bottleneck, but I would like to know about the other issues.
Put another way: if a centralized metastore is engineered well enough to avoid
most of the bottlenecks, do you think it would be good to use for metadata?

Thanks,
-Hanu

On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>
> Or, a SQL extension:
>
> SELECT a FROM myTable(a: INT)
>
> Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>
> All these are ephemeral and not compatible with, say, Tableau.
>
> Building on Ted's suggestion of using the (distributed) file system we can
> toss out a few half-baked ideas. Maybe use a directory to represent a name
> space, with files representing tables. If I have "weblogs" as my directory,
> I might have a file called "jsonlog" to describe the (messy) format of my
> JSON-formatted log files. And "csvlog" to describe my CSV-format logs.
> Different directories represent different SQL databases (schemas),
> different files represent tables within the schema.
>
>
> The table files can store column hints. But, it could do more. Maybe
> define the partitioning scheme (by year, month, day, say) so that can be
> mapped to a column. Wouldn't it be be great if Drill could figure out the
> partitioning itself if we gave a date range?
>
> The file could also define the format plugin to use, and its options, to
> avoid the need to define this format separate from the data, and to reduce
> the need for table functions.
>
> Today, Drill matches files to format plugins using only extensions. The
> table file could provide a regex for those old-style files (such as real
> web logs) that don't use suffixes. Or, to differentiate between "sales.csv"
> and "returns.csv" in the same data directory.
>
>
> While we're at it, the file might as well contain a standard view to apply
> to the table to define computed columns, do data conversions and so on.
>
> If Drill does automatic scans (to detect schema, to gather stats), maybe
> store that alongside the table file: "csvlogs.drill" for the
> Drill-generated info.
>
>
> Voila! A nice schema definition with no formal metastore. Because the info
> is in files, it easy to version using git, etc. (especially if the
> directory can be mounted using NFS as a normal directory.) Atomic updates
> can be done via the rename trick (which, sadly, does not work on S3...)
>
>
> Or, maybe store all information in ZK in JSON as we do for plugin
> configurations. (Hard to version and modify though...)
>
>
> Lots of ways to skin this cat once we agree that hints are, in fact,
> useful additions to Drill's automatic schema detection.
>
>
> Thanks,
> - Paul
>
>
>
> On Thursday, April 5, 2018, 3:22:07 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff  wrote:
>
> > Hello,
> >
> > A lot of versioning problems arise when trying to share data through
> kafka
> > between multiple applications with different lifecycles and maintainers,
> > since by default, a single message in Kafka is just a blob.
> > One way to solve that is to agree on a single serialization format,
> > friendly with a record per record storage (like avro) and in order to not
> > have to serialize the schema in use for every message, just reference an
> > entry in the Avro Schema Registry (this flow is described here:
> > https://medium.com/@stephane.maarek/introduction-to-
> > schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> > ).
> > On top of the schema registry, specific client libs allow to validate the
> > message structure prior to the injection in kafka.
> > So while comcast mentions the usage of an Avro Schema to describe its
> > feeds, it does not mention directly the usage of avro files (to describe
> > the schema).
> >
>
> This is all good except for the assumption of a single schema for all time.
> You can mutate schemas in Avro (or JSON) in a future-proof manner, but it
> is important to recognize the simple truth that the data in a stream will
> not necessarily be uniform (and is even unlikely to be uniform).

Re: "Death of Schema-on-Read"

2018-04-06 Thread Ted Dunning
On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>

Why not just allow cast to be pushed down to the reader?

Why invent new language features?
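
A sketch of what I mean (reusing Paul's made-up table name): the type is
already expressible in plain SQL today; the only change is that the planner
hands the cast to the reader instead of applying it after the scan.

SELECT CAST(a AS INT) AS a, b
FROM myTable
-- With cast push-down, the scan itself materializes column a as INT,
-- so an all-null or string-valued run never produces a conflicting guess.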

Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>

These are a big problem.


Re: "Death of Schema-on-Read"

2018-04-06 Thread Ted Dunning
On Thu, Apr 5, 2018 at 10:22 PM, Hanumath Rao Maduri 
wrote:

> ...
>
> Thank you Ted for your valuable suggestions, as regards to your comment on
> "metastore is good but centralized is bad" can you please share your view
> point on what all design issues it can cause. I know that it can be
> bottleneck but just want to know about other issues.

Put in other terms if centralized metastore engineered in a good way to
> avoid most of the bottleneck, then do you think it can be good to use for
> metadata?
>

Centralized metadata stores have caused the following problems in my
experience:

1) they lock versions and make it extremely hard to upgrade applications
incrementally. It is a common fiction that one can upgrade all applications
using the same data at the same moment. It isn't acceptable to require an
outage and force an upgrade on users. It also isn't acceptable to force the
metadata store to never be updated.

2) they go down and take everything else with them.

3) they require elaborate caching. The error message "updating metadata
cache" was the most common string on the Impala mailing list for a long
time because of the 30-minute delays that customers were seeing due to this
kind of problem.

4) they limit expressivity. Because it is hard to update a metadata store
safely, they move slowly and typically don't describe new data well. Thus,
the Hive metastore doesn't deal with variably-typed data or structured
data worth a darn. The same thing will happen with any new centralized
meta-data store.

5) they inhibit multi-tenancy. Ideally, data describes itself so that
different users can see the same data even if they are nominally not part
of the same org or sub-org.

6) they inhibit data fabrics that extend beyond a single cluster.
Centralized metadata stores are inherently anti-global. Self-describing
data, on the other hand, is inherently global since wherever the data
goes, so goes the metadata. Note that self-describing data does not have to
be intrinsically self-descriptive in a single file. I view JSON file with a
schema file alongside as a self-describing pair.

As an example, imagine that file extensions were tied to applications by a
central authority (a metadata store). This would mean that you couldn't
change web browsers (.html) or spreadsheets. Or compilers. And frankly,
the fact that my computer has a single idea about how a file is interpreted
is limiting. I would prefer to use photoshop on images in certain
directories and Preview for other images elsewhere. A single repository
linking file type to application is too limiting even on my laptop.

That is the same issue, ultimately, as a centralized metadata store except that
my issues with images are tiny compared to the problems that occur when you
have 5000 analysts working on data that all get screwed by a single broken
piece of software.


Re: "Death of Schema-on-Read"

2018-04-06 Thread Paul Rogers
Hi Hanu,
Sorry, I tossed in a new topic late in the discussion. We started by noting 
that views don't always work to resolve low-level schema conflicts for the 
reasons we discussed, so we need something else. That led to the schema hint 
discussion.

The additional point I raised was that views are still very useful for other 
tasks such as, say, computed columns (extended_price = price * quantity), for 
filtering out data (rejecting certain kinds of unwanted records) and so on.

If we need both hints and views (they serve distinct purposes), we'd want to 
ask how a user could combine them into a single file-based schema definition so 
that query users just see a simplified version of the table (with the hints and 
views applied.)

Since I tossed in new ideas, here is one more. We once saw a wild-and-crazy 
form of JSON with embedded metadata. Assume a list of customers:

{name: {type: "string", value: "Fred"}, age: {type: "int", value: 40}}

In such a case, it would be great to be able to transform the data so that the 
fields become simple values like this:

{name: "Fred", age: 40}

Views can do this for top-level fields. But, there is no syntax in Drill to do 
this in nested maps:

{... address: {street: {type: "string", value: "301 Cobblestone Way"}, ...}}

Ideally, we'd transform this to:

{name: "Fred", age: 40, address: {street: "301 Cobblestone Way", ...}}
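
For the top-level fields, a view roughly like this (a sketch; the path and
view name are made up) can already do the unwrapping; it is the nested
address map that has no equivalent today:

CREATE VIEW dfs.tmp.`customers_v` AS
SELECT c.`name`.`value`             AS `name`,
       CAST(c.`age`.`value` AS INT) AS age
FROM dfs.`/data/customers.json` c;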

So, if we come up with a metadata hint system, we (or the community) should be 
able to add rules for the type of messy data actually encountered in the field.

Thanks,
- Paul

 

On Thursday, April 5, 2018, 10:22:46 PM PDT, Hanumath Rao Maduri 
 wrote:  
 
 Hello,

Thank you Paul for starting this discussion.
However, I was not clear on the latest point as to how providing hints and
creating a view(mechanism which already exists in DRILL) is different.
I do think that creating a view can be cumbersome (in terms of syntax).
Providing hints are ephemeral and hence it can be used for quick validation
of the schema for a query execution. But if the user absolutely knows the
schema, then I think creating a view and using it might be a better option.
Can you please share your thoughts on this.

Thank you Ted for your valuable suggestions, as regards to your comment on
"metastore is good but centralized is bad" can you please share your view
point on what all design issues it can cause. I know that it can be
bottleneck but just want to know about other issues.
Put in other terms if centralized metastore engineered in a good way to
avoid most of the bottleneck, then do you think it can be good to use for
metadata?

Thanks,
-Hanu

On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>
> Or, a SQL extension:
>
> SELECT a FROM myTable(a: INT)
>
> Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>
> All these are ephemeral and not compatible with, say, Tableau.
>
> Building on Ted's suggestion of using the (distributed) file system we can
> toss out a few half-baked ideas. Maybe use a directory to represent a name
> space, with files representing tables. If I have "weblogs" as my directory,
> I might have a file called "jsonlog" to describe the (messy) format of my
> JSON-formatted log files. And "csvlog" to describe my CSV-format logs.
> Different directories represent different SQL databases (schemas),
> different files represent tables within the schema.
>
>
> The table files can store column hints. But, it could do more. Maybe
> define the partitioning scheme (by year, month, day, say) so that can be
> mapped to a column. Wouldn't it be be great if Drill could figure out the
> partitioning itself if we gave a date range?
>
> The file could also define the format plugin to use, and its options, to
> avoid the need to define this format separate from the data, and to reduce
> the need for table functions.
>
> Today, Drill matches files to format plugins using only extensions. The
> table file could provide a regex for those old-style files (such as real
> web logs) that don't use suffixes. Or, to differentiate between "sales.csv"
> and "returns.csv" in the same data directory.
>
>
> While we're at it, the file might as well contain a standard view to apply
> to the table to define computed columns, do data conversions and so on.
>
> If Drill does automatic scans (to detect schema, to gather stats), maybe
> store that alongside the table file: "csvlogs.drill" for the
> Drill-generated info.
>
>
> Voila! A nice schema definition with no formal metastore. Because the info
> is in files, it easy to version using git, etc. (especially if the
> directory can be mounted using NFS as a normal directory.) Atomic updates
> can be done via the rename trick (which, sadly, does not work on S3...)
>
>
>

Re: "Death of Schema-on-Read"

2018-04-06 Thread Paul Rogers
Ted, this is why your participation in Drill is such a gift: cast push-down is 
an elegant, simple solution that even works in views. Beautiful.

Thanks,
- Paul

 

On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning 
 wrote:  
 
 On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>

Why not just allow cast to be pushed down to the reader?

Why invent new language features?

Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>

These are a big problem.
  

Re: "Death of Schema-on-Read"

2018-04-06 Thread Hanumath Rao Maduri
Hello,

Thanks to Ted & Paul for clarifying my questions.
Sorry for not being clear in my previous post. When I said "create view" I
had in mind simple views where we currently use cast expressions to cast
columns to types. In this case the planner can use that information to
force the scans to use it as the schema.

If the query fails, it then fails at the scan rather than after the scanner
has inferred a schema.
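
As a concrete illustration (a sketch; the workspace and paths are made up),
such a simple view today would be just casts over the base table:

CREATE VIEW dfs.tmp.`v1` AS
SELECT CAST(a AS INT)    AS a,
       CAST(b AS BIGINT) AS b
FROM dfs.`/data/t1.json`;

SELECT * FROM dfs.tmp.`v1`;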

I know that views can get complicated with joins and expressions. For
schema hinting through views, I assume they should be created on single
tables, projecting just the columns one wants from the table.


Regarding the same question, today we had a discussion with Aman. Here a view
can be considered a "view" of the table with a schema in place.

We can adapt the syntax to suit schema specification, something like this:

create schema[optional] view(/virtual table ) v1 as (a: int, b : int)
select a, b from t1 with some other rules as to conversion of scalar to
complex types.

Then queries on this view (below) should enable the scanner to use this
type information and convert the data into the appropriate types:
select * from v1

For the possibility of the schema not being known by the user, maybe use
something like this:

create schema[optional] view(/virtual table) v1 as select a, b from t1
infer schema.

This view, when used to query the table, should trigger the logic of
inferring and consolidating the schema and attaching that inferred schema
to the view. In the future when we use the same view, we should be using the
inferred schema. This view can either be a local view pertaining to the
session or a global view so that other queries across sessions can use it.


By default we can apply certain rules such as converting simple scalar
values to other scalar values (like int to double etc). But we should also
be able to give the customer the option to enable rules such as scalar int
to array[int] when creating the view itself.


Thanks,
-Hanu


On Fri, Apr 6, 2018 at 3:10 PM, Paul Rogers 
wrote:

> Ted, this is why your participation in Drill is such a gift: cast
> push-down is an elegant, simple solution that even works in views.
> Beautiful.
>
> Thanks,
> - Paul
>
>
>
> On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
> wrote:
>
> > Great discussion. Really appreciate the insight from the Drill users!
> >
> > To Ted's points: the simplest possible solution is to allow a table
> > function to express types. Just making stuff up:
> >
> > SELECT a FROM schema(myTable, (a: INT))
> >
>
> Why not just allow cast to be pushed down to the reader?
>
> Why invent new language features?
>
> Or, really ugly, a session option:
> >
> > ALTER SESSION SET schema.myTable="a: INT"
> >
>
> These are a big problem.
>
>


Re: "Death of Schema-on-Read"

2018-04-06 Thread Aman Sinha
On the subject of CAST pushdown to Scans, there are potential drawbacks
...

   - In general, the planner will see a Scan-Project where the Project has
   CAST functions.  But the Project can have arbitrary expressions,  e.g
   CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
   functions etc.   It would be quite expensive to examine each expression
   (there could be hundreds) to determine whether it is eligible to be pushed
   to the Scan.
   - Expressing Nullability is not possible with CAST.   If a column should
   be tagged as  (not)nullable, CAST syntax does not allow that.
   - Drill currently supports CASTing to a SQL data type, but not to the
   complex types such as arrays and maps.  We would have to add support for
   that from a language perspective as well as the run-time.  This would be
   non-trivial effort.

-Aman


On Fri, Apr 6, 2018 at 4:59 PM, Hanumath Rao Maduri 
wrote:

> Hello,
>
> Thanks for Ted & Paul for clarifying my questions.
> Sorry for not being clear in my previous post, When I said create view I
> was under the impression for simple views where we use cast expressions
> currently to cast them to types. In this case planner can use this
> information to force the scans to use this as the schema.
>
> If the query fails then it fails at the scan and not after inferring the
> schema by the scanner.
>
> I know that views can get complicated with joins and expressions. For
> schema hinting through views I assume they should be created on single
> tables with corresponding columns one wants to project from the table.
>
>
> Regarding the same question, today we had a discussion with Aman. Here view
> can be considered as a "view" of the table with schema in place.
>
> We can change some syntax to suite it for specifying schema. something like
> this.
>
> create schema[optional] view(/virtual table ) v1 as (a: int, b : int)
> select a, b from t1 with some other rules as to conversion of scalar to
> complex types.
>
> Then the queries when used on this view (below) should enable the scanner
> to use this type information and then use it to convert the data into the
> appropriate types.
> select * from v1
>
> For the possibility of schema information not being known by the user, may
> be use something like this.
>
> create schema[optional] view(/virtual table) v1 as select a, b from t1
> infer schema.
>
> This view when used to query the table should trigger the logic of
> inferring and consolidating the schema and attaching that inferred schema
> to the view. In future when we use the same view, we should be using the
> inferred schema. This view either can be local view pertaining to the
> session or a global view so that other queries across sessions can use
> them.
>
>
> By default we can apply certain rules such as converting simple scalar
> values to other scalar values (like int to double etc). But we should be
> also able to give option to the customer to enable rules such as scalar int
> to array[int] when creating the view itself.
>
>
> Thanks,
> -Hanu
>
>
> On Fri, Apr 6, 2018 at 3:10 PM, Paul Rogers 
> wrote:
>
> > Ted, this is why your participation in Drill is such a gift: cast
> > push-down is an elegant, simple solution that even works in views.
> > Beautiful.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers 
> > wrote:
> >
> > > Great discussion. Really appreciate the insight from the Drill users!
> > >
> > > To Ted's points: the simplest possible solution is to allow a table
> > > function to express types. Just making stuff up:
> > >
> > > SELECT a FROM schema(myTable, (a: INT))
> > >
> >
> > Why not just allow cast to be pushed down to the reader?
> >
> > Why invent new language features?
> >
> > Or, really ugly, a session option:
> > >
> > > ALTER SESSION SET schema.myTable="a: INT"
> > >
> >
> > These are a big problem.
> >
> >
>


Re: "Death of Schema-on-Read"

2018-04-06 Thread Paul Rogers
Hi Hanu,

The problem with views as they are, even with casts, is that the casting comes 
too late to resolve the issues I highlighted in earlier messages. Ted's cast 
push-down idea causes the conversion to happen at read time so that we can, 
say, cast a string to an int, or cast a null to the proper type.

Today, if we use a cast, such as SELECT cast(a AS INT) FROM myTable, then we get 
a DAG that has three parts (to keep things simple):

* Scan the data, using types inferred from the data itself
* In a Project operator, convert the type of the data to INT
* In Screen, return the result to the user

If the type is ambiguous in the file, then the first step above fails; data 
never gets far enough for the Project to kick in and apply the cast. Also, if a 
file contains a run of nulls, the scanner will choose Nullable Int, then fail 
when it finds, say, a string.

The key point is that the cast push-down means that the query will not fail due 
to dicey files: the cast resolves the ambiguity. If we push the cast down, then 
it is the SCAN operator that resolves the conflict and does the cast; avoiding 
the failures we've been discussing.
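
As a sketch (column and path names made up), the scan would honor the types
named in the query rather than guessing from the first records it happens to
see:

SELECT CAST(zip AS VARCHAR)   AS zip,
       CAST(amount AS DOUBLE) AS amount
FROM dfs.`/logs/orders.json`
-- With the casts pushed down, a leading run of nulls in "zip" is
-- materialized as nullable VARCHAR instead of the scanner's default
-- Nullable Int guess, so later non-null values no longer cause a
-- schema change failure.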

I like the idea you seem to be proposing: cascading views. Have a table view 
that cleans up each table. Then, these can be combined in higher-order views 
for specialized purposes.

The beauty of the cast push-down idea is that no metadata is needed other than 
the query. If the user wants metadata, they use existing views (that contain 
the casts and cause the cast push-down.)

This seems like such a simple, elegant solution that we could try it out 
quickly (if we get past the planner issues Aman mentioned.) In fact, the new 
scan operator code (done as part of the batch sizing work) already has a 
prototype mechanism for type hints. If the type hint is provided to the 
scanner, it uses them, otherwise it infers the type. We'd just hook up the cast 
push down data to that prototype and we could try out the result quickly. (The 
new scan operator is still in my private branch, in case anyone goes looking 
for it...)

Some of your discussion talks about automatically inferring the schema. I 
really don't think we need to do that. The hint (cast push-down) is sufficient 
to resolve ambiguities in the existing scan-time schema inference.

The syntax trick would be to find a way to provide hints just for those columns 
that are issues. If I have a table with columns a, b, ... z, but only b is a 
problem, I don't want to have to do:

SELECT a, CAST(b AS INT), c, ... z FROM myTable

Would be great if we could just do:

SELECT *, CAST(b AS INT) FROM myTable

I realize the above has issues; the key idea is: provide casts only for the 
problem fields without spelling out all fields.

If we really want to get fancy, we can do UDF push down for the complex cases 
you mentioned. Maybe:

SELECT *, CAST(b AS INT), parseCode(c) ...

We are diving into design here; maybe you can file a JIRA and we can shift 
detailed design discussion to that JIRA. Salim already has one related to 
schema change errors, which was why the "Death" article caught my eye.

Thanks,
- Paul

 

On Friday, April 6, 2018, 4:59:40 PM PDT, Hanumath Rao Maduri 
 wrote:  
 
 Hello,

Thanks for Ted & Paul for clarifying my questions.
Sorry for not being clear in my previous post, When I said create view I
was under the impression for simple views where we use cast expressions
currently to cast them to types. In this case planner can use this
information to force the scans to use this as the schema.

If the query fails then it fails at the scan and not after inferring the
schema by the scanner.

I know that views can get complicated with joins and expressions. For
schema hinting through views I assume they should be created on single
tables with corresponding columns one wants to project from the table.


Regarding the same question, today we had a discussion with Aman. Here view
can be considered as a "view" of the table with schema in place.

We can change some syntax to suite it for specifying schema. something like
this.

create schema[optional] view(/virtual table ) v1 as (a: int, b : int)
select a, b from t1 with some other rules as to conversion of scalar to
complex types.

Then the queries when used on this view (below) should enable the scanner
to use this type information and then use it to convert the data into the
appropriate types.
select * from v1

For the possibility of schema information not being known by the user, may
be use something like this.

create schema[optional] view(/virtual table) v1 as select a, b from t1
infer schema.

This view when used to query the table should trigger the logic of
inferring and consolidating the schema and attaching that inferred schema
to the view. In future when we use the same view, we should be using the
inferred schema. This view either can be local view pertaining to the
session or a global view so that other queries across sessions can use them.


By default we can apply

Re: "Death of Schema-on-Read"

2018-04-06 Thread Paul Rogers
Hi Aman,

As we get into details, I suggested to Hanu that we move the discussion into a 
JIRA ticket.

 >On the subject of CAST pushdown to Scans, there are potential drawbacks

 >  - In general, the planner will see a Scan-Project where the Project has  
CAST functions.  But the Project can have arbitrary expressions,  e.g  CAST(a 
as INT) * 5

Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the 
Project operator.

>  or a combination of 2 CAST functions 

If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then one 
simple rule is to push only the innermost cast downwards.

> or non-CAST functions etc.

Just keep it in Project.

 >    It would be quite expensive to examine each expression (there could be 
hundreds) to determine whether it is eligible to be pushed to the Scan.

Just push CAST(<column> AS <type>). Even that would be a huge win. Note, 
for CSV, it might have to be CAST(columns[2] AS INT), since "columns" is 
special for CSV.
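
For instance (a sketch; the file and column positions are made up), a
headerless CSV query already has to go through the "columns" array, and
those are exactly the casts the planner would hand to the scan:

SELECT CAST(columns[0] AS INT)    AS id,
       CAST(columns[2] AS DOUBLE) AS price
FROM dfs.`/data/sales.csv`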

>   - Expressing Nullability is not possible with CAST.  If a column should be 
> tagged as  (not)nullable, CAST syntax does not allow that.

Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL) ?

 >  - Drill currently supports CASTing to a SQL data type, but not to the 
complex types such as arrays and maps.  We would have to add support for that 
from a language perspective as well as the run-time.  This would be non-trivial 
effort.

The term "complex type" is always confusing. Consider a map. The rules would 
apply recursively to the members of the map. (Problem: today, if I reference a 
map member, Drill pulls it to the top level: SELECT m.a creates a new top-level 
field, it does not select "a" within "m". We need to fix that anyway.) So, 
CAST(m.a AS INT) should imply the type of column "a" within map "m".

For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS INT) 
to force array elements to INT. Maybe use CAST(a[][] AS INT) for a repeated 
list (2D array).

Unions don't need a solution as they are their own solution (they can hold 
multiple types.) Same for (non-repeated) lists. 

To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that "m" 
is a Map from the expression CAST(m.a AS INT). For arrays, the previously 
suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be a non-null 
scalar, then we have no good answer.

CAST cannot solve the nasty cases of JSON in which some fields are complex, 
some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value: "foo"}}. I 
suppose no solution is perfect...

I'm sure that, if someone gets a chance to design this feature, they'll find 
lots more issues. Maybe cast push-down is only a partial solution. But, it 
seems to solve so many of the JSON and CSV cases that I've seen that it seems 
too good to pass up.

Thanks,


- Paul  

Re: "Death of Schema-on-Read"

2018-04-07 Thread Hanumath Rao Maduri
Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu

On Fri, Apr 6, 2018 at 7:38 PM, Paul Rogers 
wrote:

> Hi Aman,
>
> As we get into details, I suggested to Hanu that we move the discussion
> into a JIRA ticket.
>
>  >On the subject of CAST pushdown to Scans, there are potential drawbacks
>
>  >  - In general, the planner will see a Scan-Project where the Project
> has  CAST functions.  But the Project can have arbitrary expressions,  e.g
> CAST(a as INT) * 5
>
> Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the
> Project operator.
>
> >  or a combination of 2 CAST functions
>
> If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then
> one simple rule is to push only the innermost cast downwards.
>
> > or non-CAST functions etc.
>
> Just keep it in Project.
>
>  >It would be quite expensive to examine each expression (there could
> be hundreds) to determine whether it is eligible to be pushed to the Scan.
>
> Just push CAST(<column> AS <type>). Even that would be a huge win.
> Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns"
> is special for CSV.
>
> >   - Expressing Nullability is not possible with CAST.  If a column
> should be tagged as  (not)nullable, CAST syntax does not allow that.
>
> Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL)
> ?
>
>  >  - Drill currently supports CASTing to a SQL data type, but not to
> the complex types such as arrays and maps.  We would have to add support
> for that from a language perspective as well as the run-time.  This would
> be non-trivial effort.
>
> The term "complex type" is always confusing. Consider a map. The rules
> would apply recursively to the members of the map. (Problem: today, if I
> reference a map member, Drill pulls it to the top level: SELECT m.a creates
> a new top-level field, it does not select "a" within "m". We need to fix
> that anyway.  So, CAST(m.a AS INT) should imply the type of column "a"
> within map "m".
>
> For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS
> INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a
> repeated list (2D array).
>
> Unions don't need a solution as they are their own solution (they can hold
> multiple types.) Same for (non-repeated) lists.
>
> To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that
> "m" is a Map from the expression CAST(m.a AS INT). For arrays, the
> previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be
> a non-null scalar, then we have no good answer.
>
> CAST cannot solve the nasty cases of JSON in which some fields are
> complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value:
> "foo"}}. I suppose no solution is perfect...
>
> I'm sure that, if someone gets a chance to desig this feature, they'll
> find lots more issues. Maybe cast push-down is only a partial solution.
> But, it seems to solve so many of the JSON and CSV cases that I've seen
> that it seems too good to pass up.
>
> Thanks,
>
>
> - Paul


Re: "Death of Schema-on-Read"

2018-04-07 Thread Paul Rogers
Hi Hanu,
Thanks! After sleeping on the idea, I realized that it can be generalized for 
any kind of expression. But, I also realized that the cast mechanism, by 
itself, cannot be a complete solution. Details posted in the JIRA for anyone 
who is interested.
Thanks,
- Paul

 

On Saturday, April 7, 2018, 8:05:34 AM PDT, Hanumath Rao Maduri 
 wrote:  
 
 Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu

  

Re: "Death of Schema-on-Read"

2018-04-08 Thread Ted Dunning
I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>- In general, the planner will see a Scan-Project where the Project has
>CAST functions.  But the Project can have arbitrary expressions,  e.g
>CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
>functions etc.   It would be quite expensive to examine each expression
>(there could be hundreds) to determine whether it is eligible to be
> pushed
>to the Scan.
>

How is this different than filter and project pushdown? There could be
hundreds of those and it could be difficult for Calcite to find appropriate
pushdowns. But I have never heard of any problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.


>- Expressing Nullability is not possible with CAST.   If a column should
>be tagged as  (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability, and
there is no requirement that the two constraints (the cast and the
nullability) be near each other syntactically. Furthermore, if the
query does not specify nullability, then the scanner is free to do so.
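
A sketch of that combination (names made up): the cast in the projection and
the IS NOT NULL in the filter together let a scanner that accepts both
pushdowns produce a non-nullable INT column.

SELECT CAST(a AS INT) AS a
FROM dfs.`/data/t.json`
WHERE a IS NOT NULL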


>- Drill currently supports CASTing to a SQL data type, but not to the
>complex types such as arrays and maps.  We would have to add support for
>that from a language perspective as well as the run-time.  This would be
>non-trivial effort.
>

Well, there is a trivial subset of this effort in that casting a.b.c is
easy to express. Anything more complex is hard for current scanners to use
anyway.

So deferring most of the work on complex types is a fine solution. It isn't
like SQL has nice syntax for casting of anything.


Re: "Death of Schema-on-Read"

2018-04-08 Thread Paul Rogers
Hi Ted,

All good points. In the Jira ticket, I mentioned that using casts to infer type 
is a special case of a general problem: inferring column type information in a 
downward traversal of the parse tree, then refining that type information with 
a bottom-up traversal of the DAG at run time. Turns out this problem is similar 
to one that the dynamic language people are exploring to convert dynamic types 
to static types automatically. There is a paper reference in the ticket.

Our problem occurs, in general, with that bottom-up type inference. Because 
Drill is distributed, different branches of the tree may make different type 
inferences. (We discussed several examples.) We hope that higher nodes can 
reconcile conflicting decisions made by lower nodes, but we've noted cases 
where this is not possible. And, we've noted cases where the scanner itself 
ends up making conflicting decisions (e.g. a long run of nulls followed by 
non-null data that reveals the type.)

Further, schema is a property of the DATA not the QUERY. But, casts are a 
property of the query. So, relying exclusively on a query to provide type 
information must be an incomplete and redundant solution. That said, if type 
information is available only in the query, we should certainly use it to 
resolve column type ambiguity. Cast is simply the most obvious source of 
inference. So, better type inference is helpful, but neither necessary nor 
sufficient.

There is also the mismatch between SQL syntax and the desired simplicity of a 
hint. We want to provide a hint only for a troublesome column or two, but to 
mention one column in a SELECT clause, we must mention all of them; we can no 
longer use the wildcard (which, of course, shouldn't be used, but sure is handy 
during data exploration: the very situation where the type problem is most 
likely to occur.)

Here it might be useful to recall the subtitle of the article: "Data governance 
and the death of schema on read." The author's point is that, in a production 
system, schemas cannot be ad-hoc, loosey-goosey (my terms), but rather must be 
well-understood and agreed upon so that all parties agree on data structure and 
meaning. In Drill's world, it says that schema (or hints) are discovered during 
data exploration, the results captured in some form of metadata, and then used 
by a wide number of users sitting in Tableau getting their day-to-day work 
done. In short, schema (and metadata in general) are needed to move from 
exploration into production. I personally am receptive to this idea because of 
the many years spent building BI tools in which schema information provided a 
much more robust and simple user experience.

Thanks,

- Paul

 

On Sunday, April 8, 2018, 10:58:02 AM PDT, Ted Dunning 
 wrote:  
 
 I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>    - In general, the planner will see a Scan-Project where the Project has
>    CAST functions.  But the Project can have arbitrary expressions,  e.g
>    CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
>    functions etc.  It would be quite expensive to examine each expression
>    (there could be hundreds) to determine whether it is eligible to be
> pushed
>    to the Scan.
>

How is this different than filter and project pushdown? There could be
hundreds of those and it could be difficult for Calcite to find appropriate
pushdowns. But I have never heard of any problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.


>    - Expressing Nullability is not possible with CAST.  If a column should
>    be tagged as  (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability and
there is no requirement that the two constraints (the cast and the
nullability) need not be near each other syntactically. Furthermore, if the
query does not specify nullability, then the scanner is free to do so.


>    - Drill currently supports CASTing to a SQL data type, but not to the
>    complex types such as arrays and

Re: "Death of Schema-on-Read"

2018-04-08 Thread Aman Sinha
On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning  wrote:

> I have been thinking about this email and I still don't understand some of
> the comments.
>
> On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:
>
> > On the subject of CAST pushdown to Scans, there are potential drawbacks
> > ...
> >
> >- In general, the planner will see a Scan-Project where the Project
> has
> >CAST functions.  But the Project can have arbitrary expressions,  e.g
> >CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
> >functions etc.   It would be quite expensive to examine each
> expression
> >(there could be hundreds) to determine whether it is eligible to be
> > pushed
> >to the Scan.
> >
>
> How is this different than filter and project pushdown? There could be
> hundreds of those and it could be difficult for Calcite to find appropriate
> pushdowns. But I have never heard of any problem.
>
> - the traversal of all expressions is already required and already done in
> order to find the set of columns that are being extracted. As such, cast
> pushdown can be done in the same motions as project pushdown.
>

It is true that the amount of work done by the planner would be about the same
as when determining projection pushdowns into the scan.  In my mind I was
contrasting with the pure DDL based approach with an explicitly specified
schema (such as with a 'CREATE EXTERNAL TABLE ...' or with per-query hints as
Paul mentioned).  However, in the absence of those, I agree that it would be a
win to do the 'simple' CAST pushdowns, keeping in mind that the same column may
be referenced in multiple ways, e.g. CAST(a as varchar(10)) and CAST(a as
varchar(20)) in the same query/view.  In such cases, we would want to either
not do the pushdown or determine the highest common datatype and push that
down.
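
For example (a sketch; names made up):

SELECT CAST(a AS VARCHAR(10)) AS a_short,
       CAST(a AS VARCHAR(20)) AS a_long
FROM dfs.`/data/t1.json`
-- One option: push only the widest type, VARCHAR(20), into the scan and
-- re-apply the narrower cast in the Project; the other: skip the pushdown
-- for column "a" entirely.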

All of this, though, does not preclude the real need for the 'source of
truth' of the schema for the cases where data has been already explored and
curated.
We do want to have a solution for that core issue.

-Aman


Re: "Death of Schema-on-Read"

2018-04-08 Thread Ted Dunning
Inline

On Sun, Apr 8, 2018 at 12:36 PM, Aman Sinha  wrote:

> On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning 
> wrote:
>
> > [cast pushdown is cheaper than you might think]
>
> It is true that the amount of work done by the planner would be about the
> same as when
> determining projection pushdowns into the scan.  In my mind I was
> contrasting with the
> pure DDL based approach with an explicitly specified schema (such as with
> a  'CREATE EXTERNAL TABLE ...' or with per query hints as Paul mentioned).
>

Actually, I think that these are equivalent in many ways. The only
important difference is when the constraints on type are expressed. With
DDL, it is expressed ahead of time.  With cast-pushdown, it is done at the
last moment in the query.  With DDL, we traditionally assume that the
reader has already taken the information into account before the query
starts, but nothing really says that it has to.

The information expressed in either case is essentially identical, however,
and the reader should take heed. We are just allowing late typing if we add
cast pushdown.


... keeping in mind that the same column may be
> referenced in multiple ways:  e.g   CAST(a as varchar(10)),  CAST(a as
> varchar(20))
> in the same query/view.  In such cases, we would want to either not do the
> pushdown or determine the highest common datatype and push that down.
>

I disagree. It is the job of the reader to return both variants if it
accepts the pushdown or return the raw data and not do any casting. This is
really quite similar to cases where the column is retrieved twice.

If the reader can do something clever by reading the data once with one
cast and then modifying the column after reading, that's great, but we
can't really assume that it will.



>
> All of this, though, does not preclude the real need for the 'source of
> truth' of the schema for the cases where data has been already explored and
> curated. We do want to have a solution for that core issue.
>

Why is the answer different from a view with casts (that might be pushed
down)?


Re: "Death of Schema-on-Read"

2018-04-08 Thread Aman Sinha
Responding to the last comment since it is at the heart of the discussion:

On Sun, Apr 8, 2018 at 3:20 PM, Ted Dunning  wrote:

> Inline
>
> On Sun, Apr 8, 2018 at 12:36 PM, Aman Sinha  wrote:
>
> > On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning 
> > wrote:
> 

>
> > All of this, though, does not preclude the real need for the 'source of
> > truth' of the schema for the cases where data has been already explored
> and
> > curated. We do want to have a solution for that core issue.
> >
>
> Why is the answer different from a view with casts (that might be pushed
> down)?
>

 I agree that views with CASTs and pushdowns help a lot but they are
limited to column names and data types primarily.
I can think of a few reasons why they are not sufficient (not sure if these
were mentioned by Paul or not):

1. Constraints such as UNIQUEness (for primary keys),  partitioning keys,
collation property etc are not expressible in views.
These attributes are heavily used by the optimizer.

2.  We want to support collecting statistics via something like an ANALYZE
TABLE command (a sketch follows this list). This would operate on the base
table rather than views, since we want to collect histograms, number-of-
distinct-values stats, etc. on the base data such that they can be shared
among all queries, not necessarily those against specific views.

3.  INSERT INTO TABLE may need to do schema checks and schema merging
before inserting the data. These would not be possible with views.
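
A sketch of item 2 (purely hypothetical syntax; Drill has no such command
today, and the path is made up):

ANALYZE TABLE dfs.`/data/orders` COMPUTE STATISTICS;
-- Stats are gathered once on the base table so that every query and every
-- view defined over dfs.`/data/orders` can reuse them.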

-Aman