Hello,

A lot of versioning problems arise when trying to share data through Kafka between multiple applications with different lifecycles and maintainers, since by default a single message in Kafka is just a blob. One way to solve that is to agree on a single serialization format that is friendly to record-by-record storage (like Avro) and, to avoid serializing the schema itself with every message, to reference an entry in the Avro Schema Registry instead (this flow is described here: https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321). On top of the schema registry, dedicated client libraries validate the message structure before it is injected into Kafka. So while Comcast mentions using an Avro schema to describe its feeds, it does not directly mention using Avro files (to describe the schema).
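To make that flow concrete, here is a minimal sketch assuming Confluent's Java client and Avro serializer (the topic name, record fields, and URLs are made up for illustration). The serializer registers or looks up the schema in the registry once, so each message carries only a small schema id rather than the full schema:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RegistryAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Value serializer that talks to the schema registry and prepends
        // only the registered schema id to each message, not the schema itself.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
          + "{\"name\":\"sensorId\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"double\"}]}");

        GenericRecord reading = new GenericData.Record(schema);
        reading.put("sensorId", "s-42");
        reading.put("value", 10.1);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer validates the record against the schema here, so a
            // malformed message fails in the producer instead of poisoning the topic.
            producer.send(new ProducerRecord<>("sensor-readings", "s-42", reading));
        }
    }
}
```

Consumers do the reverse lookup by id, which is how applications with different lifecycles can agree on the structure without shipping it in every message.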
Coming back to Drill, I think it makes a real effort to provide similar features on top of loosely typed datasets. It could probably do better in some cases (treating unknown types as `Unknown` would be better than `Nullable Int`), but its ability to dynamically merge data with different (yet still compatible) schemas is really nice.

When using untyped file formats (JSON, CSV), Drill does its best, and while it is not perfect, it is already pretty good. When relying on typed formats like Parquet / ORC / Avro, a lot of problems disappear because each file describes its columns (names and types), even for complex structures. But the use of CSV/JSON remains problematic. I like the idea of an optional way to describe the expected types somewhere (either in a central meta-store or in a structured file next to the dataset). That would make CTAS much safer and easier to use (sometimes we have to fall back to Spark to generate the Parquet files because of schema/type problems).

Independently of the meta-store, it is a bit annoying that Drill has to `discover` the columns and types at every scan through trial and error, and cannot benefit from previous queries. Extending the `Analyze Table` command so that metadata could be generated from a JSON/CSV file or folder would improve this situation without introducing a costly and painful ETL process.
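To illustrate the "each file describes its columns" point, here is a minimal Java sketch with the plain Avro API (the field names and output path are made up). Because the schema is declared up front and stored in the file header, an integral-looking value and a decimal value both land in the same double column, and a missing field is an explicit null instead of a guessed type:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteTypedFile {
    public static void main(String[] args) throws IOException {
        // The schema goes into the file header, so any reader (Drill included)
        // sees `a` as a nullable double and `b` as a nullable string,
        // no matter what the individual values look like.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"a\",\"type\":[\"null\",\"double\"],\"default\":null},"
          + "{\"name\":\"b\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        GenericRecord r1 = new GenericData.Record(schema);
        r1.put("a", 10.0);   // integral-looking value, still a double
        r1.put("b", "foo");

        GenericRecord r2 = new GenericData.Record(schema);
        r2.put("a", 10.1);
        r2.put("b", null);   // missing column is an explicit null, not a guessed type

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("/tmp/rows.avro"));
            writer.append(r1);
            writer.append(r2);
        }
    }
}
```

This is roughly what we fall back to Spark for today: producing files whose column types a reader can trust without guessing.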
Regards,
Joel

On Wed, Apr 4, 2018 at 10:35 PM, Jinfeng Ni <j...@apache.org> wrote:

> I feel it's probably premature to call it the "death of schema-on-read" just based on one application case. For one product I have been working on recently, one use case is an IoT-related application where data is sent from a variety of small devices (sensors, cameras, etc.). It would be a hard requirement to pre-define a schema upfront for each device before writing data into the system. Further, the value of the data is likely to decrease significantly over time; data within hours/days is way more important than that of weeks/months ago. It's unimaginable to wait for weeks to run a data cleaning/preparation job before users could query such data. In other words, for applications with requirements of flexibility and time-sensitivity, 'schema-on-read' provides a huge benefit compared with the traditional ETL-then-query approach.
>
> Drill's schema-on-read is actually trying to solve a rather hard problem, in that we deal with not only relational types but also nested types. In that sense, Drill is walking in uncharted territory where not many others are doing similar things. Dealing with undocumented/unstructured data is a big challenge. Although Drill's solution is not perfect, IMHO it's still a big step towards solving such a problem.
>
> With that said, I agreed with the points people raised earlier. In addition to "schema-on-read", Drill has to do a better job of handling the traditional cases where the schema is known beforehand, by introducing a meta-store/catalog, or by allowing users to declare a schema upfront (I probably would not call Drill "schema-forbidden"). The restart strategy also seems interesting for handling failures caused by a missing schema / schema change.
>
> On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Well, the restart strategy still works for your examples. And you only pay once. From then on, you look at the cached type information and use an upper-bound data type as you read the data. Since it works to read the values in the right order, it is obviously possible to push down typing information even into the JSON reader.
> >
> > On Tue, Apr 3, 2018, 21:42 Paul Rogers <par0...@yahoo.com.invalid> wrote:
> >
> > > Subtle point. I can provide a schema with Parquet, as you note. (Actually, for Parquet, Drill is schema-required: I can't not provide a schema, due to the nature of Parquet...)
> > >
> > > But I can't provide a schema for JSON, CSV, etc. The point is, Drill forbids the user from providing a schema; only the file format itself can provide the schema (or not, in the case of JSON). This is the very heart of the problem.
> > >
> > > The root cause of our schema change exception is that vectors are, indeed, strongly typed. But file columns are not. Here is my favorite:
> > >
> > > {x: 10} {x: 10.1}
> > >
> > > Blam! The query fails because the vector is chosen as BigInt, then we discover it really should have been Float8. (If the answer is: go back and rebuild the vector with the new type, consider the case where 100K records separate the two above, so that the first batch is long gone by the time we see the offending record.) If only I could tell Drill to use Float8 (or Decimal) up front...
> > >
> > > Views won't help here because the failure occurs before a view can kick in. However, presumably, I could write a view to handle a different classic case:
> > >
> > > myDir /
> > > |- File 1: {a: 10, b: "foo"}
> > > |- File 2: {a: 20}
> > >
> > > With query: SELECT a, b FROM myDir
> > >
> > > For File 2, Drill will guess that b is a Nullable Int, but it is really VarChar. I think I could write clever SQL that says:
> > >
> > > If b is of type Nullable Int, return NULL cast to nullable VarChar, else return b
> > >
> > > The irony is that I must write procedural code to declare a static attribute of the data. Yet SQL is otherwise declarative: I state what I want, not how to implement it.
> > >
> > > Life would be so much easier if I could just say, "trust me, when you read column b, it is a VarChar."
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > I don't see why you say that Drill is schema-forbidden.
> > >
> > > The Parquet reader, for instance, makes strong use of the implied schema to facilitate reading of typed data.
> > >
> > > Likewise, the vectorized internal format is strongly typed and, as such, uses schema information.
> > >
> > > Views are another way to communicate schema information.
> > >
> > > It is true that you can't, say, view comments on fields from the command line. But I don't understand saying "schema-forbidden".
> > >
> > > On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers <par0...@yahoo.com.invalid> wrote:
> > >
> > > > Here is another way to think about it. Today, Drill is "schema-forbidden": even if I know the schema, I can't communicate that to Drill; Drill must figure it out on its own, making the same mistakes every time on ambiguous schemas.
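P.S. The {x: 10} / {x: 10.1} case Paul describes above is easy to reproduce outside Drill. Here is a small Jackson-based sketch (not Drill's reader, just an illustration of why committing to a column type from the first record is fragile):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FirstRecordTrap {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String[] records = {"{\"x\": 10}", "{\"x\": 10.1}"};

        for (String json : records) {
            JsonNode x = mapper.readTree(json).get("x");
            // Record by record, the same field looks like a different numeric type;
            // a reader that fixes the column type on the first record picks an
            // integer vector and then has nowhere to put 10.1.
            System.out.printf("%s  integral=%b  floatingPoint=%b%n",
                    x, x.isIntegralNumber(), x.isFloatingPointNumber());
        }
    }
}
```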