Hello,

Thank you, Paul, for starting this discussion.
However, I was not clear on the latest point as to how providing hints
differs from creating a view (a mechanism that already exists in Drill).
I do think that creating a view can be cumbersome in terms of syntax.
Hints are ephemeral, so they are handy for quickly validating a schema
for a single query execution. But if the user knows the schema for
certain, then creating a view and using it might be the better option.
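
For instance, with a headerless CSV file, a typed view today takes
something like the following (names and paths are made up just to
illustrate the syntax):

CREATE VIEW dfs.tmp.weblog_v AS
SELECT CAST(columns[0] AS INT) AS user_id,
       CAST(columns[1] AS TIMESTAMP) AS event_ts
FROM dfs.`/data/weblogs/access.csv`;
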
Can you please share your thoughts on this?

Thank you, Ted, for your valuable suggestions. Regarding your comment
that "metastore is good but centralized is bad": could you please share
your view on what design issues a centralized metastore can cause? I
know that it can be a bottleneck, but I would like to understand the
other issues as well.
Put another way: if a centralized metastore were engineered well enough
to avoid most of the bottlenecks, do you think it would then be a good
way to manage metadata?

Thanks,
-Hanu

On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>
> Or, a SQL extension:
>
> SELECT a FROM myTable(a: INT)
>
> Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>
> All these are ephemeral and not compatible with, say, Tableau.
>
> Building on Ted's suggestion of using the (distributed) file system we can
> toss out a few half-baked ideas. Maybe use a directory to represent a name
> space, with files representing tables. If I have "weblogs" as my directory,
> I might have a file called "jsonlog" to describe the (messy) format of my
> JSON-formatted log files. And "csvlog" to describe my CSV-format logs.
> Different directories represent different SQL databases (schemas),
> different files represent tables within the schema.
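>
> Sketched out with the names above, the layout might be:
>
> weblogs/         <- directory = SQL schema (name space)
>     jsonlog      <- table file describing the JSON-format logs
>     csvlog       <- table file describing the CSV-format logs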
>
>
> The table file can store column hints. But it could do more. Maybe
> define the partitioning scheme (by year, month, day, say) so that it can
> be mapped to a column. Wouldn't it be great if Drill could figure out
> the partitioning itself if we gave it a date range?
>
> The file could also define the format plugin to use, and its options, to
> avoid the need to define the format separately from the data, and to
> reduce the need for table functions.
>
> Today, Drill matches files to format plugins using only extensions. The
> table file could provide a regex for those old-style files (such as real
> web logs) that don't use suffixes. Or, to differentiate between "sales.csv"
> and "returns.csv" in the same data directory.
>
>
> While we're at it, the file might as well contain a standard view to apply
> to the table to define computed columns, do data conversions and so on.
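>
> Pulling those ideas together, a table file might look something like
> this (a half-baked sketch, not a proposal for a real format; every name
> here is invented):
>
> {
>   "format": "csv",
>   "formatOptions": { "extractHeader": false },
>   "filePattern": "access_log.*",
>   "columns": [
>     { "name": "user_id", "type": "INT" },
>     { "name": "event_ts", "type": "TIMESTAMP" }
>   ],
>   "partitionBy": [ "year", "month", "day" ],
>   "view": "SELECT user_id, event_ts FROM _table_ WHERE user_id IS NOT NULL"
> }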
>
> If Drill does automatic scans (to detect schema, to gather stats), maybe
> store that alongside the table file: "csvlog.drill" for the
> Drill-generated info.
>
>
> Voila! A nice schema definition with no formal metastore. Because the
> info is in files, it is easy to version using git, etc. (especially if
> the directory can be mounted using NFS as a normal directory). Atomic
> updates can be done via the rename trick: write the new version to a
> temp file, then rename it over the old one (which, sadly, does not work
> on S3...)
>
>
> Or, maybe store all information in ZK in JSON as we do for plugin
> configurations. (Hard to version and modify though...)
>
>
> Lots of ways to skin this cat once we agree that hints are, in fact,
> useful additions to Drill's automatic schema detection.
>
>
> Thanks,
> - Paul
>
>
>
>     On Thursday, April 5, 2018, 3:22:07 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:
>
> > Hello,
> >
> > A lot of versioning problems arise when trying to share data through
> > Kafka between multiple applications with different lifecycles and
> > maintainers, since by default a single message in Kafka is just a blob.
> > One way to solve that is to agree on a single serialization format that
> > is friendly to record-per-record storage (like Avro) and, in order not
> > to have to serialize the schema with every message, to just reference
> > an entry in the Avro Schema Registry (this flow is described here:
> > https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> > ).
> > On top of the schema registry, specific client libs allow validating
> > the message structure prior to injection into Kafka.
> > So while Comcast mentions the usage of an Avro schema to describe its
> > feeds, it does not directly mention the usage of Avro files (to
> > describe the schema).
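> >
> > (For illustration, a registry entry is just an Avro schema stored under
> > a numeric id -- everything below is invented:
> >
> > { "type": "record", "name": "WebEvent", "fields": [
> >     { "name": "userId", "type": "long" },
> >     { "name": "url",    "type": "string" } ] }
> >
> > Each Kafka message then carries only the small schema id, not the full
> > schema.)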
> >
>
> This is all good except for the assumption of a single schema for all
> time. You can mutate schemas in Avro (or JSON) in a future-proof manner,
> but it is important to recognize the simple truth that the data in a
> stream will not necessarily be uniform (and is even unlikely to be
> uniform).
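>
> (Avro keeps that future-proof by giving newly added fields a default, so
> records written under an older schema still read cleanly. A minimal
> illustration, field name invented:
>
>   { "name": "referrer", "type": ["null", "string"], "default": null }
>
> A reader using the new schema fills in the default when the field is
> absent from older records.)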
>
>
>
>
> >
> > .... But the usage of CSV/JSON is still problematic. I like the idea of
> > having an optional way to describe the expected types somewhere (either
> > in a central meta-store, or in a structured file next to the dataset).
> >
>
> Central meta-stores are seriously bad problems and are the single biggest
> nightmare in trying to upgrade Hive users. Let's avoid that if possible.
>
> Writing meta-data next to the file is also problematic if it needs to be
> written by the process doing a query (the directory may not be
> writable).
>
> Having a convention for redirecting the meta-data cache to a parallel
> directory might solve the problem of non-writable local locations.
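>
> For example (a convention invented purely for illustration): the
> meta-data for /data/weblogs could live under /drill/metadata/data/weblogs,
> leaving the data directory itself read-only.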
>
> In the worst case, where Drill has no place to persist what it has
> learned but wants to do a restart, there needs to be SOME place to cache
> meta-data, or else restarts will get no further than the original failed
> query.
>
>
