Hello and thank you.

Please also put this in the Avro context. That is quite revealing when
thinking about evolving-schema support in Drill, not just for Avro but in
general.

Avro supports evolving schema and some might say it's designed around
exactly that.

The current handling of Avro in Drill is stricter than Avro itself. Let
me explain.

A data-set in Avro that has an evolving schema cannot be queried by Drill
unless all schema headers are examined. A column that was dropped will not
be accepted in queries unless the query includes files with that column in
their schema.
(This could easily happen with directory pruning, etc.)
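To make the contrast concrete, here is Avro-style schema resolution sketched in plain Python (a hedged illustration with made-up field names, not the Avro library itself): a record written under an old schema stays readable under a newer reader schema that drops one field and adds another with a default.

```python
# Sketch of Avro-style schema resolution (illustrative, no Avro library):
# the 2016 writer schema had fields a and b; the 2017 reader schema
# dropped a and added d with a default of 0.
reader_defaults = {"b": None, "d": 0}  # reader fields -> declared defaults

def resolve(record, reader_defaults):
    # Fields unknown to the reader (a) are skipped; fields missing from
    # the written record (d) take the reader's declared default.
    return {name: record.get(name, default)
            for name, default in reader_defaults.items()}

old_record = {"a": 1, "b": "x"}  # written under the 2016 schema
print(resolve(old_record, reader_defaults))  # {'b': 'x', 'd': 0}
```

This is exactly the resolution an Avro reader performs on its own; Drill's Avro plugin does not take advantage of it.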

Support for evolving schema and the goal of eliminating the ETL step were
very much a part of Drill from the beginning, and I have the hardest time
wrapping my head around a discussion focused on whether it is needed rather
than the discussion I believe is missing: how it can best be provided.

Support for evolving schema has very little to do with Avro alone; it's a
key trait, and one Drill needs to decide whether to keep.

Do you, for example, know that Drill cannot sort a SELECT query on a
column if another column is suspected of a schema change, even when that
column is not used in the sort and the query runs just fine without it (the
sort column is "clean")?

Regards,
  -Stefán

On Sat, Aug 19, 2017 at 9:18 PM, Paul Rogers <prog...@mapr.com> wrote:

> Hi All,
>
> Let’s also talk about the other core issue that Stefan raised: schema
> handling.
>
> Drill is very powerful in its use of JSON as its internal data format.
> Drill effortlessly handles maps and arrays as well as the more traditional
> scalar types.
>
> This can cause confusion, however. We all know that JSON files have no
> schema. When used for its original purpose, a web server app and a browser
> app agree on the format of the JSON that they exchange. When used as a file
> format, the app that writes the files knows the meaning of the file.
>
> When Drill reads JSON files, it has none of that context: Drill reads each
> JSON file from first principles. As a result, Drill handles arbitrary
> schemas and must handle schema changes as the read proceeds. (This is, in
> fact, the definition of “schema-on-read.”) For example, files from 2016,
> say, might have fields a, b and c. In 2017, the app that creates the files
> was revised to deprecate a, and add d and e. Similar reasoning applies to
> other file formats. Files don’t have a DBMS to keep the format consistent
> and upgrade formats when things change.
>
> Drill tries hard to handle such schema changes. There are many techniques,
> such as those that Stefan mentioned, that can handle point solutions. But,
> even conversion to a common type has limits. What if field a started as a
> single integer, but a later revision took a poorly-considered decision to
> make a into an array of integers? Converting to a string is not really that
> helpful. Instead, Drill should know to treat column a as an array even if
> it has 0 or 1 values.
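Paul's field-a example can be made concrete with a small sketch (plain Python, illustrative field names) of the kind of declared rule that would let a reader always treat a as an array, whether it was written as missing, a scalar, or a list:

```python
import json

# Records as they might appear across file generations: field "a" starts
# as a scalar, later becomes an array, and is sometimes absent entirely.
lines = [
    '{"a": 1, "b": "x"}',
    '{"a": [2, 3], "b": "y"}',
    '{"b": "z"}',
]

def a_as_array(rec):
    # The declared rule: "a" is always an array, even with 0 or 1 values.
    a = rec.get("a")
    if a is None:
        return []
    return a if isinstance(a, list) else [a]

print([a_as_array(json.loads(line)) for line in lines])  # [[1], [2, 3], []]
```

The point is that the rule is stated up front, rather than inferred record by record.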
>
> We’ve discussed other gotchas: Drill reads files in random order, so some
> queries may cause Drill to see, say, a number, then a string for a given
> column, while other queries may see these types in the opposite order.
> Some operators can handle schema changes, but others cannot. If one filter
> includes both types, but another filter sees only one type, then the query
> can produce different results (because, say, the collation order for
> strings differs from that of integers.) And so on for many cases.
>
> To some degree, users can work around such issues with casts, conditional
> expressions and so on. But, every query needs exactly the same casts and
> logic, which just pushes the complexity onto the poor user.
>
> IMHO, we can never solve the schema change issue as a series of point
> solutions. Even the best point solutions ultimately require that Drill
> predict the future (what will happen a billion records into the file), or
> that Drill take multiple passes over the data to first determine the
> common schema, then run the query. Neither is realistic.
>
> Let us also remember that Drill’s primary use case is to power JDBC and
> ODBC: interfaces that demand schema up front and that don’t (easily)
> tolerate schema change.
>
> So, here is where we can all do some serious brainstorming. How do we
> square this circle? How do we ensure a known, single schema for xDBC
> clients without predicting the future?
>
> It would seem that the ultimate solution requires that Drill know, up
> front, how to handle the schema so that it does not run into surprises.
> That is, rather than predict the future, Drill is told the rules that unify
> the data that it will scan. To be a bit philosophical: Drill would have a
> theory of the data rather than just empirically observing each row as it
> shows up.
>
> Folks have made several suggestions for how this might work in practice:
>
> 1. Live with schema changes for data exploration. Use a tool that can
> handle them. (JDBC can do so: it allows a single query to return multiple
> result sets. Most tools, however, are not set up for this use case.) So,
> would we need specialized tools? How likely is that to happen?
>
> 2. Leverage the fact that Drill is not used only once per file. If we scan
> a file a second time, can we somehow use what we learned the first time?
> There seems no value when, on the 100th scan of a file, Drill says, “Gee,
> never seen this one before, let’s see what data it might contain…”
>
> 3. Leverage Drill’s existing mechanisms to gather a schema. Parquet
> provides a schema and Drill’s Parquet metadata cache can provide a schema.
> Once Drill provides statistics (“stats”), Drill will scan files and, in so
> doing, will explore the schema of the file.
>
> 4. Do what other tools do and leverage the schema already defined in the
> Hive metastore. Hive then becomes the central place to define schemas
> across not just Drill, but other tools as well.
>
> 5. Leverage views as a way of fixing the schema. The view can define rules
> for handling missing columns and for cleaning up data types.
>
> 6. Add a schema definition layer to Drill in which users can spell out
> (for the example cited above), that in file X column a is to always be
> treated as an array, even if it is missing or has just one value. Or,
> provide names to unnamed or poorly named columns. Specify conversion rules
> from string to, say, Date format for specific columns. Etc.
>
> Each solution has its advantages and disadvantages. The real question is:
> what do Drill users want and need? Stefan indicates that incomplete schema
> handling is an issue for him. Is this true for other users as well? Which
> of the solutions above would help? Or, is there some other solution that
> would be even better? (Perhaps leveraging a third-party solution such as
> Looker, AtScale, etc.?)
>
> Thoughts?
>
> Thanks,
>
> - Paul
>
> > On Aug 18, 2017, at 3:55 PM, Saurabh Mahapatra <saurabhmahapatr...@gmail.com> wrote:
> >
> > Thank you for this candid feedback, Stefan. The fact that you even
> > decided to write an email offering this feedback despite moving away
> > from Drill just suggests to me that you are still a supporter. We need
> > all the help that we can get from every member in this community to
> > make Drill provide value to all users, including you.
> >
> > I am new to the community, but I have looked at your emails where your
> > past attempts at doing this have not taken you anywhere. We have to
> > change that.
> >
> > We cannot undo the past as far as addressing your needs is concerned,
> > but I want to assure you that we are bringing reform to the community
> > in general. The stakeholders who are impacted by Drill have grown
> > beyond the small group that existed a couple of years ago. So rest
> > assured that you have a voice here.
> >
> > I think the biggest challenge we have in the community is that there
> > are users who could get a lot of value if some work were done to
> > support integrations. I know for sure that there are many developers
> > who would love to participate in this community and do the work for a
> > modest fee. It helps them get interested in the project, helps them
> > provide support beyond just the open source aspect, and also helps
> > users such as you get the value that you need where you need it.
> >
> > Please let me know if you would be willing to pursue that route.
> >
> > On the Avro front, I do hear a lot of users asking for it, but I hear
> > a lot more requests for Parquet. Plus, there are core issues in Drill
> > that need to be addressed first. The community is definitely trying to
> > prioritize given what we have. But we do not have to feel constrained.
> > We can get more developers to participate in this and help out. And I
> > am very positive about that approach: I know that I helped a user here
> > get help using Apache Drill in a commercial setting where their asks
> > were very specific.
> >
> > Those are my thoughts, but please do not give up on us. Your critical
> > feedback may not sound nice to the ears, but it is exactly the kind of
> > feedback that will make this project truly successful.
> >
> > Best,
> > Saurabh
> >
> >
> > On Fri, Aug 18, 2017 at 1:42 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> >> Hi John,
> >>
> >> Love Drill but we no longer use it in production as our main query tool.
> >>
> >> I do have a fairly long list of pet peeves, but I also have a long
> >> list of features that I love and would not want to be without.
> >>
> >> In my opinion it's time for Drill to decide where its commitment lies
> >> regarding evolving schema and ETL elimination, and whether it wants to
> >> be something more than a cog in a Hadoop distribution wheel or an
> >> effort some see as a way to their startup stardom.
> >>
> >> There is no denying the great effect it has had and its usefulness
> >> (Arrow is also making waves now). I am, as I have been, just
> >> frustrated by shortcomings I feel are not addressed because they are
> >> addressed elsewhere (where the true loyalties lie).
> >>
> >> I can name a few (I have not upgraded to 1.11):
> >>   - Empty values still default to double for partial/segment lists,
> >> which triggers all sorts of problems (no attempt is made to convert
> >> values to the lowest common denominator (string))
> >>   - Two NullableX values both containing nothing (null) still produce
> >> schema change errors instead of waiting for a type to become apparent
> >>   - Syntax error reporting is terrible
> >>   - Schema change reporting is almost absent
> >>   - Avro schema is fixed/strict even though text formats support an
> >> evolving/variable schema (with all sorts of side effects)
> >>   - Avro still does not support dirN
> >>
> >> and so many more things (not to mention the politics and the
> >> defensive attitude when trying to address shortcomings).
> >>
> >> My only regret here is that I never had the proper resources to
> >> contribute a fix for some of these.
> >>
> >> All the best,
> >> -Stefán
> >>
> >> On Thu, Aug 17, 2017 at 2:20 PM, Charles Givre <cgi...@gmail.com> wrote:
> >>
> >>> I’m not an Avro user, but I’d definitely vote for improving this.
> >>> — C
> >>>
> >>>> On Aug 17, 2017, at 10:17, John Omernik <j...@omernik.com> wrote:
> >>>>
> >>>> I was guessing you would chime in with a response ;)
> >>>>
> >>>> Are you still using Drill with Avro? How have things been lately?
> >>>>
> >>>> On Thu, Aug 17, 2017 at 8:00 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> >>>>
> >>>>> woha!!!
> >>>>>
> >>>>>
> >>>>> (sorry, I just had to)
> >>>>>
> >>>>>
> >>>>> Best of luck with that!
> >>>>>
> >>>>> Regards,
> >>>>> -Stefán
> >>>>>
> >>>>> On Thu, Aug 17, 2017 at 12:37 PM, John Omernik <j...@omernik.com> wrote:
> >>>>>
> >>>>>> I know Avro is the unwanted child of the Drill world. (I know
> >>>>>> others have tried to mature the Avro support, and it is still in
> >>>>>> an "experimental" state.)
> >>>>>>
> >>>>>> That said, isn't it time for us to clean it up?
> >>>>>>
> >>>>>> I am sure there are some open JIRAs out there (the last doc update
> >>>>>> on the Avro page, Nov 21, 2016, points to this):
> >>>>>> https://issues.apache.org/jira/browse/DRILL/component/12328941/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
> >>>>>>
> >>>>>> And I just ran into an issue... I am going to run it by here to
> >>>>>> see if it's JIRA-worthy or known:
> >>>>>>
> >>>>>> I have two directories, one JSON (brodns) and one Avro (brodnsavro).
> >>>>>>
> >>>>>> They both have subdirectories that are YYYY-MM-DD dates.
> >>>>>>
> >>>>>> When I run
> >>>>>>
> >>>>>> select dir0, count(*) from `brodns` group by dir0 - this works
> >>>>>> great!
> >>>>>>
> >>>>>> when I run
> >>>>>>
> >>>>>> select dir0, count(*) from `brodnsavro` group by dir0 - I get:
> >>>>>>
> >>>>>> VALIDATION ERROR: From line 1, column 58 to line 1, column 61:
> >>>>>> Column 'dir0' not found in any table
> >>>>>>
> >>>>>> If I run
> >>>>>>
> >>>>>>
> >>>>>> select count(*) from `brodnsavro/2017-08-17` this works
> >>>>>>
> >>>>>> if I run
> >>>>>>
> >>>>>>
> >>>>>> select count(*) from `brodnsavro` this also works
> >>>>>>
> >>>>>>
> >>>>>> But dir0 doesn't appear to be applied to Avro.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> I really feel this should be consistent (in addition to fixing the
> >>>>>> other issues in Avro), and let's make Avro a first-class citizen of
> >>>>>> the Drill world.
> >>>>>>
> >>>>>>
> >>>>>> (If folks are interested, I'd be happy to discuss my use case. It
> >>>>>> involves applying a schema to JSON records on kafka/maprstreams in
> >>>>>> StreamSets, and then outputting to Avro files... from there I hope
> >>>>>> to convert to Parquet, but I don't want to use MapReduce, hence
> >>>>>> Drill!)
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
>
>
