Re: Avro - Schema is good - Schema validation is bad

Stefán Baxter Fri, 25 Dec 2015 08:07:06 -0800

Hi,

I appreciate being able to turn it off, thank you.


IMO scanning all the files in a directory structure sounds harmless until
you start using directory pruning, on large structures, when this becomes a
overhead.

I really think this is an essential part of an overdue discussion on the
"ground rules of Drill" and what is needed in an environment that
supports evolving
schema and aims to eliminate ETL.

I would suggest that all formats behave in the same way and that the
leniency allowed in Parquet, for example, is also allowed in Avro.

In other words, and to be a bit more direct, this should not be a format
specific thing and strict schema validation should never be done "just
because we can with XX format".

Happy holidays,
 -Stefan

On Fri, Dec 25, 2015 at 2:57 PM, Kamesh <kamesh.had...@gmail.com> wrote:

> For doing schema validation for Avro files, we are thinking of doing the
> following steps.Please provide your suggestions, before we implement this.
>
>    - Schema validation feature will be configurable and user can provide
>    whether he/she wants this feature or not by setting a configuration
>    property like PigAvroStorage
>    <https://cwiki.apache.org/confluence/display/PIG/AvroStorage>.
>    - If the schema validation flag is set, then we can consider the union
>    schema of all the files in a directory recursively.
>
>
> On Fri, Dec 18, 2015 at 9:17 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi Kamesh,
> >
> > This is, strictly speaking, not the same issue even though they have in
> > common the fact that the Avro schema validation aspect.
> >
> > Regards,
> >  -Stefán
> >
> > On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <kamesh.had...@gmail.com> wrote:
> >
> > > If there are any suggestion, can we take it in the JIRA. I feel, there
> is
> > > already JIRA for this.
> > >
> > >
> >
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
> > >
> > >
> > > On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <
> > ste...@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Directory pruning is great. It allows us, for example, to do
> efficient
> > > > date-range queries even when our data is arranged in a day or week
> > based
> > > > directory structure.
> > > >
> > > > We would like to be able to run the same query for all this data even
> > > > though the schema has changes slightly (new fields added) over time.
> > > >
> > > > For me there are two thing in this scenario that are unreasonable:
> > > >
> > > >    1. For Drill to have to get the schema for all possible files
> (union
> > > >    based) to validate queries
> > > >    - adding 100s of *irrelevant* files to the mix
> > > >
> > > >    2. For Drill to fail the query because a field is not found in the
> > > >    sub-set (directory pruned sub-set)
> > > >
> > > > The current approach results in option 2 and the proposed solution
> > > results
> > > > in option 1 (As I understand it)
> > > >
> > > > We would be perfectly happy with unknown fields resulting in null as
> > > there
> > > > are many ways to deal with null values built into Drill.
> > > >
> > > > Hopefully this a) makes sense and b) is acceptable.
> > > >
> > > > Enforcing a strict schema for Avro could be an optional feature
> (IMO).
> > > >
> > > > Regards,
> > > >   -Stefán
> > > >
> > > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <jacq...@dremio.com>
> > > > wrote:
> > > >
> > > > > I think the main problem your hitting is that we should do a union
> of
> > > all
> > > > > files. In that case, as long as the field is in a single file,
> we're
> > > > going
> > > > > to let the field through.
> > > > >
> > > > > There is a balancing between early termination and flexibility that
> > we
> > > > must
> > > > > provide. If someone types a field and it is guaranteed to not be in
> > the
> > > > > data, the thinking is we should fail the query early as that is
> > > probably
> > > > a
> > > > > mistake on the user's part.  If it could be a valid field, we
> proceed
> > > > with
> > > > > execution and null it out until we find something.  That is the
> goal
> > > > > anyway. Clearly we have a bug here as we should never deny a
> possible
> > > or
> > > > > known field.
> > > > >
> > > > > I think of fields in three categories: known, possible, impossible.
> > > > > Impossible fields should fail to validate. Possible and known
> fields
> > > > should
> > > > > validate and execute.
> > > > >
> > > > > With regards to Ted's concern: I agree that applying a filter
> > shouldn't
> > > > > fail a query. That means we will either have to consider the
> complete
> > > > union
> > > > > Schema before pruning files or consider all fields as either known
> or
> > > > > possible after pruning files.
> > > > >
> > > > > Stefan, if you haven't already, please open a bug that known fields
> > are
> > > > > failing to validate in Avro and we will fix shortly. Sorry about
> the
> > > bug.
> > > > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <
> ste...@activitystream.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Well, at least I'm not alone here.
> > > > > >
> > > > > > I think it must be time to set some ground rules for these things
> > and
> > > > > what
> > > > > > it means to support evolving schema and what is needed to
> eliminate
> > > > ETL.
> > > > > >
> > > > > > I trust that enforcing a strict schema "just because we think we
> > can"
> > > > > must
> > > > > > go against the principles of such rules.
> > > > > >
> > > > > > We moved all our stuff to Avro to avoid various problems with
> type
> > > > > handling
> > > > > > (assuming Double on nulls etc.) and to be hit with this, after
> all
> > > that
> > > > > > work, is like a slap in the face with two pilchards (more here:
> > > > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <
> > ted.dunn...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Sigh of relief is premature.  Nobody has committed to carrying
> > this
> > > > > > > interpretation forward.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <
> > > > > > ste...@activitystream.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > /me sighs of relief
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <
> > > > ted.dunn...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Actually, even without multiple storage types, this could
> be
> > > > > > radically
> > > > > > > > > confusing.
> > > > > > > > >
> > > > > > > > > If I have many avro files that are partitioned into
> > > directories,
> > > > > then
> > > > > > > > > queries that use the partitioning to limit the files that I
> > see
> > > > > could
> > > > > > > > > include or exclude more recent files that have added a new
> > > field.
> > > > > > > > >
> > > > > > > > > That means that a query would succeed or fail according to
> > > which
> > > > > date
> > > > > > > > range
> > > > > > > > > I use for the query.
> > > > > > > > >
> > > > > > > > > That seems pretty radically bad.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <
> > > > > > > > ste...@activitystream.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > This simply can not be the desired behavior!
> > > > > > > > > >
> > > > > > > > > > This prevents from using a field from a changing schema
> > with
> > > > dir0
> > > > > > > > > > sub-selection (directory pruning) as the altered/full
> > schema
> > > is
> > > > > > never
> > > > > > > > > part
> > > > > > > > > > of the query and it subsequently fails.
> > > > > > > > > >
> > > > > > > > > > Drill should, IMOP, never have rules that are dependent
> on
> > > the
> > > > > > > > underlying
> > > > > > > > > > storage type. If the query runs with JSON and Parquet
> then
> > it
> > > > > > should
> > > > > > > > work
> > > > > > > > > > for Avro as well.
> > > > > > > > > >
> > > > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > > > misunderstanding.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >  -Stefán
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <
> > > > kamesh.had...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > For Avro files, we first construct the schema, and this
> > > > schema
> > > > > is
> > > > > > > > used
> > > > > > > > > > for
> > > > > > > > > > > validating queries. So, if there are any errors in the
> > > query
> > > > > > (like
> > > > > > > > the
> > > > > > > > > > > invalid field references) it will fail fast. As of now,
> > for
> > > > > other
> > > > > > > > file
> > > > > > > > > > > formats, query validation (checking  for invalid field
> > > > > reference)
> > > > > > > > does
> > > > > > > > > > not
> > > > > > > > > > > happen, and at run time, it constructs the schema for
> > them
> > > > and
> > > > > > > hence
> > > > > > > > > > nulls
> > > > > > > > > > > for invalid fields.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <
> > > > > > > > > > ste...@activitystream.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm getting the following error when querying Avro
> > files:
> > > > > > > > > > > >
> > > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to
> line
> > > 1,
> > > > > > column
> > > > > > > > 57:
> > > > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > > > >
> > > > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > > > targeting,
> > > > > > > in
> > > > > > > > > > that
> > > > > > > > > > > > particular query, but that does not mean that it is
> in
> > > none
> > > > > of
> > > > > > > the
> > > > > > > > > > > possible
> > > > > > > > > > > > files I could be querying.
> > > > > > > > > > > >
> > > > > > > > > > > > We use Avro to get the benefits of the schema but I
> > never
> > > > > > > expected
> > > > > > > > > > Drill
> > > > > > > > > > > to
> > > > > > > > > > > > enforce it this way.
> > > > > > > > > > > >
> > > > > > > > > > > > Why do unresolved  columns not return null?
> > > > > > > > > > > >
> > > > > > > > > > > > This makes no sense to me as I think a fundamental
> > trade
> > > of
> > > > > > > Drill,
> > > > > > > > > when
> > > > > > > > > > > > trying to eliminate ETL, is to return null for any
> > > missing
> > > > > > > fields.
> > > > > > > > > > > >
> > > > > > > > > > > > Please advise.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >  -Stefán
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Kamesh.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Kamesh.
> > >
> >
>
>
>
> --
> Kamesh.
>

Re: Avro - Schema is good - Schema validation is bad

Reply via email to