Re: Avro - Schema is good - Schema validation is bad

Kamesh Thu, 17 Dec 2015 18:17:37 -0800

If there are any suggestions, can we take them to the JIRA? I believe there is
already a JIRA for this:
https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070



On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <[email protected]>
wrote:

> Hi,
>
> Directory pruning is great. It allows us, for example, to do efficient
> date-range queries even when our data is arranged in a day or week based
> directory structure.
>
> We would like to be able to run the same query for all this data even
> though the schema has changed slightly (new fields added) over time.
>
> For me there are two things in this scenario that are unreasonable:
>
>    1. For Drill to have to get the schema for all possible files (union
>    based) to validate queries
>    - adding 100s of *irrelevant* files to the mix
>
>    2. For Drill to fail the query because a field is not found in the
>    sub-set (directory pruned sub-set)
>
> The current approach results in option 2 and the proposed solution results
> in option 1 (as I understand it).
>
> We would be perfectly happy with unknown fields resulting in null as there
> are many ways to deal with null values built into Drill.
>
> Hopefully this a) makes sense and b) is acceptable.
>
> Enforcing a strict schema for Avro could be an optional feature (IMO).
>
> Regards,
>   -Stefán
>
> On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > I think the main problem you're hitting is that we should do a union of all
> > files. In that case, as long as the field is in at least one file, we're
> > going to let the field through.
> >
> > There is a balance between early termination and flexibility that we must
> > provide. If someone types a field and it is guaranteed to not be in the
> > data, the thinking is we should fail the query early as that is probably a
> > mistake on the user's part. If it could be a valid field, we proceed with
> > execution and null it out until we find something. That is the goal
> > anyway. Clearly we have a bug here as we should never deny a possible or
> > known field.
> >
> > I think of fields in three categories: known, possible, impossible.
> > Impossible fields should fail to validate. Possible and known fields should
> > validate and execute.
> >
> > With regard to Ted's concern: I agree that applying a filter shouldn't
> > fail a query. That means we will either have to consider the complete union
> > schema before pruning files or consider all fields as either known or
> > possible after pruning files.
> >
> > Stefan, if you haven't already, please open a bug that known fields are
> > failing to validate in Avro and we will fix it shortly. Sorry about the bug.
> >
> > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]>
> > wrote:
> >
> > > Well, at least I'm not alone here.
> > >
> > > I think it must be time to set some ground rules for these things: what
> > > it means to support an evolving schema and what is needed to eliminate ETL.
> > >
> > > I trust that enforcing a strict schema "just because we think we can" must
> > > go against the principles of such rules.
> > >
> > > We moved all our stuff to Avro to avoid various problems with type handling
> > > (assuming Double on nulls etc.) and to be hit with this, after all that
> > > work, is like a slap in the face with two pilchards (more here:
> > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > >
> > > Regards,
> > >  -Stefán
> > >
> > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > > interpretation forward.
> > > >
> > > >
> > > >
> > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]>
> > > > wrote:
> > > >
> > > > > /me sighs of relief
> > > > >
> > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Actually, even without multiple storage types, this could be radically
> > > > > > confusing.
> > > > > >
> > > > > > If I have many Avro files that are partitioned into directories, then
> > > > > > queries that use the partitioning to limit the files that I see could
> > > > > > include or exclude more recent files that have added a new field.
> > > > > >
> > > > > > That means that a query would succeed or fail according to which date
> > > > > > range I use for the query.
> > > > > >
> > > > > > That seems pretty radically bad.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This simply cannot be the desired behavior!
> > > > > > >
> > > > > > > This prevents using a field from a changing schema with dir0
> > > > > > > sub-selection (directory pruning), as the altered/full schema is never
> > > > > > > part of the query and it subsequently fails.
> > > > > > >
> > > > > > > Drill should, IMO, never have rules that are dependent on the underlying
> > > > > > > storage type. If the query runs with JSON and Parquet then it should work
> > > > > > > for Avro as well.
> > > > > > >
> > > > > > > I'm hoping this strict schema validation is all just a misunderstanding.
> > > > > > >
> > > > > > > Regards,
> > > > > > >  -Stefán
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]> wrote:
> > > > > > >
> > > > > > > > For Avro files, we first construct the schema, and this schema is used
> > > > > > > > for validating queries. So, if there are any errors in the query (such as
> > > > > > > > invalid field references), it will fail fast. As of now, for other file
> > > > > > > > formats, query validation (checking for invalid field references) does
> > > > > > > > not happen; the schema is constructed at run time, and hence invalid
> > > > > > > > fields come back as nulls.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > >
> > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > >
> > > > > > > > > It's true that the field is in none of the tables I'm targeting in that
> > > > > > > > > particular query, but that does not mean that it is in none of the
> > > > > > > > > possible files I could be querying.
> > > > > > > > >
> > > > > > > > > We use Avro to get the benefits of the schema but I never expected Drill
> > > > > > > > > to enforce it this way.
> > > > > > > > >
> > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > >
> > > > > > > > > This makes no sense to me as I think a fundamental trait of Drill, when
> > > > > > > > > trying to eliminate ETL, is to return null for any missing fields.
> > > > > > > > >
> > > > > > > > > Please advise.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >  -Stefán
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Kamesh.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
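
For reference, here is a rough sketch of the known / possible / impossible
distinction from the quoted discussion, together with the "unknown fields
become null" behaviour Stefán asks for. This is not Drill's implementation.
The names (FieldStatus, classify_field, project) and the representation of
each file's schema as a plain set of field names are assumptions made only for
illustration, as is the reading that "known" means present in the pruned files
and "possible" means present somewhere in the full set of files.

```python
from enum import Enum


class FieldStatus(Enum):
    KNOWN = "known"            # present in at least one file that survived pruning
    POSSIBLE = "possible"      # absent from the pruned files, present elsewhere
    IMPOSSIBLE = "impossible"  # present in no file at all


def classify_field(field, pruned_schemas, all_schemas):
    """Classify a field name against per-file schemas.

    Each schema is a set of field names (a hypothetical stand-in for an
    Avro schema). pruned_schemas covers the files left after directory
    pruning; all_schemas covers every file under the table root.
    """
    pruned_union = set().union(*pruned_schemas)
    full_union = set().union(*all_schemas)
    if field in pruned_union:
        return FieldStatus.KNOWN
    if field in full_union:
        return FieldStatus.POSSIBLE
    return FieldStatus.IMPOSSIBLE


def project(record, fields):
    """Select fields from a record, yielding None for any missing field."""
    return {name: record.get(name) for name in fields}


# Two monthly directories; the newer one added 'some_col'.
schemas_by_dir = {
    "2015-11": {"id", "ts"},
    "2015-12": {"id", "ts", "some_col"},
}
pruned = [schemas_by_dir["2015-11"]]       # query restricted to dir0 = '2015-11'
everything = list(schemas_by_dir.values())

status_some_col = classify_field("some_col", pruned, everything)
status_id = classify_field("id", pruned, everything)
status_typo = classify_field("sme_col", pruned, everything)
row = project({"id": 1, "ts": 1450000000}, ["id", "some_col"])
```

Under these assumptions only the misspelled 'sme_col' would be rejected at
validation time. 'some_col' would validate and simply come back as null for
the older directory.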



-- 
Kamesh.
