On Mon, Apr 2, 2018 at 10:54 AM, Aman Sinha <amansi...@apache.org> wrote:

> ...
> Although, one may argue that XML died because of the weight of the extra
> structure added to it and people just gravitated towards JSON.
>

My argument would be that it died because it couldn't distinguish well
between an element and a list of elements of length 1.

JSON avoids that kind of problem.
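As a contrived illustration, consider an XML fragment like:

  <order>
    <item>book</item>
  </order>

Without a schema there is no way to tell whether <item> is a single field
or a one-element list, so generic tooling has to guess. JSON forces the
writer to say which one it is:

  {"order": {"item": "book"}}    vs.    {"order": {"item": ["book"]}}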


> In that respect,  Avro provides a good middle ground.   A similar approach
> is taken by MapR-DB  JSON database which has data type information for the
> fields of a JSON document.
>

True that.

But another middle-ground representation is JSON with a side file
describing type information derived when the file was previously read.
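For example, a first read of a (hypothetical) orders.json could leave
behind a sidecar along these lines; the layout and names here are made up
just to show the idea:

  {
    "file": "orders.json",
    "columns": {
      "id": "BIGINT",
      "amount": "FLOAT8",
      "note": "VARCHAR, nullable, missing in most records"
    }
  }

Later reads would consult the sidecar first and only fall back to guessing
for columns it doesn't mention.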

> That said, we still have to (a) deal with JSON data which is one of the
> most prevalent format in big data space and (b) still have to handle schema
> changes even with Avro-like formats.
>

This is a big deal.

To some degree, a lot of this can be handled by two simple mechanisms:

1) Record what we learn when scanning a file. That is, if a column is null
(or missing) until the final record, where it turns out to be a float,
remember that. This effectively lets subsequent queries look further ahead
when deciding what type a column really has.

2) Allow queries to be restarted when we discover that type assumptions are
untenable. Currently, "schema change" is what we call the situation where
we can't really recover from mistaken assumptions that were derived
incrementally as we scanned the data. If we had (1), the type information
gathered up to the point where the schema change was noticed could be
preserved. That means we could restart the query already knowing about the
data types that would otherwise cause a schema change exception later on.
In many cases, that would let us avoid the exception entirely on the second
pass through the data.
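To make that concrete, here is a rough sketch of the two mechanisms working
together. This is not Drill's actual code; it is a toy in Java where the
file name, the type names, the widening rule, and the in-memory map that
stands in for a persisted side file are all made up for illustration:

import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaLearningScan {

  // In-memory stand-in for a per-file "side file": column name -> learned type.
  private static final Map<String, Map<String, String>> LEARNED = new HashMap<>();

  static class SchemaChangeException extends RuntimeException {
    SchemaChangeException(String msg) { super(msg); }
  }

  private static String typeOf(Object v) {
    if (v == null) return "NULL";
    if (v instanceof Double || v instanceof Float) return "FLOAT8";
    if (v instanceof Long || v instanceof Integer) return "BIGINT";
    return "VARCHAR";
  }

  // Can a value of type `seen` go into a column already committed as `committed`?
  private static boolean fits(String committed, String seen) {
    return seen.equals("NULL")                                    // nulls fit any type
        || seen.equals(committed)
        || (committed.equals("FLOAT8") && seen.equals("BIGINT")); // ints widen to doubles
  }

  // One pass over the data, seeded with whatever earlier passes learned (mechanism 1).
  static Map<String, String> scan(String file, List<Map<String, Object>> records) {
    Map<String, String> committed =
        new LinkedHashMap<>(LEARNED.getOrDefault(file, Map.of()));
    for (Map<String, Object> record : records) {
      for (Map.Entry<String, Object> field : record.entrySet()) {
        String col = field.getKey();
        String seen = typeOf(field.getValue());
        // First sighting: commit to the observed type, or a default guess for nulls.
        committed.putIfAbsent(col, seen.equals("NULL") ? "BIGINT" : seen);
        if (!fits(committed.get(col), seen)) {
          committed.put(col, seen);        // remember the surprise before giving up
          LEARNED.put(file, committed);
          throw new SchemaChangeException(col + " is really " + seen);
        }
      }
    }
    LEARNED.put(file, committed);          // remember the final verdict for next time
    return committed;
  }

  // Mechanism 2: if assumptions proved untenable, restart once, now informed.
  static Map<String, String> scanWithRestart(String file, List<Map<String, Object>> records) {
    try {
      return scan(file, records);
    } catch (SchemaChangeException e) {
      return scan(file, records);          // second pass commits the right types up front
    }
  }

  public static void main(String[] args) {
    // The example from point 1: a column that is null until the last record,
    // where it turns out to be a float.
    Map<String, Object> r1 = new LinkedHashMap<>();
    r1.put("id", 1L);
    r1.put("amount", null);
    Map<String, Object> r2 = new LinkedHashMap<>();
    r2.put("id", 2L);
    r2.put("amount", null);
    Map<String, Object> r3 = new LinkedHashMap<>();
    r3.put("id", 3L);
    r3.put("amount", 9.99);
    List<Map<String, Object>> records = Arrays.asList(r1, r2, r3);

    System.out.println(scanWithRestart("orders.json", records));
    // Prints {id=BIGINT, amount=FLOAT8}.
  }
}

The first scan of orders.json hits the schema change, restarts once with
the learned types, and succeeds; any later scan of the same file commits
the right types up front and never restarts.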

In most cases, restarts would not be necessary. I know this because schema
change exceptions are currently pretty rare, and they would be rarer still
if we learned about file schemas from experience. Even when a new file
is seen for the first time, schema change wouldn't happen. As such, the
amortized cost of restarts would be very low. On the other hand, the
advantage of such a mechanism would be that more queries would succeed and
users would be happier.


> ...
> From Drill's perspective, we have in the past discussed the need for 2
> modes:
>  - A fixed schema mode which operates in a manner similar to the RDBMSs.
> This is needed not just to resolve ambiguities but also for performance.
> Why treat a column as nullable when data is non-nullable ?
>  - A variable schema mode which is what it does today...but this part needs
> to be enhanced to be *'declarative' such that ambiguities are removed.*   A
> user may choose not to create any declaration, in which case Drill would
> default to certain documented set of rules that do type conversions.
>

The restart suggestion above avoids the need for two modes while still
giving the performance of the fixed schema mode in most cases.
