It's an interesting idea, but there are major challenges with a per-row
schema.

1. Performance: the query optimizer and execution engine rely on assumptions
about the schema and the data to generate optimized query plans. Having to
re-reason about the schema for each row can substantially slow down the
engine, both because plan-time optimizations are lost and because of the
overhead of carrying schema information with every row.

2. Data model: a per-row schema is fundamentally a different data model. The
current relational model has gone through 40 years of research and has very
well defined semantics; I don't think there are well defined semantics for a
per-row-schema data model. For example, what are the semantics of a UDF
operating on a data cell whose schema is incompatible with the type the UDF
expects? Should we coerce or convert the data type? If yes, will that lead to
conflicting semantics with other rules? We need to answer questions like this
in order to have a robust data model (the sketch below illustrates the
ambiguity).
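
To make the UDF question concrete, here is a rough sketch (plain Scala with
made-up names, not any existing Spark API) of the choice a per-row schema
would force on every single row, whereas a fixed schema lets the analyzer
resolve the type once at plan time:

  object PerRowUdfSketch {
    sealed trait Cell
    final case class IntCell(v: Int) extends Cell
    final case class StringCell(v: String) extends Cell

    // A UDF written against Int, e.g. udf((x: Int) => x + 1) in Spark SQL.
    val plusOne: Int => Int = _ + 1

    // With a fixed schema, the analyzer resolves the argument type once per
    // query and inserts a cast (or rejects the query) at plan time. With a
    // per-row schema, every row forces a choice, and each branch below
    // implies different, currently undefined, semantics.
    def applyPerRow(cell: Cell): Option[Int] = cell match {
      case IntCell(v) =>
        Some(plusOne(v))
      case StringCell(s) if s.nonEmpty && s.forall(_.isDigit) =>
        Some(plusOne(s.toInt)) // silently coerce?
      case StringCell(_) =>
        None                   // return null? or fail the job?
    }
  }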





On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Hi Aniket,
>
> In general the schema of all rows in a single table must be the same. This is
> a basic assumption made by Spark SQL. Schema union does make sense, and
> we're planning to support this for Parquet. But as you've mentioned, it
> doesn't help if types of different versions of a column differ from each
> other. Also, you need to reload the data source table after schema changes
> happen.
>
> Cheng
>
>
> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>
>> I saw the talk on Spark data sources and looking at the interfaces, it
>> seems that the schema needs to be provided upfront. This works for many
>> data sources but I have a situation in which I would need to integrate a
>> system that supports schema evolutions by allowing users to change schema
>> without affecting existing rows. Basically, each row contains a schema hint
>> (id and version) and this allows developers to evolve schema over time and
>> perform migration at will. Since the schema needs to be specified upfront
>> in the data source API, one possible way would be to build a union of all
>> schema versions and handle populating row values appropriately. This works
>> in case columns have been added or deleted in the schema but doesn't work
>> if types have changed. I was wondering if it is possible to change the API
>> to provide schema for each row instead of expecting data source to provide
>> schema upfront?
>>
>> Thanks,
>> Aniket
>>
>>
>
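
P.S. On the union-of-all-schema-versions workaround mentioned above, a minimal
sketch of what a data source could report upfront (the unionFields helper and
the two version structs are made up for illustration, not an existing Spark
API; note it deliberately does not resolve conflicting types, which is exactly
where the union breaks down):

  import org.apache.spark.sql.types._

  object SchemaUnionSketch {
    // Hypothetical helper: union two schema versions by field name, keeping
    // the first type seen for each name. Conflicting types are left alone.
    def unionFields(a: StructType, b: StructType): StructType = {
      val extra = b.fields.filterNot(f => a.fieldNames.contains(f.name))
      StructType(a.fields ++ extra)
    }

    // Two hypothetical versions of an evolving schema.
    val v1 = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))
    val v2 = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("score", DoubleType))) // added in v2

    // The data source reports the union upfront and fills the missing
    // "score" column with nulls for rows written under v1.
    val reported: StructType = unionFields(v1, v2)
  }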
