Isn't this related to the data format used, e.g. Parquet, Avro, ..., which
already support schema changes?
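
For example (a rough sketch with made-up paths, not from any existing
test), Spark's Parquet source can already reconcile files written with
different schemas via the mergeSchema read option:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Two Parquet directories written with different schemas.
  Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/evolve/p1")
  Seq((2, "b", 3.0)).toDF("id", "name", "score")
    .write.parquet("/tmp/evolve/p2")

  // mergeSchema unions the two schemas; rows from the first directory
  // get null for the newly seen "score" column.
  val df = spark.read.option("mergeSchema", "true")
    .parquet("/tmp/evolve/p1", "/tmp/evolve/p2")
  df.printSchema()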

Dongjoon Hyun <dongjoon.h...@gmail.com> wrote on Fri., Jan. 12, 2018 at
02:30:

> Hi, All.
>
> A data schema can evolve in several ways, and Apache Spark 2.3 already
> supports the following for file-based data sources like
> CSV/JSON/ORC/Parquet (a sketch follows the list).
>
> 1. Add a column
> 2. Remove a column
> 3. Change a column position
> 4. Change a column type
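>
> For instance (a hypothetical sketch, not taken from the PR; it assumes
> an existing SparkSession named spark and made-up paths), the first
> three evolutions can be exercised by reading old files back with a
> user-specified final schema:
>
>   import org.apache.spark.sql.types._
>
>   // Old JSON files were written with the schema: a INT, b STRING.
>   // The final schema adds c, drops b, and moves a.
>   val finalSchema = StructType(Seq(
>     StructField("c", DoubleType),    // 1. added column -> null values
>     StructField("a", IntegerType)))  // 2. b removed; 3. position moved
>   val df = spark.read.schema(finalSchema).json("/tmp/old_json")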
>
> Can we guarantee users some schema evolution coverage on file-based data
> sources by adding explicit schema evolution test suites? So far, there
> are only some test cases.
>
> For simplicity, I make several assumptions about schema evolution (see
> the sketch after the list).
>
> 1. A safe evolution without data loss.
>     - e.g., widening from smaller types to larger ones, like int-to-long,
> not vice versa.
> 2. The final schema is given by users (or Hive).
> 3. Only simple Spark data types supported by Spark's vectorized execution.
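>
> For assumption 1, a minimal sketch of int-to-long widening (again with
> made-up paths and the same assumed SparkSession):
>
>   import org.apache.spark.sql.types._
>
>   // Files were originally written with the schema: id INT.
>   val widened = StructType(Seq(StructField("id", LongType)))
>   val df = spark.read.schema(widened).json("/tmp/old_ints")
>   // Every int value fits into the wider long type, so nothing is lost.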
>
> I made a test case PR to gather your opinions on this.
>
> [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
> data sources
> - https://github.com/apache/spark/pull/20208
>
> Could you take a look and share your opinions?
>
> Bests,
> Dongjoon.
>
