As an academic aside, note that all datatypes are nullable according to the
SQL Standard. NOT NULL is modelled in the Standard as a constraint on data
values, not as a parallel universe of special data types. However, very few
databases implement NOT NULL via integrity constraints. Instead, almost all
relational database type systems model NOT NULL as an extra bit of metadata
alongside precision, scale, and length.
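Spark SQL follows the same pattern: nullability is a per-field flag on
StructField, sitting next to the data type. A quick Scala illustration
(field names here are arbitrary):

    import org.apache.spark.sql.types._

    // Nullability travels with the field definition, like precision or
    // length, rather than as a separate integrity constraint.
    val schema = StructType(Seq(
      StructField("id",   LongType,   nullable = false),
      StructField("name", StringType, nullable = true)
    ))

    schema.printTreeString()
    // root
    //  |-- id: long (nullable = false)
    //  |-- name: string (nullable = true)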

Thanks,
Rick Hillegas


Xiao Li <gatorsm...@gmail.com> wrote on 10/20/2015 01:17:43 PM:

> From: Xiao Li <gatorsm...@gmail.com>
> To: Michael Armbrust <mich...@databricks.com>
> Cc: Jerry Lam <chiling...@gmail.com>, "user@spark.apache.org"
> <user@spark.apache.org>
> Date: 10/20/2015 01:18 PM
> Subject: Re: Spark SQL: Preserving Dataframe Schema
>
> Sure. Will try to do a pull request this week.
>
> Schema evolution is always painful for database people. IMO, NULL was
> a bad design decision in the original System R, and it introduces a lot
> of problems during system migration and data integration.
>
> Let me describe a possible scenario: an RDBMS is used as an ODS, and
> Spark is used as an external online data analysis engine. The results
> are stored in Parquet files and inserted back into the RDBMS at regular
> intervals. In this case, we face a few options:
>
> - Change the column data types in the RDBMS tables to allow nullable
> values; the logic of the RDBMS applications that consume these results
> must then also handle NULL. When the applications are third-party,
> changing them becomes harder.
>
> - As you suggested, before loading the data from the Parquet files, add
> an extra step that does data cleaning, value transformation, or
> exception reporting whenever a NULL is found (a sketch of this step is
> below).
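>
> A minimal Scala sketch of that extra step; the paths, the "metric"
> column, and the JDBC target are made up for illustration:
>
>     import org.apache.spark.sql.functions.col
>
>     // Read the analysis results previously written by Spark.
>     val results = sqlContext.read.parquet("/path/to/results")
>
>     // Exception reporting: quarantine rows with unexpected NULLs.
>     val bad = results.filter(col("metric").isNull)
>     if (bad.count() > 0) {
>       bad.write.mode("overwrite").parquet("/path/to/quarantine")
>     }
>
>     // Data cleaning: keep only complete rows, then load them into the RDBMS.
>     results.filter(col("metric").isNotNull)
>       .write
>       .mode("append")
>       .jdbc("jdbc:postgresql://host/ods", "results", new java.util.Properties())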
>
> If we had such an external parameter, then when writing the schema to an
> external data store, Spark would do its best to keep the original schema
> unchanged (e.g., preserve the initial definition of nullability). If
> some data type/schema conversions were unavoidable, it would issue
> warnings or errors to the user. Does that make sense?
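>
> In the meantime, one workaround is to reimpose the intended nullability
> by hand after reading the Parquet files back; a rough sketch, where
> originalSchema stands in for the schema as defined in the RDBMS:
>
>     import org.apache.spark.sql.types._
>
>     // originalSchema: the schema as defined in the RDBMS (example only).
>     val originalSchema = StructType(Seq(
>       StructField("id",     LongType,   nullable = false),
>       StructField("metric", DoubleType, nullable = true)
>     ))
>
>     // Re-create the DataFrame against that schema so the nullable flags
>     // match the RDBMS definition rather than what Parquet reports.
>     val fromParquet = sqlContext.read.parquet("/path/to/results")
>     val restored = sqlContext.createDataFrame(fromParquet.rdd, originalSchema)
>     restored.printSchema()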
>
> Thanks,
>
> Xiao Li
>
> 2015-10-20 12:38 GMT-07:00 Michael Armbrust <mich...@databricks.com>:
>> First, this is not documented in the official documentation. Maybe we
>> should add it?
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Pull requests welcome.
>
>> Second, nullability is a significant concept for database people: it is
>> part of the schema, and extra code is needed to check whether a value
>> is null for every nullable data type. Thus, it might cause a problem if
>> you need to use Spark to transfer data between Parquet and an RDBMS. My
>> suggestion is to introduce another external parameter.
>
> Sure, but a traditional RDBMS has the opportunity to do validation
> before loading data in. That's not really an option when you are
> reading random files from S3.  This is why Hive and many other
> systems in this space treat all columns as nullable.
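>
> For illustration, a small round-trip shows this behavior: a column
> declared non-nullable comes back as nullable after being written to and
> read from Parquet (the path here is arbitrary):
>
>     import org.apache.spark.sql.Row
>     import org.apache.spark.sql.types._
>
>     val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
>     val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1L), Row(2L))), schema)
>
>     df.printSchema()      // id: long (nullable = false)
>     df.write.mode("overwrite").parquet("/tmp/roundtrip")
>     sqlContext.read.parquet("/tmp/roundtrip").printSchema()
>                           // id: long (nullable = true)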
>
> What would the semantics of this proposed external parameter be?
