Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
For compatibility reasons, we always write data out as nullable in
Parquet.  Given that this bit is only an optimization that we don't
actually make much use of, I'm curious why you are worried that it's
changing to true?
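
If you need the original nullability (and column order) back after
reading, one rough workaround is to re-apply the schema you wrote with.
This is only a sketch, not an official knob: it re-labels the schema
rather than validating the data, and the path below is a placeholder.

  import org.apache.spark.sql.functions.col

  // `originalSchema` stands for the StructType the dataframe was written
  // with. Re-select in the original column order, then re-apply the
  // schema so the original nullable flags come back.
  val loaded = sqlContext.read.parquet("/tmp/events.parquet")
    .select(originalSchema.fieldNames.map(col): _*)
  val restored = sqlContext.createDataFrame(loaded.rdd, originalSchema)
  restored.printSchema()  // timestamp: long (nullable = false)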

On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam  wrote:

> Hi Spark users and developers,
>
> I have a dataframe with the following schema (Spark 1.5.1):
>
> StructType(StructField(type,StringType,true),
> StructField(timestamp,LongType,false))
>
> After I save the dataframe in parquet and read it back, I get the
> following schema:
>
> StructType(StructField(timestamp,LongType,true),
> StructField(type,StringType,true))
>
> As you can see, the schemas do not match: the nullable flag is set to
> true for timestamp when the dataframe is read back. Is there a way to
> preserve the schema so that what we write is what we read back?
>
> Best Regards,
>
> Jerry
>


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Let me share my 2 cents.

First, this behavior is not covered in the official documentation. Maybe we
should document it? http://spark.apache.org/docs/latest/sql-programming-guide.html

Second, nullability is a significant concept for database people. It is
part of the schema, and extra code is needed to check whether a value is
null for every nullable data type. Thus, it might cause a problem if you
need to use Spark to transfer data between Parquet and an RDBMS. My
suggestion would be to introduce another external parameter.

Thanks,

Xiao Li


2015-10-20 10:20 GMT-07:00 Michael Armbrust :

> For compatibility reasons, we always write data out as nullable in
> Parquet.  Given that this bit is only an optimization that we don't
> actually make much use of, I'm curious why you are worried that it's
> changing to true?
>
> On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam  wrote:
>
>> Hi Spark users and developers,
>>
>> I have a dataframe with the following schema (Spark 1.5.1):
>>
>> StructType(StructField(type,StringType,true),
>> StructField(timestamp,LongType,false))
>>
>> After I save the dataframe in parquet and read it back, I get the
>> following schema:
>>
>> StructType(StructField(timestamp,LongType,true),
>> StructField(type,StringType,true))
>>
>> As you can see, the schemas do not match: the nullable flag is set to
>> true for timestamp when the dataframe is read back. Is there a way to
>> preserve the schema so that what we write is what we read back?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers,

I have a dataframe with the following schema (Spark 1.5.1):

StructType(StructField(type,StringType,true),
StructField(timestamp,LongType,false))

After I save the dataframe in parquet and read it back, I get the following
schema:

StructType(StructField(timestamp,LongType,true),
StructField(type,StringType,true))

As you can see, the schemas do not match: the nullable flag is set to true
for timestamp when the dataframe is read back. Is there a way to preserve
the schema so that what we write is what we read back?
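
For reference, here is a minimal sketch of how I produce this (the path and
the sample row are just placeholders):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  // Declare timestamp as non-nullable in the schema.
  val schema = StructType(Seq(
    StructField("type", StringType, nullable = true),
    StructField("timestamp", LongType, nullable = false)))
  val rows = sc.parallelize(Seq(Row("click", 1445353440L)))
  val df = sqlContext.createDataFrame(rows, schema)
  df.printSchema()  // timestamp: long (nullable = false)

  // Round-trip through Parquet: nullable comes back as true.
  df.write.parquet("/tmp/events.parquet")
  sqlContext.read.parquet("/tmp/events.parquet").printSchema()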

Best Regards,

Jerry


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
>
> First, this behavior is not covered in the official documentation. Maybe
> we should document it?
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>

Pull requests welcome.


> Second, nullability is a significant concept for database people. It is
> part of the schema, and extra code is needed to check whether a value is
> null for every nullable data type. Thus, it might cause a problem if you
> need to use Spark to transfer data between Parquet and an RDBMS. My
> suggestion would be to introduce another external parameter.
>

Sure, but a traditional RDBMS has the opportunity to do validation before
loading data in.  That's not really an option when you are reading random
files from S3.  This is why Hive and many other systems in this space treat
all columns as nullable.
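
To make that concrete, with Spark the validation has to happen before the
write. A rough sketch (the JDBC URL, table name, and fail-fast policy here
are purely illustrative):

  import java.util.Properties

  // Reject rows that would violate a NOT NULL column in the target table
  // before handing the data to the RDBMS over JDBC.
  val df = sqlContext.read.parquet("/tmp/events.parquet")
  val badRows = df.filter(df("timestamp").isNull).count()
  if (badRows > 0) {
    sys.error(s"$badRows rows have a NULL timestamp; refusing to load")
  }
  df.write.jdbc("jdbc:postgresql://host/db", "events", new Properties())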

What would the semantics of this proposed external parameter be?


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Sure. Will try to do a pull request this week.

Schema evolution is always painful for database people. IMO, NULL was a bad
design in the original System R. It introduces a lot of problems during
system migration and data integration.

Consider a possible scenario: an RDBMS is used as an ODS, and Spark is used
as an external online data analysis engine. The results are stored in
Parquet files and inserted back into the RDBMS at regular intervals. In this
case, we face a few options:

- Change the column definitions in the RDBMS tables to allow nullable
values; the logic of the RDBMS applications that consume these results must
then also handle NULL. When the applications are third-party, changing them
becomes harder.

- As you suggested, before loading the data from the Parquet files, add an
extra step that does data cleaning, value transformation, or exception
reporting whenever a NULL is found.

With such an external parameter, when writing data and its schema to an
external data store, Spark would do its best to keep the original schema
unchanged (e.g., keep the initial definition of nullability). If some data
type/schema conversions are unavoidable, it would issue warnings or errors
to the users. Does that make sense?
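
Until something like that exists, a user-side approximation is possible.
The helper below is purely hypothetical (not a proposed API); it simply
fails loudly when the schema read back no longer matches the schema that
was written, field by field:

  import org.apache.spark.sql.types.StructType

  // Hypothetical helper: compare the schema we wrote with the schema we
  // read back (matched by field name, since field order may change) and
  // report any difference in type or nullability.
  def checkRoundTrip(original: StructType, readBack: StructType): Unit = {
    val readByName = readBack.fields.map(f => f.name -> f).toMap
    val diffs = original.fields.flatMap { o =>
      readByName.get(o.name) match {
        case Some(r) if r == o => None
        case r => Some(s"${o.name}: wrote $o, read back ${r.getOrElse("<missing>")}")
      }
    }
    if (diffs.nonEmpty) {
      sys.error("Schema changed on round trip:\n" + diffs.mkString("\n"))
    }
  }

It could be called, for example, as checkRoundTrip(df.schema,
sqlContext.read.parquet(path).schema) right after the write.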

Thanks,

Xiao Li


2015-10-20 12:38 GMT-07:00 Michael Armbrust :

>> First, this behavior is not covered in the official documentation. Maybe
>> we should document it? http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>
> Pull requests welcome.
>
>
>> Second, nullability is a significant concept for database people. It is
>> part of the schema, and extra code is needed to check whether a value is
>> null for every nullable data type. Thus, it might cause a problem if you
>> need to use Spark to transfer data between Parquet and an RDBMS. My
>> suggestion would be to introduce another external parameter.
>>
>
> Sure, but a traditional RDBMS has the opportunity to do validation before
> loading data in.  That's not really an option when you are reading random
> files from S3.  This is why Hive and many other systems in this space treat
> all columns as nullable.
>
> What would the semantics of this proposed external parameter be?
>


Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Richard Hillegas

As an academic aside, note that all datatypes are nullable according to the
SQL Standard. NOT NULL is modelled in the Standard as a constraint on data
values, not as a parallel universe of special data types. However, very few
databases implement NOT NULL via integrity constraints. Instead, almost all
relational database type systems model NOT NULL as an extra bit of metadata
alongside precision, scale, and length.
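
Spark SQL takes the same approach as those type systems: nullability is
just a boolean flag on each StructField, sitting next to the field name and
data type, and (as Michael noted above) it is treated as an optimization
hint rather than an enforced constraint. A minimal illustration:

  import org.apache.spark.sql.types.{LongType, StructField}

  // Nullability lives in the field metadata, not in an integrity constraint.
  val f = StructField("timestamp", LongType, nullable = false)
  println(f.nullable)  // false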

Thanks,
Rick Hillegas


Xiao Li <gatorsm...@gmail.com> wrote on 10/20/2015 01:17:43 PM:

> From: Xiao Li <gatorsm...@gmail.com>
> To: Michael Armbrust <mich...@databricks.com>
> Cc: Jerry Lam <chiling...@gmail.com>, "user@spark.apache.org"
> <user@spark.apache.org>
> Date: 10/20/2015 01:18 PM
> Subject: Re: Spark SQL: Preserving Dataframe Schema
>
> Sure. Will try to do a pull request this week.
>
> Schema evolution is always painful for database people. IMO, NULL was
> a bad design in the original System R. It introduces a lot of
> problems during system migration and data integration.
>
> Consider a possible scenario: an RDBMS is used as an ODS, and Spark is
> used as an external online data analysis engine. The results are
> stored in Parquet files and inserted back into the RDBMS at regular
> intervals. In this case, we face a few options:
>
> - Change the column definitions in the RDBMS tables to allow nullable
> values; the logic of the RDBMS applications that consume these
> results must then also handle NULL. When the applications are
> third-party, changing them becomes harder.
>
> - As you suggested, before loading the data from the Parquet files,
> add an extra step that does data cleaning, value transformation, or
> exception reporting whenever a NULL is found.
>
> With such an external parameter, when writing data and its schema to
> an external data store, Spark would do its best to keep the original
> schema unchanged (e.g., keep the initial definition of nullability).
> If some data type/schema conversions are unavoidable, it would issue
> warnings or errors to the users. Does that make sense?
>
> Thanks,
>
> Xiao Li
>
> 2015-10-20 12:38 GMT-07:00 Michael Armbrust <mich...@databricks.com>:
> First, this behavior is not covered in the official documentation. Maybe
> we should document it?
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Pull requests welcome.
>
> Second, nullability is a significant concept for database people. It is
> part of the schema, and extra code is needed to check whether a value is
> null for every nullable data type. Thus, it might cause a problem if you
> need to use Spark to transfer data between Parquet and an RDBMS. My
> suggestion would be to introduce another external parameter.
>
> Sure, but a traditional RDBMS has the opportunity to do validation
> before loading data in.  That's not really an option when you are
> reading random files from S3.  This is why Hive and many other
> systems in this space treat all columns as nullable.
>
> What would the semantics of this proposed external parameter be?