Let me share my 2 cents.

First, this behavior is not documented in the official documentation. Maybe we
should document it? http://spark.apache.org/docs/latest/sql-programming-guide.html

Second, nullability is a significant concept to database people. It is
part of the schema. Extra code is needed to check whether a value is null
for every nullable data type. Thus, it might cause a problem if you need
to use Spark to transfer data between Parquet and an RDBMS. My suggestion
would be to introduce another external parameter to control this?
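
In the meantime, one workaround is to rebuild the DataFrame against the
original schema after reading the Parquet files back. A minimal sketch,
assuming the 1.5-era SQLContext API; the path and the df variable are
only placeholders, and note this merely re-labels the schema, it does not
verify that the column really contains no nulls:

    import org.apache.spark.sql.types._

    // The schema we want to enforce, including the non-nullable timestamp.
    val originalSchema = StructType(Seq(
      StructField("type", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = false)))

    // df is the DataFrame from the question; the path is a placeholder.
    df.write.parquet("/tmp/events.parquet")

    // Spark reads Parquet back with every field marked nullable, so rebuild
    // the DataFrame from the row RDD using the schema we actually want.
    val readBack = sqlContext.read.parquet("/tmp/events.parquet")
      .select("type", "timestamp")   // restore the original column order
    val restored = sqlContext.createDataFrame(readBack.rdd, originalSchema)

    restored.printSchema()  // timestamp is reported as nullable = false again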

Thanks,

Xiao Li


2015-10-20 10:20 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> For compatibility reasons, we always write data out as nullable in
> parquet.  Given that that bit is only an optimization that we don't
> actually make much use of, I'm curious why you are worried that it's
> changing to true?
>
> On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> I have a dataframe with the following schema (Spark 1.5.1):
>>
>> StructType(StructField(type,StringType,true),
>> StructField(timestamp,LongType,false))
>>
>> After I save the dataframe in parquet and read it back, I get the
>> following schema:
>>
>> StructType(StructField(timestamp,LongType,true),
>> StructField(type,StringType,true))
>>
>> As you can see, the schemas do not match. The nullable flag is set to
>> true for timestamp upon reading the dataframe back. Is there a way to
>> preserve the schema so that what we write out is what we read back?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>