Let me share my 2 cents. First, this behavior is not covered in the official documentation. Maybe we should add it? http://spark.apache.org/docs/latest/sql-programming-guide.html
Second, nullability is a significant concept for database people. It is part of the schema. Extra code is needed to check whether a value is null for every nullable data type. Thus, it might cause a problem if you need to use Spark to transfer data between Parquet and an RDBMS. My suggestion is to introduce another external parameter. A small sketch illustrating the behavior, and one possible workaround, follows the quoted thread below.

Thanks,

Xiao Li

2015-10-20 10:20 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> For compatibility reasons, we always write data out as nullable in
> parquet. Given that that bit is only an optimization that we don't
> actually make much use of, I'm curious why you are worried that it is
> changing to true?
>
> On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> I have a dataframe with the following schema (Spark 1.5.1):
>>
>> StructType(StructField(type,StringType,true),
>> StructField(timestamp,LongType,false))
>>
>> After I save the dataframe in parquet and read it back, I get the
>> following schema:
>>
>> StructType(StructField(timestamp,LongType,true),
>> StructField(type,StringType,true))
>>
>> As you can see the schema does not match. The nullable field is set to
>> true for timestamp upon reading the dataframe back. Is there a way to
>> preserve the schema so that what we write is what we read back?
>>
>> Best Regards,
>>
>> Jerry
>>
>
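
To make the behavior concrete, here is a minimal sketch. It assumes a spark-shell session on Spark 1.5.x (so sc and sqlContext are already defined) and a hypothetical /tmp/events.parquet path: a DataFrame whose timestamp field is declared non-nullable comes back nullable after the Parquet round trip, and re-imposing the original schema with createDataFrame is one way to restore the flag at the DataFrame level.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Original schema: timestamp is declared non-nullable.
val schema = StructType(Seq(
  StructField("type", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = false)))

val rows = sc.parallelize(Seq(Row("click", 1445355600L)))
val df = sqlContext.createDataFrame(rows, schema)
df.schema    // StructField(timestamp,LongType,false)

// Round-trip through Parquet (hypothetical path).
df.write.parquet("/tmp/events.parquet")
val readBack = sqlContext.read.parquet("/tmp/events.parquet")
readBack.schema    // timestamp now comes back with nullable = true

// Possible workaround: put the columns back in the original order and
// re-impose the original schema. This only restores nullability on the
// DataFrame side; the Parquet footer still records the field as optional.
val reordered = readBack.select(schema.fieldNames.head, schema.fieldNames.tail: _*)
val restored = sqlContext.createDataFrame(reordered.rdd, schema)
restored.schema    // matches the original, including nullable = false

Note that this workaround asserts non-nullability without verifying it; if the Parquet file actually contains nulls in that column, downstream code relying on the flag could misbehave.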