Hi all,

While writing some Parquet files with Spark, I found that it actually only
writes the Parquet files with writer version 1.

The writer version determines which encoding types are used in the files.

Is this intentionally fixed to version 1 for some reason?
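
For reference, this is roughly how I checked which encodings ended up in the
written files (just a sketch using the parquet-mr footer API; the file path is
an example from my local test):

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read the footer of one part file and print the encodings used per column chunk.
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("/tmp/test.parquet/part-r-00000.gz.parquet"))

println(footer.getFileMetaData.getCreatedBy)
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { column =>
    println(s"${column.getPath}: ${column.getEncodings}")
  }
}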


I changed the code and tested writing with writer version 2, and it looks
fine.

In more detail, I found that the writer version is hard-coded in
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_1_0.toString)
}


I changed it to the following in order to respect the writer version given in the configuration:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    configuration.get(ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
  )
}


and then set the version to version 2:

sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_2_0.toString)
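
For completeness, this is a rough end-to-end sketch of how I tested it (the
DataFrame and the output path are just examples from my local setup):

import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetOutputFormat

// Ask Parquet for writer version 2 before writing.
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

// Write a small DataFrame, then check the encodings in the resulting files as above.
val df = sqlContext.range(0, 1000).toDF("id")
df.write.mode("overwrite").parquet("/tmp/test.parquet")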

