Hi all,

While writing some Parquet files with Spark, I found that it actually only
writes the Parquet files with writer version 1.

The writer version determines which encoding types are used in the files.

Is this intentionally fixed to version 1 for some reason?
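
For reference, this is roughly how I checked which encodings ended up in the
written files (just a sketch using the parquet-mr footer API; the file path is
an example from my local test):

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read the footer of one part file and print the encodings used per column chunk.
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("/tmp/test.parquet/part-r-00000.gz.parquet"))

println(footer.getFileMetaData.getCreatedBy)
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { column =>
    println(s"${column.getPath}: ${column.getEncodings}")
  }
}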


I changed the code and tested writing with writer version 2, and it looks
fine.

In more detail, I found that the writer version is hard-coded in
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_1_0.toString)
}


I changed it to the following in order to respect the writer version given in the configuration:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    configuration.get(ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
  )
}


and then set the version to version 2:

sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_2_0.toString)
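
For completeness, this is a rough end-to-end sketch of how I tested it (the
DataFrame and the output path are just examples from my local setup):

import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetOutputFormat

// Ask Parquet for writer version 2 before writing.
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

// Write a small DataFrame, then check the encodings in the resulting files as above.
val df = sqlContext.range(0, 1000).toDF("id")
df.write.mode("overwrite").parquet("/tmp/test.parquet")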

