Hi all,

While writing some Parquet files with Spark, I found that it actually only writes the files with writer version 1.
This affects the encoding types used in the file. Is this intentionally fixed for some reason? I changed the code, tested writing with writer version 2, and it looks fine.

In more detail, I found that the writer version is hard-coded in org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:

  def setSchema(schema: StructType, configuration: Configuration): Unit = {
    schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
    configuration.set(SPARK_ROW_SCHEMA, schema.json)
    configuration.set(
      ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
  }

I changed it as follows, so that a writer version already present in the given configuration is kept (falling back to version 1 when none is set):

  def setSchema(schema: StructType, configuration: Configuration): Unit = {
    schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
    configuration.set(SPARK_ROW_SCHEMA, schema.json)
    configuration.set(
      ParquetOutputFormat.WRITER_VERSION,
      configuration.get(ParquetOutputFormat.WRITER_VERSION,
        ParquetProperties.WriterVersion.PARQUET_1_0.toString))
  }

and then set the version to version 2:

  sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_2_0.toString)
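
For reference, this is roughly how I exercised the change end to end. It is only a minimal sketch assuming Spark 1.x (a SparkContext "sc" and a SQLContext "sqlContext") and the org.apache.parquet artifacts; the output path and the DataFrame contents are placeholders for illustration:

  import org.apache.parquet.column.ParquetProperties
  import org.apache.parquet.hadoop.ParquetOutputFormat

  // Request writer version 2 before writing; with the patched setSchema
  // above, this value is kept instead of being overwritten with PARQUET_1_0.
  sc.hadoopConfiguration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_2_0.toString)

  // Write a small DataFrame to Parquet (placeholder data and path).
  val df = sqlContext.range(10)
  df.write.parquet("/tmp/parquet-v2-test")

One way to confirm the setting took effect is to inspect the column encodings in the file footer (e.g. with "parquet-tools meta"): files written with writer version 2 should show the v2 encodings such as DELTA_BINARY_PACKED instead of the v1 PLAIN/RLE ones.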