[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308000#comment-17308000 ]

Xinli Shang commented on SPARK-26345:
-------------------------------------

Yes, it needs some synchronization. I have a modified version of the implementation in Presto; you can check it [here|https://github.com/shangxinli/presto/commit/f6327a161eb6cfd5137f679620e095d8257816b8#diff-bb24b92e28343804ebaf540efe6c1cda0b5e2524e6811f8fe2daee5944dad386R203].

> Parquet support Column indexes
> ------------------------------
>
>                 Key: SPARK-26345
>                 URL: https://issues.apache.org/jira/browse/SPARK-26345
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.2.0
>
> Parquet 1.11 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
>
> Benchmark result: [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting {{parquet.filter.columnindex.enabled}} to false.
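For readers who want to try the toggle mentioned above, here is a minimal sketch of disabling it from a Spark session. It assumes the standard {{spark.hadoop.*}} passthrough of Hadoop/Parquet properties and a placeholder input path; it is not an official example from this issue.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: disable Parquet column-index (page-level) filtering for this session.
// Assumes the usual spark.hadoop.* prefix is forwarded to the Parquet readers.
val spark = SparkSession.builder()
  .appName("column-index-toggle")
  .config("spark.hadoop.parquet.filter.columnindex.enabled", "false")
  .getOrCreate()

// With the flag off, reads rely on row-group level min/max statistics only;
// page-level (column index) pruning is skipped.
val df = spark.read.parquet("/path/to/parquet")
df.filter("id = 42").show()
{code}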
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248241#comment-17248241 ]

Xinli Shang commented on SPARK-26345:
-------------------------------------

The Presto and Iceberg efforts are not tied to each other; there is just some common code I can reuse. The PR in Iceberg is https://github.com/apache/iceberg/pull/1566 and the issue for Presto is https://github.com/prestodb/presto/issues/15454 (the PR is under development now).

> Parquet support Column indexes
> ------------------------------
>
>                 Key: SPARK-26345
>                 URL: https://issues.apache.org/jira/browse/SPARK-26345
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248231#comment-17248231 ]

Xinli Shang commented on SPARK-26345:
-------------------------------------

On performance, there is an engineering blog post by Zoltán Borók-Nagy and Gábor Szádovszky: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/. Once Spark is on Parquet 1.11.x, we can work on column index support for the Spark vectorized reader. Currently, I am working on integrating column indexes into Iceberg and Presto; local testing on Iceberg also looks promising.

> Parquet support Column indexes
> ------------------------------
>
>                 Key: SPARK-26345
>                 URL: https://issues.apache.org/jira/browse/SPARK-26345
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
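As a rough illustration of why the page indexes pay off (a sketch of my own, not taken from the blog or this issue): page-level min/max pruning is most effective when each file is clustered or sorted on the column that queries filter on, so the per-page value ranges stay narrow. The paths and column names below are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("page-index-sketch").getOrCreate()
import spark.implicits._

// Write the data sorted on the filter column so page-level min/max ranges are
// narrow and whole pages can be skipped at read time.
val events = spark.read.parquet("/path/events_raw")
events.sort("event_time").write.mode("overwrite").parquet("/path/events_sorted")

// With Parquet 1.11+ column indexes, a selective predicate can be answered by
// reading only the matching pages instead of entire row groups.
spark.read.parquet("/path/events_sorted")
  .filter($"event_time" >= "2020-07-01")
  .count()
{code}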
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200307#comment-17200307 ]

Xinli Shang commented on SPARK-27733:
-------------------------------------

We talked about Parquet 1.11.0 adoption in Spark in today's Parquet community sync meeting. The Parquet community would like to help if there is any way to move faster. [~csun] [~smilegator] [~dongjoon] [~iemejia] and others, are you interested in joining our next Parquet meeting to brainstorm ways to move forward?

> Upgrade to Avro 1.10.0
> ----------------------
>
>                 Key: SPARK-27733
>                 URL: https://issues.apache.org/jira/browse/SPARK-27733
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, SQL
>    Affects Versions: 3.1.0
>            Reporter: Ismaël Mejía
>            Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1 MB less), removed dependencies (no paranamer, no shaded Guava), and security updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released, and this upgrade is still not done.
> As of 2020/08 there is still a blocker: Hive-related transitive dependencies bring in older versions of Avro, so this remains blocked until HIVE-21737 is solved.
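For downstream builds that hit the same transitive-dependency problem, a hedged sbt sketch of forcing a single Avro version is below. This is illustrative only; Spark itself manages versions in its Maven build, and the artifact versions here are assumptions.

{code:scala}
// build.sbt (sketch): keep one Avro version even when Hive artifacts pull in an older one.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.0.1",
  "org.apache.spark" %% "spark-hive" % "3.0.1"
)

// dependencyOverrides pins the resolved version without excluding Avro everywhere.
dependencyOverrides += "org.apache.avro" % "avro" % "1.10.0"
{code}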
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162063#comment-17162063 ]

Xinli Shang commented on SPARK-26345:
-------------------------------------

[~yumwang] [~FelixKJose], you can assign this JIRA to me. When I have time, I can start working on it.

> Parquet support Column indexes
> ------------------------------
>
>                 Key: SPARK-26345
>                 URL: https://issues.apache.org/jira/browse/SPARK-26345
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.0
[ https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781953#comment-16781953 ]

Xinli Shang commented on SPARK-26346:
-------------------------------------

+1, [~yumwang], has any pre-testing been done on RC4 or RC3? I am doing similar work, and we can split the work if you like.

> Upgrade parquet to 1.11.0
> -------------------------
>
>                 Key: SPARK-26346
>                 URL: https://issues.apache.org/jira/browse/SPARK-26346
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
[jira] [Resolved] (SPARK-25858) Passing Field Metadata to Parquet
[ https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang resolved SPARK-25858.
---------------------------------
    Resolution: Later

It is a little early to open this issue. I will re-open it once the design of the dependent issues is settled.

> Passing Field Metadata to Parquet
> ---------------------------------
>
>                 Key: SPARK-25858
>                 URL: https://issues.apache.org/jira/browse/SPARK-25858
>             Project: Spark
>          Issue Type: New Feature
>          Components: Input/Output
>    Affects Versions: 2.3.2
>            Reporter: Xinli Shang
>            Priority: Major
>
> h1. Problem Statement
> The Spark WriteSupport class for Parquet is hardcoded to use org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport, which is not configurable. Currently, this class doesn't carry over the field metadata in StructType to MessageType. However, Parquet column encryption (PARQUET-1396, PARQUET-1178) requires the field metadata inside Parquet's MessageType, so that the metadata can be used to control column encryption.
> h1. Technical Solution
> # Extend the SparkToParquetSchemaConverter class and override the convert() method to carry over the field metadata.
> # Extend ParquetWriteSupport and use the extended converter from #1. The extension avoids changing the built-in WriteSupport, to mitigate risk.
> # Change Spark to make the WriteSupport class configurable, so users can opt into the extended WriteSupport from #2. The default WriteSupport is still org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.
> h1. Technical Details
> {{Note: the formatting below is approximate; the repository linked under Verification shows the exact code.}}
> h2. Extend SparkToParquetSchemaConverter class
> {code:scala}
> class SparkToParquetMetadataSchemaConverter(configuration: Configuration)
>     extends SparkToParquetSchemaConverter(configuration) {
>
>   override def convert(catalystSchema: StructType): MessageType = {
>     Types.buildMessage()
>       .addFields(catalystSchema.map(convertFieldWithMetadata): _*)
>       .named(ParquetSchemaConverter.SPARK_PARQUET_SCHEMA_NAME)
>   }
>
>   private def convertFieldWithMetadata(field: StructField): Type = {
>     // Wrap the converted Parquet type so it can carry the Spark field metadata.
>     val extField = new ExtType[Any](convertField(field))
>     val metaBuilder = new MetadataBuilder().withMetadata(field.metadata)
>     val metaData = metaBuilder.getMap
>     extField.setMetadata(metaData)
>     extField
>   }
> }
> {code}
> h2. Extend ParquetWriteSupport
> {code:scala}
> class CryptoParquetWriteSupport extends ParquetWriteSupport {
>   override def init(configuration: Configuration): WriteContext = {
>     val converter = new SparkToParquetMetadataSchemaConverter(configuration)
>     createContext(configuration, converter)
>   }
> }
> {code}
> h2. Make WriteSupport configurable
> {code:scala}
> class ParquetFileFormat {
>   override def prepareWrite(...) = {
>     ...
>     // Only install the default WriteSupport when the user has not configured one.
>     if (conf.get(ParquetOutputFormat.WRITE_SUPPORT_CLASS) == null) {
>       ParquetOutputFormat.setWriteSupportClass(job, classOf[ParquetWriteSupport])
>     }
>     ...
>   }
> }
> {code}
> h1. Verification
> The [ParquetHelloWorld.java|https://github.com/shangxinli/parquet-writesupport-extensions/blob/master/src/main/java/com/uber/ParquetHelloWorld.java] in the GitHub repository [parquet-writesupport-extensions|https://github.com/shangxinli/parquet-writesupport-extensions] has a sample verification of passing down the field metadata and performing column encryption.
> h1. Dependency
> * PARQUET-1178
> * PARQUET-1396
> * PARQUET-1397
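A hedged sketch of how a job might opt into the extended WriteSupport once step 3 above lands: "parquet.write.support.class" is the parquet-hadoop property behind ParquetOutputFormat.WRITE_SUPPORT_CLASS, but the class's package name below is a placeholder, and the setting only takes effect if Spark stops overwriting the property unconditionally, which is exactly what this issue proposes.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("custom-write-support").getOrCreate()

// Point Parquet at the metadata-aware WriteSupport (package name is assumed here).
spark.sparkContext.hadoopConfiguration.set(
  "parquet.write.support.class",
  "com.example.parquet.CryptoParquetWriteSupport")

// Subsequent Parquet writes would then go through the custom WriteSupport.
spark.range(10).toDF("id").write.mode("overwrite").parquet("/tmp/writesupport_demo")
{code}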
[jira] [Updated] (SPARK-25858) Passing Field Metadata to Parquet
[ https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang updated SPARK-25858:
--------------------------------
[jira] [Updated] (SPARK-25858) Passing Field Metadata to Parquet
[ https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang updated SPARK-25858:
--------------------------------
[jira] [Updated] (SPARK-25858) Passing Field Metadata to Parquet
[ https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang updated SPARK-25858:
--------------------------------
[jira] [Created] (SPARK-25858) Passing Field Metadata to Parquet
Xinli Shang created SPARK-25858:
-----------------------------------

             Summary: Passing Field Metadata to Parquet
                 Key: SPARK-25858
                 URL: https://issues.apache.org/jira/browse/SPARK-25858
             Project: Spark
          Issue Type: New Feature
          Components: Input/Output
    Affects Versions: 2.3.2
            Reporter: Xinli Shang


h1. Problem Statement

The Spark WriteSupport class for Parquet is hardcoded to use org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport, which is not configurable. Currently, this class doesn't carry over the field metadata in StructType to MessageType. However, Parquet column encryption (PARQUET-1396, PARQUET-1178) requires the field metadata inside Parquet's MessageType, so that the metadata can be used to control column encryption.

h1. Technical Solution

# Extend the SparkToParquetSchemaConverter class and override the convert() method to carry over the field metadata.
# Extend ParquetWriteSupport and use the extended converter from #1. The extension avoids changing the built-in WriteSupport, to mitigate risk.
# Change Spark to make the WriteSupport class configurable, so users can opt into the extended WriteSupport from #2. The default WriteSupport is still org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.

h1. Technical Details

h2. Extend SparkToParquetSchemaConverter class

{code:scala}
class SparkToParquetMetadataSchemaConverter(configuration: Configuration)
    extends SparkToParquetSchemaConverter(configuration) {

  override def convert(catalystSchema: StructType): MessageType = {
    Types.buildMessage()
      .addFields(catalystSchema.map(convertFieldWithMetadata): _*)
      .named(ParquetSchemaConverter.SPARK_PARQUET_SCHEMA_NAME)
  }

  private def convertFieldWithMetadata(field: StructField): Type = {
    // Wrap the converted Parquet type so it can carry the Spark field metadata.
    val extField = new ExtType[Any](convertField(field))
    val metaBuilder = new MetadataBuilder().withMetadata(field.metadata)
    val metaData = metaBuilder.getMap
    extField.setMetadata(metaData)
    extField
  }
}
{code}

h2. Extend ParquetWriteSupport

{code:scala}
class CryptoParquetWriteSupport extends ParquetWriteSupport {
  override def init(configuration: Configuration): WriteContext = {
    val converter = new SparkToParquetMetadataSchemaConverter(configuration)
    createContext(configuration, converter)
  }
}
{code}

h2. Make WriteSupport configurable

{code:scala}
class ParquetFileFormat {
  override def prepareWrite(...) = {
    ...
    // Only install the default WriteSupport when the user has not configured one.
    if (conf.get(ParquetOutputFormat.WRITE_SUPPORT_CLASS) == null) {
      ParquetOutputFormat.setWriteSupportClass(job, classOf[ParquetWriteSupport])
    }
    ...
  }
}
{code}

h1. Verification

The [ParquetHelloWorld.java|https://github.com/shangxinli/parquet-writesupport-extensions/blob/master/src/main/java/com/uber/ParquetHelloWorld.java] in the GitHub repository [parquet-writesupport-extensions|https://github.com/shangxinli/parquet-writesupport-extensions] has a sample verification of passing down the field metadata and performing column encryption.

h1. Dependency

* PARQUET-1178
* PARQUET-1396
* PARQUET-1397
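To make the data flow concrete, here is a small hedged sketch of attaching field metadata on the Spark side that the extended converter would carry into the Parquet schema. The metadata key used here ("encrypt.key.id") is purely illustrative and not an API defined by this issue or by Parquet.

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("field-metadata-sketch").getOrCreate()

// Mark one column with metadata (key name is an assumption for illustration);
// a crypto-aware WriteSupport could read this back from the converted MessageType.
val ssnMeta = new MetadataBuilder()
  .putString("encrypt.key.id", "pii-key-1")
  .build()

val schema = StructType(Seq(
  StructField("user_id", LongType, nullable = false),
  StructField("ssn", StringType, nullable = true, ssnMeta)
))

val rows = java.util.Arrays.asList(Row(1L, "123-45-6789"), Row(2L, "987-65-4321"))
val df = spark.createDataFrame(rows, schema)

// With the extended WriteSupport configured, the "encrypt.key.id" entry would travel
// with the field into the Parquet MessageType during the write below.
df.write.mode("overwrite").parquet("/tmp/metadata_out")
{code}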