[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2021-03-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308000#comment-17308000
 ] 

Xinli Shang commented on SPARK-26345:
-

Yes, it needs some synchronization. I have a modified implementation 
in Presto; you can check it 
[here|https://github.com/shangxinli/presto/commit/f6327a161eb6cfd5137f679620e095d8257816b8#diff-bb24b92e28343804ebaf540efe6c1cda0b5e2524e6811f8fe2daee5944dad386R203].
 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Parquet 1.11 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201
>  
> Benchmark result:
> [https://github.com/apache/spark/pull/31393#issuecomment-769767724]
> This feature is enabled by default, and users can disable it by setting 
> {{parquet.filter.columnindex.enabled}} to false.
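> A minimal sketch of turning the feature off for a read, assuming the property 
> is picked up from the Hadoop configuration used by Spark's Parquet reader (the 
> input path is illustrative only):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("column-index-example").getOrCreate()
> // Disable Parquet column-index filtering for subsequent Parquet reads.
> spark.sparkContext.hadoopConfiguration
>   .setBoolean("parquet.filter.columnindex.enabled", false)
> val df = spark.read.parquet("/path/to/parquet/table")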






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248241#comment-17248241
 ] 

Xinli Shang commented on SPARK-26345:
-

The Presto and Iceberg efforts are not tied to each other; there is just some 
common code I can reuse. The PR in Iceberg is 
https://github.com/apache/iceberg/pull/1566 and the issue for Presto is 
https://github.com/prestodb/presto/issues/15454 (the PR is under development now). 


> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-12-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248231#comment-17248231
 ] 

Xinli Shang commented on SPARK-26345:
-

Regarding performance, there is an engineering blog post written by Zoltán 
Borók-Nagy and Gábor Szádovszky. Here is the link: 
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/.
 

Once Spark is on Parquet 1.11.x, we can work on Column Index support for the 
Spark vectorized reader. Currently, I am working on integrating Column Index 
into Iceberg and Presto. Local testing on Iceberg also looks promising. 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-09-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200307#comment-17200307
 ] 

Xinli Shang commented on SPARK-27733:
-

We talked about adopting Parquet 1.11.0 in Spark at today's Parquet 
community sync meeting. The Parquet community would like to help if there is 
any way to move faster. [~csun] [~smilegator] [~dongjoon] [~iemejia] and others, 
are you interested in joining our next Parquet meeting to brainstorm how to 
move forward? 

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice improvements, including reduced size 
> (1 MB less), removed dependencies (no paranamer, no shaded Guava), and 
> security updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this upgrade is still not done.
> At the moment (2020/08) there is still a blocker: Hive-related transitive 
> dependencies bring in older versions of Avro, so this remains blocked until 
> HIVE-21737 is resolved.






[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162063#comment-17162063
 ] 

Xinli Shang commented on SPARK-26345:
-

[~yumwang] [~FelixKJose], you can assign this Jira to me. When I have time, I 
can start working on it. 

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201






[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.0

2019-03-01 Thread Xinli Shang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781953#comment-16781953
 ] 

Xinli Shang commented on SPARK-26346:
-

+1, [~yumwang]. Has any pre-testing been done on RC4 or RC3? I am doing 
similar work, and we can split it if you like. 

> Upgrade parquet to 1.11.0
> -
>
> Key: SPARK-26346
> URL: https://issues.apache.org/jira/browse/SPARK-26346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-25858) Passing Field Metadata to Parquet

2018-10-27 Thread Xinli Shang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved SPARK-25858.
-
Resolution: Later

It is a little early to open this issue. I will re-open it after the design of 
the dependency issues is settled. 

> Passing Field Metadata to Parquet
> -
>
> Key: SPARK-25858
> URL: https://issues.apache.org/jira/browse/SPARK-25858
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Xinli Shang
>Priority: Major
>
> h1. Problem Statement
> The Spark WriteSupport class for Parquet is hardcoded to use 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport, which 
> is not configurable. Currently, this class doesn’t carry over the field 
> metadata in StructType to MessageType. However, Parquet column encryption 
> (Parquet-1396, Parquet-1178) requires the field metadata inside MessageType 
> of Parquet, so that the metadata can be used to control column encryption.
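> As a small illustration of the field metadata in question, this is a sketch of 
> attaching metadata to a StructField through Spark's public API; the "encrypt" 
> key is a hypothetical example, not a key defined by this proposal:
>
> import org.apache.spark.sql.types._
>
> // Field-level metadata that the proposal wants carried from StructType
> // into Parquet's MessageType.
> val meta = new MetadataBuilder().putBoolean("encrypt", true).build()
> val schema = StructType(Seq(
>   StructField("ssn", StringType, nullable = false, metadata = meta),
>   StructField("name", StringType)))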
> h1. Technical Solution
>  # Extend SparkToParquetSchemaConverter class and override convert() method 
> to add the functionality of carrying over the field metadata
>  # Extend ParquetWriteSupport and use the extended converter in #1. The 
> extension avoids changing the built-in WriteSupport to mitigate the risk.
>  # Change Spark code to make the WriteSupport class configurable, so that the 
> user can opt in to the extended WriteSupport from #2. The default 
> WriteSupport remains 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.
> h1. Technical Details
> {{Note: See the repository linked under Verification for the complete, 
> correctly formatted version of the code below.}}
> h2. Extend SparkToParquetSchemaConverter class
> class SparkToParquetMetadataSchemaConverter extends SparkToParquetSchemaConverter {
>
>   override def convert(catalystSchema: StructType): MessageType = {
>     Types
>       .buildMessage()
>       .addFields(catalystSchema.map(convertFieldWithMetadata): _*)
>       .named(ParquetSchemaConverter.SPARK_PARQUET_SCHEMA_NAME)
>   }
>
>   private def convertFieldWithMetadata(field: StructField): Type = {
>     val extField = new ExtType[Any](convertField(field))
>     val metaBuilder = new MetadataBuilder().withMetadata(field.metadata)
>     val metaData = metaBuilder.getMap
>     extField.setMetadata(metaData)
>     extField
>   }
> }
> h2. Extend ParquetWriteSupport
> class CryptoParquetWriteSupport extends ParquetWriteSupport {
>
>   override def init(configuration: Configuration): WriteContext = {
>     val converter = new SparkToParquetMetadataSchemaConverter(configuration)
>     createContext(configuration, converter)
>   }
> }
> h2. Make WriteSupport configurable
> class ParquetFileFormat {
>
>   override def prepareWrite(...) {
>     ...
>     if (conf.get(ParquetOutputFormat.WRITE_SUPPORT_CLASS) == null) {
>       ParquetOutputFormat.setWriteSupportClass(job, classOf[ParquetWriteSupport])
>     }
>     ...
>   }
> }
> h1. Verification
> The 
> [ParquetHelloWorld.java|https://github.com/shangxinli/parquet-writesupport-extensions/blob/master/src/main/java/com/uber/ParquetHelloWorld.java]
>  in the GitHub repository 
> [parquet-writesupport-extensions|https://github.com/shangxinli/parquet-writesupport-extensions]
>  has a sample verification of passing down the field metadata and performing 
> column encryption.
> h1. Dependency
>  * Parquet-1178
>  * Parquet-1396
>  * Parquet-1397






[jira] [Created] (SPARK-25858) Passing Field Metadata to Parquet

2018-10-26 Thread Xinli Shang (JIRA)
Xinli Shang created SPARK-25858:
---

 Summary: Passing Field Metadata to Parquet
 Key: SPARK-25858
 URL: https://issues.apache.org/jira/browse/SPARK-25858
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output
Affects Versions: 2.3.2
Reporter: Xinli Shang


h1. Problem Statement

The Spark WriteSupport class for Parquet is hardcoded to use 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport, which 
is not configurable. Currently, this class doesn’t carry over the field 
metadata in StructType to MessageType. However, Parquet column encryption 
(Parquet-1396, Parquet-1178) requires the field metadata inside MessageType of 
Parquet, so that the metadata can be used to control column encryption. 
h1. Technical Solution 
 # Extend SparkToParquetSchemaConverter class and override convert() method to 
add the functionality of carrying over the field metadata
 # Extend ParquetWriteSupport and use the extended converter in #1. The 
extension avoids changing the built-in WriteSupport to mitigate the risk.
 # Change Spark code to make the WriteSupport class configurable, so that the 
user can opt in to the extended WriteSupport from #2. The default 
WriteSupport remains 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. 

h1. Technical Details 
h2. Extend SparkToParquetSchemaConverter class 

class SparkToParquetMetadataSchemaConverter extends SparkToParquetSchemaConverter {

  override def convert(catalystSchema: StructType): MessageType = {
    Types
      .buildMessage()
      .addFields(catalystSchema.map(convertFieldWithMetadata): _*)
      .named(ParquetSchemaConverter.SPARK_PARQUET_SCHEMA_NAME)
  }

  private def convertFieldWithMetadata(field: StructField): Type = {
    val extField = new ExtType[Any](convertField(field))
    val metaBuilder = new MetadataBuilder().withMetadata(field.metadata)
    val metaData = metaBuilder.getMap
    extField.setMetadata(metaData)
    extField
  }
}
h2. Extend ParquetWriteSupport

class CryptoParquetWriteSupport extends ParquetWriteSupport {

  override def init(configuration: Configuration): WriteContext = {
    val converter = new SparkToParquetMetadataSchemaConverter(configuration)
    createContext(configuration, converter)
  }
}
h2. Make WriteSupport configurable

class ParquetFileFormat {

  override def prepareWrite(...) {
    ...
    if (conf.get(ParquetOutputFormat.WRITE_SUPPORT_CLASS) == null) {
      ParquetOutputFormat.setWriteSupportClass(job, classOf[ParquetWriteSupport])
    }
    ...
  }
}
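
A minimal usage sketch of how a job could opt in to the extended WriteSupport once 
the class is configurable as proposed above. Only ParquetOutputFormat.WRITE_SUPPORT_CLASS 
("parquet.write.support.class") is an existing Parquet property; the package name, output 
path, and the use of the Hadoop configuration as the opt-in mechanism are assumptions 
made for illustration.

import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.spark.sql.SparkSession

object WriteSupportConfigExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-support-example").getOrCreate()

    // Point Parquet's write-support property at the extended class; with the
    // prepareWrite() change sketched above, Spark leaves a user-supplied value
    // in place instead of overwriting it with the built-in ParquetWriteSupport.
    spark.sparkContext.hadoopConfiguration.set(
      ParquetOutputFormat.WRITE_SUPPORT_CLASS,   // "parquet.write.support.class"
      "com.example.CryptoParquetWriteSupport")   // hypothetical package for the class above

    import spark.implicits._
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    df.write.mode("overwrite").parquet("/tmp/field-metadata-example")

    spark.stop()
  }
}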
h1. Verification 

The 
[ParquetHelloWorld.java|https://github.com/shangxinli/parquet-writesupport-extensions/blob/master/src/main/java/com/uber/ParquetHelloWorld.java]
 in the GitHub repository 
[parquet-writesupport-extensions|https://github.com/shangxinli/parquet-writesupport-extensions]
 has a sample verification of passing down the field metadata and performing 
column encryption. 
h1. Dependency
 * Parquet-1178
 * Parquet-1396
 * Parquet-1397


