[jira] [Resolved] (PARQUET-2364) Encrypt all columns option

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2364.
---
Fix Version/s: 1.14.0
   Resolution: Fixed

> Encrypt all columns option
> --
>
> Key: PARQUET-2364
> URL: https://issues.apache.org/jira/browse/PARQUET-2364
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> The column encryption mode currently encrypts only the explicitly specified 
> columns. Other columns stay unencrypted. This Jira will add an option to 
> encrypt (and tamper-proof) the other columns with the default footer key.
> Decryption / reading is not affected. The current readers will be able to 
> decrypt the new files, as long as they have access to the required keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2370.
---
Resolution: Fixed

> Crypto factory activation of "all column encryption" mode
> -
>
> Key: PARQUET-2370
> URL: https://issues.apache.org/jira/browse/PARQUET-2370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> Enable the crypto factory to activate the "encrypt all columns" option 
> (https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.





[jira] [Updated] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-11-08 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2370:
--
Fix Version/s: 1.14.0

> Crypto factory activation of "all column encryption" mode
> -
>
> Key: PARQUET-2370
> URL: https://issues.apache.org/jira/browse/PARQUET-2370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.14.0
>
>
> Enable the crypto factory to activate the "encrypt all columns" option 
> (https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.





[jira] [Created] (PARQUET-2370) Crypto factory activation of "all column encryption" mode

2023-10-23 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2370:
-

 Summary: Crypto factory activation of "all column encryption" mode
 Key: PARQUET-2370
 URL: https://issues.apache.org/jira/browse/PARQUET-2370
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Enable the crypto factory to activate the "encrypt all columns" option 
(https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.
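
At the factory level, activation would presumably reuse the existing 
PropertiesDrivenCryptoFactory setup with one extra property. The property name 
"parquet.encryption.uniform.key" below is an assumption based on this Jira, 
not a verified parquet-mr 1.14.0 API; check the release documentation:

```java
// Hedged sketch: Hadoop configuration for the PropertiesDrivenCryptoFactory.
// "parquet.encryption.uniform.key" is an assumed property name.
Configuration conf = new Configuration();
conf.set("parquet.crypto.factory.class",
    "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
conf.set("parquet.encryption.kms.client.class",
    "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
conf.set("parquet.encryption.key.list", "keyz: BAECAAECAAECAAECAAECAA==");
// Encrypt all columns (and the footer) with a single master key:
conf.set("parquet.encryption.uniform.key", "keyz");
```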





[jira] [Created] (PARQUET-2364) Encrypt all columns option

2023-10-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2364:
-

 Summary: Encrypt all columns option
 Key: PARQUET-2364
 URL: https://issues.apache.org/jira/browse/PARQUET-2364
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


The column encryption mode currently encrypts only the explicitly specified 
columns. Other columns stay unencrypted. This Jira will add an option to 
encrypt (and tamper-proof) the other columns with the default footer key.

Decryption / reading is not affected. The current readers will be able to 
decrypt the new files, as long as they have access to the required keys.
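
At the API level, the new mode could plausibly surface as a builder flag on 
FileEncryptionProperties; the method name below is illustrative only, not the 
merged parquet-mr API:

```java
// Hedged sketch of a possible low-level API shape. withCompleteColumnEncryption()
// is an assumed name for the new option, not verified against 1.14.0.
FileEncryptionProperties encryptionProperties =
    FileEncryptionProperties.builder(footerKeyBytes)
        .withEncryptedColumns(explicitColumnProperties) // explicitly keyed columns
        .withCompleteColumnEncryption() // new: remaining columns use the footer key
        .build();
```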





[jira] [Commented] (PARQUET-2223) Parquet Data Masking for Column Encryption

2023-06-16 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733538#comment-17733538
 ] 

Gidon Gershinsky commented on PARQUET-2223:
---

Yep, I also think so. I'll have a look at the current version of the design 
document.

> Parquet Data Masking for Column Encryption
> --
>
> Key: PARQUET-2223
> URL: https://issues.apache.org/jira/browse/PARQUET-2223
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Jiashen Zhang
>Priority: Major
>
> h1. Background
> h2. What is Data Masking?
> Data masking is a technique used to protect sensitive data by replacing it 
> with modified or obscured values. The purpose of data masking is to ensure 
> that sensitive information, such as Personally Identifiable Information 
> (PII), remains hidden from unauthorized users while allowing authorized users 
> to perform their tasks.
> Here are a few key points about data masking:
>  * Protection of Sensitive Data: Data masking helps to safeguard sensitive 
> data, such as Social Security numbers, credit card numbers, names, addresses, 
> and other personally identifiable information. By applying masking 
> techniques, the original values are replaced with fictional or transformed 
> data that retains the format and structure but removes any identifiable 
> information.
>  * Controlled Access: Data masking enables controlled access to sensitive 
> data. Authorized users, typically with appropriate permissions, can access 
> the unmasked or original data, while unauthorized users or users without the 
> necessary permissions will only see the masked data.
>  * Various Masking Techniques: There are different masking techniques 
> available, depending on the specific data privacy requirements and use cases. 
> Some commonly used techniques include:
>  ** Nullification: Replacing original data with NULL values.
>  ** Randomization: Replacing sensitive data with randomly generated values.
>  ** Substitution: Replacing sensitive data with fictional but realistic 
> values.
>  ** Hashing: Transforming sensitive data into irreversible hashed values.
>  ** Redaction: Removing or masking specific parts of sensitive data while 
> retaining other non-sensitive information.
>  * Compliance and Data Privacy: Data masking is often employed to comply with 
> data protection regulations and maintain data privacy. By masking sensitive 
> data, we can reduce the risk of data breaches and unauthorized access while 
> still allowing legitimate users to perform their tasks.
>  * Maintaining Data Consistency: Data masking techniques aim to maintain data 
> consistency and integrity by ensuring that masked data retains the original 
> data's format, structure, and relationships. This allows applications and 
> processes that rely on the data to continue functioning correctly.
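
The masking techniques listed above can be sketched in a few lines. The helper 
names below are hypothetical and not part of any Parquet API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helpers illustrating nullification, hashing, and redaction.
public class MaskingDemo {

    // Redaction: mask every digit except the last four, preserving format.
    static String redact(String ssn) {
        return ssn.replaceAll("[0-9](?=[0-9-]*[0-9]{4}$)", "*");
    }

    // Hashing: transform the value into an irreversible SHA-256 digest (hex).
    static String hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }

    // Nullification: replace the original value with NULL.
    static String nullify(String value) {
        return null;
    }

    public static void main(String[] args) {
        System.out.println(redact("123-45-6789")); // ***-**-6789
        System.out.println(hash("123-45-6789").length()); // 64 hex chars
    }
}
```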
> h2. Why do we need it?
> Data masking serves several important purposes and provides numerous 
> benefits. Here are some reasons why we need data masking:
>  * Data Privacy and Compliance: Data masking helps us comply with data 
> privacy regulations such as the General Data Protection Regulation (GDPR) and 
> the Health Insurance Portability and Accountability Act (HIPAA). These 
> regulations require us to protect sensitive data and ensure that it is only 
> accessible to authorized individuals. Data masking enables us to comply with 
> these regulations by de-identifying sensitive data.
>  * Minimize Data Exposure: By masking sensitive data, we can reduce the risk 
> of data breaches and unauthorized access. If a security breach occurs, the 
> exposed data will be meaningless to unauthorized users due to the masking. 
> This helps protect individuals' privacy and prevents misuse of sensitive 
> information.
>  * Secure Testing and Development Environments: Data masking is particularly 
> useful in creating secure testing and development environments. By masking 
> sensitive data, we can use realistic but fictional data for testing, 
> analysis, and development activities without exposing real personal or 
> sensitive information.
>  * Enhanced Data Sharing: Data masking allows us to share data with external 
> parties, such as partners or third-party vendors, while protecting sensitive 
> information. Masked data can be shared with confidence, as the original 
> sensitive values are replaced with transformed or fictional data.
>  * Employee Privacy: Data masking helps protect employee privacy by 
> obfuscating sensitive employee information, such as social security numbers 
> or salary details, in databases or HR systems. This safeguards employees' 
> personal data from unauthorized access or internal misuse.
>  * Insider Threat Mitigation: Data masking reduces the risk posed by insider 
> threats, where authorized individuals intentionally or 

[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2023-05-04 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17719294#comment-17719294
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

[~Nageswaran] A couple of updates on this.

We should be able to skip this verification for encrypted files; a pull request 
has been sent to parquet-mr.

Also, I've tried the new Spark 3.4.0 (as is, no modifications) with the Scala 
test above, and no exception was thrown. The updated Spark code probably 
bypasses the problematic parquet read path. Can you check whether Spark 3.4.0 
works for your use case?

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, we found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot read 
> the remaining fields of the nested column without those keys.
> Example:
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> {code}
> In the case class `SquareItem`, `nestedCol` is a nested field, and I want to 
> encrypt the field `ic` within it. 
>  
> I also want the footer to be unencrypted, so that the encrypted parquet file 
> can be used by legacy applications. 
>  
> Encryption is successful; however, when I query the parquet file using Spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted fields of `nestedCol`, such as `sic`. I was expecting that only 
> the `nestedCol` field `ic` would not be queryable.
>  
>  
> Reproducer: Spark 3.3.0, using spark-shell. 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create the encrypted data:
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol,nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, accessing a non-encrypted nested field from a new 
> spark-shell:
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> Since nestedCol.sic is not encrypted, I was expecting results, but I get the 
> error below:
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> 

[jira] [Created] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem

2023-05-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2297:
-

 Summary: Encrypted files should not be checked for delta encoding 
problem
 Key: PARQUET-2297
 URL: https://issues.apache.org/jira/browse/PARQUET-2297
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.13.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky
 Fix For: 1.14.0, 1.13.1


The delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) 
has been fixed in writers since parquet-mr 1.8. That fix also added a 
`checkDeltaByteArrayProblem` method to readers, which runs over all columns and 
checks older files for the problem. 

This now triggers an unrelated exception when reading encrypted files, in the 
following situation: reading an unencrypted column without having keys for the 
encrypted columns (see https://issues.apache.org/jira/browse/PARQUET-2193). 
This happens in Spark with nested columns (files with only regular columns are 
fine).

Possible solution: don't call the `checkDeltaByteArrayProblem` method for 
encrypted files, because such files can only be written with parquet-mr 1.12 
and newer, where the delta encoding problem is already fixed.
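
The proposed skip could be as small as a guard around the existing call; the 
surrounding names below are assumptions about the reader code, not the actual 
patch:

```java
// Hedged sketch: encrypted files are always written by parquet-mr >= 1.12,
// where PARQUET-246 is fixed, so the legacy check applies only to plaintext files.
if (fileDecryptor == null) { // no decryptor => plaintext file, keep the check
  checkDeltaByteArrayProblem(fileMetaData, configuration, blocks.get(currentBlock));
}
```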





[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2023-05-02 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718795#comment-17718795
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Yep, sorry about the delay. This turned out to be more challenging than I 
hoped; a fix at the encryption code level would require changes in the format 
specification. That is a rather big deal, and likely unjustified in this case. 
The immediate trigger is the `checkDeltaByteArrayProblem` verification, added 8 
years ago to detect encoding irregularities in older files. For some reason 
this check is done only on files with nested columns, and not on files with 
regular columns (at least in Spark). Maybe the right thing today is to remove 
that verification. I'll check with the community.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, we found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot read 
> the remaining fields of the nested column without those keys.
> Example:
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> {code}
> In the case class `SquareItem`, `nestedCol` is a nested field, and I want to 
> encrypt the field `ic` within it. 
>  
> I also want the footer to be unencrypted, so that the encrypted parquet file 
> can be used by legacy applications. 
>  
> Encryption is successful; however, when I query the parquet file using Spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted fields of `nestedCol`, such as `sic`. I was expecting that only 
> the `nestedCol` field `ic` would not be queryable.
>  
>  
> Reproducer: Spark 3.3.0, using spark-shell. 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create the encrypted data:
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol,nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, accessing a non-encrypted nested field from a new 
> spark-shell:
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> Since nestedCol.sic is not encrypted, I was expecting results, but I get the 
> error below:
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at 
> 

[jira] [Assigned] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky reassigned PARQUET-2103:
-

Assignee: Gidon Gershinsky

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code:
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
> triggers an exception _*in encrypted files with plaintext footer*_:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> 

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Affects Version/s: 1.12.3

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code:
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
> triggers an exception _*in encrypted files with plaintext footer*_:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
>  

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Priority: Minor  (was: Major)

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code:
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
> triggers an exception _*in encrypted files with plaintext footer*_:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
>  
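A defensive way to keep the debug printout without tripping on encrypted columns is to catch the runtime exception and degrade to a short note. Below is a minimal self-contained sketch of that pattern; the class and method names are hypothetical and this is not necessarily parquet-mr's actual fix.

```java
import java.util.function.Supplier;

public class SafeDebugLog {
    // Wrap a pretty-printer that may throw at debug time (as
    // ParquetMetadata.toPrettyJSON does on encrypted columns with a null file
    // decryptor) and fall back to a short diagnostic instead of propagating.
    static String debugString(Supplier<String> prettyJson) {
        try {
            return prettyJson.get();
        } catch (RuntimeException e) {
            return "Footer metadata not printable: " + e.getMessage();
        }
    }
}
```

The caller would still guard with LOG.isDebugEnabled(), so the catch only runs in debug mode.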

[jira] [Created] (PARQUET-2208) Add details to nested column encryption config doc and exception text

2022-10-31 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2208:
-

 Summary: Add details to nested column encryption config doc and 
exception text
 Key: PARQUET-2208
 URL: https://issues.apache.org/jira/browse/PARQUET-2208
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.3
Reporter: Gidon Gershinsky


Parquet columnar encryption requires an explicit full path for each column to 
be encrypted. If a partial path is configured, the thrown exception is not 
informative enough and doesn't help much in correcting the parameters.
The goal is to make the exception print something like:
_Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted 
column [rider] not in file schema column list: [foo] , [rider.list.element.foo] 
, [rider.list.element.bar] , [ts] , [uuid]_
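A sketch of how such a message could be assembled; the helper name and exact formatting below are illustrative, not the committed parquet-mr code.

```java
import java.util.List;

public class EncryptedColumnCheck {
    // Build an informative exception message that lists the file schema
    // columns, so a partial (mis-configured) column path is easy to spot.
    static String notInSchemaMessage(String encryptedColumn, List<String> schemaColumns) {
        StringBuilder sb = new StringBuilder();
        sb.append("Encrypted column [").append(encryptedColumn)
          .append("] not in file schema column list: ");
        for (int i = 0; i < schemaColumns.size(); i++) {
            if (i > 0) sb.append(" , ");
            sb.append('[').append(schemaColumns.get(i)).append(']');
        }
        return sb.toString();
    }
}
```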
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2022-10-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614917#comment-17614917
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Welcome.

From the sound of it, this might require each file to be processed by one 
thread only (instead of reading a single file by multiple threads), which 
should be OK in typical use cases where one thread/executor reads multiple 
files anyway. But I'll have a deeper look at this.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of the nested column either. 
> Example 
> `
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> `{code}
> In the case class `SquareItem`, the `nestedCol` field is nested and I want 
> to encrypt the field `ic` within it. 
>  
> I also want the footer to be non-encrypted, so that legacy applications can 
> use the encrypted parquet file. 
>  
> Encryption is successful; however, when I query the parquet file using spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted field `sic` of `nestedCol`. I was expecting that only the 
> field `ic` of `nestedCol` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data:
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol, nestedItem(i,i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data trying to access non encrypted nested field by opening 
> a new spark-shell
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see, nestedCol.sic is not encrypted, so I was expecting results, 
> but I get the below error
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:375)
>   at 
> 

[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2022-09-29 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610868#comment-17610868
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Hmm, looks like this method runs over all columns, projected and not projected:
org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
 

Please check if setting "parquet.split.files" to "false" solves this problem.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I want to read this parquet directory from other 
> applications that do not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of the nested column either. 
> Example 
> `
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> `{code}
> In the case class `SquareItem`, the `nestedCol` field is nested and I want 
> to encrypt the field `ic` within it. 
>  
> I also want the footer to be non-encrypted, so that legacy applications can 
> use the encrypted parquet file. 
>  
> Encryption is successful; however, when I query the parquet file using spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted field `sic` of `nestedCol`. I was expecting that only the 
> field `ic` of `nestedCol` would not be queryable.
>  
>  
> Reproducer. 
> Spark 3.3.0 Using Spark-shell 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to spark-jars folder
> Code to create encrypted data:
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol, nestedItem(i,i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).option("parquet.encryption.footer.key",
>  "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data trying to access non encrypted nested field by opening 
> a new spark-shell
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> As you can see, nestedCol.sic is not encrypted, so I was expecting results, 
> but I get the below error
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:375)
>   at 
> 

[jira] [Commented] (PARQUET-2194) parquet.encryption.plaintext.footer parameter being true, code expects parquet.encryption.footer.key

2022-09-29 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610855#comment-17610855
 ] 

Gidon Gershinsky commented on PARQUET-2194:
---

A footer key is required in the plaintext footer mode as well - it is used to 
sign the footer; see 
https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#class-propertiesdrivencryptofactory

> parquet.encryption.plaintext.footer parameter being true, code expects 
> parquet.encryption.footer.key
> 
>
> Key: PARQUET-2194
> URL: https://issues.apache.org/jira/browse/PARQUET-2194
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> I want the footer in my parquet file to be non-encrypted, so I set 
> _parquet.encryption.plaintext.footer_ to _true_, but when I run my code, 
> parquet-mr still expects a value for the property 
> _parquet.encryption.footer.key_.
> Reproducer
> Spark 3.3.0 
> Download the 
> [file|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and place it in the spark jars directory 
> using spark-shell
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class" 
> ,"org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory") 
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" 
> ,"org.apache.parquet.crypto.keytools.mocks.InMemoryKMS") 
> sc.hadoopConfiguration.set("parquet.encryption.key.list","key1a: 
> BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: 
> BAECAAECAAECAAECAAECAA==") 
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally","false")
>  
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted" 
> val partitionCol = 1 
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0) 
> case class SquareItem(int_column: Int, square_int_column : Double, 
> partitionCol: Int, nestedCol :nestedItem) 
> val dataRange = (1 to 100).toList 
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, 
> scala.math.pow(i,2), partitionCol, nestedItem(i,i)))) 
> squares.toDS().show() 
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys",
>  
> "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer",true).parquet(encryptedParquetPath){code}
> I get the below error. My expectation is: if I set my footer to be plain 
> text, why do we need a footer key?
>  
> {code:java}
>  
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Undefined 
> footer key
>   at 
> org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:88)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:155)
>   at 
> org.apache.spark.sql.execution.datasources.BaseDynamicPartitionDataWriter.renewCurrentWriter(FileFormatDataWriter.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionDataSingleWriter.write(FileFormatDataWriter.scala:365)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:331)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:338)
>   ... 9 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2197) Document uniform encryption

2022-09-28 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2197:
-

 Summary: Document uniform encryption
 Key: PARQUET-2197
 URL: https://issues.apache.org/jira/browse/PARQUET-2197
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.3
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Document the hadoop parameter for uniform encryption



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-14 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605098#comment-17605098
 ] 

Gidon Gershinsky commented on PARQUET-1711:
---

[~emkornfield] what do you think about these 3 alternatives?

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current implementation 
> keeps referencing the 6 possible value types defined in protobuf (null, bool, 
> number, string, struct, list) and enters an infinite loop when it reaches 
> "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}
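The cycle behind the overflow (Struct references Value, which references Struct, and so on) is typically broken with a depth limit during schema conversion. Below is a self-contained toy model of that guard; the type graph and names are illustrative, not the ProtoSchemaConverter API.

```java
import java.util.List;
import java.util.Map;

public class RecursiveSchemaSketch {
    // Count fields reachable from a type, truncating recursive branches at a
    // configured depth so mutually recursive types (Struct <-> Value) terminate.
    static int countFields(String type, Map<String, List<String>> schema, int depthLeft) {
        if (depthLeft == 0) {
            return 0; // truncate instead of recursing forever
        }
        int n = 0;
        for (String child : schema.getOrDefault(type, List.of())) {
            n += 1 + countFields(child, schema, depthLeft - 1);
        }
        return n;
    }
}
```

Without the depthLeft guard, the Struct <-> Value cycle above recurses until the stack overflows, which is exactly the failure mode in the reported trace.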



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-08 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602127#comment-17602127
 ] 

Gidon Gershinsky edited comment on PARQUET-1711 at 9/9/22 5:45 AM:
---

Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today, 
[https://github.com/apache/parquet-mr/pull/995]

[https://github.com/apache/parquet-mr/pull/445]

[https://github.com/apache/parquet-mr/pull/988]

Can you take a look and provide your opinion on them?


was (Author: gershinsky):
Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today, 

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current implementation 
> keeps referencing the 6 possible value types defined in protobuf (null, bool, 
> number, string, struct, list) and enters an infinite loop when it reaches 
> "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-08 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602127#comment-17602127
 ] 

Gidon Gershinsky commented on PARQUET-1711:
---

Hi to all on this Jira. Looks like we have a number of alternative solutions to 
this problem today, 

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current implementation 
> keeps referencing the 6 possible value types defined in protobuf (null, bool, 
> number, string, struct, list) and enters an infinite loop when it reaches 
> "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2040) Uniform encryption

2022-07-28 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2040.
---
Resolution: Fixed

> Uniform encryption
> --
>
> Key: PARQUET-2040
> URL: https://issues.apache.org/jira/browse/PARQUET-2040
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> PME low-level spec supports using the same encryption key for all columns, 
> which is useful in a number of scenarios. However, this feature is not 
> exposed yet in the high-level API, because its misuse can break the NIST 
> limit on the number of AES GCM operations with one key. We will develop 
> limit-enforcing code and provide an API for uniform table encryption.
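The limit-enforcing idea can be as simple as a per-key operation counter that refuses further AES GCM operations once a threshold is reached, forcing key rotation. A minimal sketch under that assumption follows; the class name and threshold handling are hypothetical, not the parquet-mr implementation.

```java
public class KeyUsageLimiter {
    private final long maxOperations; // chosen well below the NIST AES-GCM bound
    private long operations = 0;

    KeyUsageLimiter(long maxOperations) {
        this.maxOperations = maxOperations;
    }

    // Returns true if one more encryption operation may use the current key;
    // false means the caller must rotate to a fresh key first.
    synchronized boolean tryAcquire() {
        if (operations >= maxOperations) {
            return false;
        }
        operations++;
        return true;
    }
}
```

A writer would check tryAcquire() before each module encryption and switch keys (or fail) when it returns false.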



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2136) File writer construction with encryptor

2022-07-28 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2136.
---
Resolution: Fixed

> File writer construction with encryptor
> ---
>
> Key: PARQUET-2136
> URL: https://issues.apache.org/jira/browse/PARQUET-2136
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, a file writer object can be constructed with encryption 
> properties. We need an additional constructor, that can accept an encryptor 
> instead, in order to support lazy materialization of parquet file writers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2120.
---
Resolution: Fixed

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS   _ 1 46.00 B0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS _ R 200   0.34 B 0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.
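The fix amounts to a null check before dereferencing the returned page. A minimal sketch of the guard (class and method names here are illustrative, not the actual parquet-cli code): `readDictionaryPage` can return `null` for a column chunk without dictionary encoding, so the result must be checked before use.

```java
// Hypothetical sketch of the missing null guard. In the real command, the
// dictionary page would come from readDictionaryPage; here we just model
// the null vs. non-null cases.
public class DictionaryGuardSketch {
    public static String describePage(Object dictionaryPage) {
        if (dictionaryPage == null) {
            // Skip chunks without a dictionary instead of throwing an NPE
            return "no dictionary";
        }
        return "dictionary encoding: " + dictionaryPage;
    }

    public static void main(String[] args) {
        System.out.println(describePage(null));                // like row group 0 above
        System.out.println(describePage("PLAIN_DICTIONARY"));  // like row group 1 above
    }
}
```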





[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-06-21 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556828#comment-17556828
 ] 

Gidon Gershinsky commented on PARQUET-2120:
---

[~shangxinli] and the Parquet community, can you assign this Jira to [~rshkv]?

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS   _ 1 46.00 B0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS _ R 200   0.34 B 0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.





[jira] [Resolved] (PARQUET-2148) Enable uniform decryption with plaintext footer

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2148.
---
Resolution: Fixed

> Enable uniform decryption with plaintext footer
> ---
>
> Key: PARQUET-2148
> URL: https://issues.apache.org/jira/browse/PARQUET-2148
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, uniform decryption is not enabled in the plaintext footer mode - 
> for no good reason. The column metadata is available; we just need to decrypt 
> and use it.





[jira] [Resolved] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2144.
---
Resolution: Fixed

> Fix ColumnIndexBuilder for notIn predicate
> --
>
> Key: PARQUET-2144
> URL: https://issues.apache.org/jira/browse/PARQUET-2144
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 1.12.3
>
>
> Column Index is not built correctly for notIn predicate. Need to fix the bug.





[jira] [Resolved] (PARQUET-2145) Release 1.12.3

2022-06-21 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-2145.
---
Resolution: Fixed

> Release 1.12.3
> --
>
> Key: PARQUET-2145
> URL: https://issues.apache.org/jira/browse/PARQUET-2145
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>






[jira] [Commented] (PARQUET-2145) Release 1.12.3

2022-06-21 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556825#comment-17556825
 ] 

Gidon Gershinsky commented on PARQUET-2145:
---

This version is already released, 
[https://parquet.incubator.apache.org/blog/2022/05/26/1.12.3/]

 

Let's indeed close this Jira.

> Release 1.12.3
> --
>
> Key: PARQUET-2145
> URL: https://issues.apache.org/jira/browse/PARQUET-2145
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>






[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-13 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553425#comment-17553425
 ] 

Gidon Gershinsky commented on PARQUET-2117:
---

[~sha...@uber.com] Could you add [~prakharjain09] to the Parquet contributors?

> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes API’s to read 
> parquet file in columnar fashion or record-by-record.
> It will be great to extend them to also support rowPosition API which can 
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet 
> table (e.g.  Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from 
> such a functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
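Conceptually, such a row position can be derived by adding the record's index within its row group to the cumulative row count of all preceding row groups. A pure-Java illustration of that arithmetic (not the actual parquet-mr API; names are hypothetical):

```java
// Illustrative sketch: derive a file-wide row position from per-row-group
// row counts plus the record's index inside its own row group.
public class RowPositionSketch {
    public static long rowPosition(long[] rowGroupRowCounts, int groupIndex, long indexInGroup) {
        long base = 0;
        for (int i = 0; i < groupIndex; i++) {
            base += rowGroupRowCounts[i];  // rows in all preceding row groups
        }
        return base + indexInGroup;
    }

    public static void main(String[] args) {
        long[] counts = {1, 200};  // row counts per row group
        // 6th record (index 5) of row group 1 sits after the 1 row of group 0
        System.out.println(rowPosition(counts, 1, 5));
    }
}
```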





[jira] [Updated] (PARQUET-2101) Fix wrong descriptions about the default block size

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2101:
--
Fix Version/s: 1.12.3

> Fix wrong descriptions about the default block size
> ---
>
> Key: PARQUET-2101
> URL: https://issues.apache.org/jira/browse/PARQUET-2101
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr, parquet-protobuf
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Trivial
> Fix For: 1.12.3
>
>
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80
> These javadocs say the default block size is 50 MB, but it's actually 128 MB.





[jira] [Updated] (PARQUET-2081) Encryption translation tool - Parquet-hadoop

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2081:
--
Fix Version/s: 1.12.3
   (was: 1.13.0)

> Encryption translation tool - Parquet-hadoop
> 
>
> Key: PARQUET-2081
> URL: https://issues.apache.org/jira/browse/PARQUET-2081
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> This is to implement the core part of the Encryption translation tool in 
> parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to 
> integrate with key tools for encryption properties. 





[jira] [Updated] (PARQUET-2102) Typo in ColumnIndexBase toString

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2102:
--
Fix Version/s: 1.12.3

> Typo in ColumnIndexBase toString
> 
>
> Key: PARQUET-2102
> URL: https://issues.apache.org/jira/browse/PARQUET-2102
> Project: Parquet
>  Issue Type: Bug
>Reporter: Ryan Rupp
>Assignee: Ryan Rupp
>Priority: Trivial
> Fix For: 1.12.3
>
>
> Trivial thing, but noticed [here|https://github.com/trinodb/trino/issues/9890] 
> since ColumnIndexBase.toString() was used in a wrapped exception message: 
> "boundary" has a typo (boudary).





[jira] [Updated] (PARQUET-2040) Uniform encryption

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2040:
--
Fix Version/s: 1.12.3

> Uniform encryption
> --
>
> Key: PARQUET-2040
> URL: https://issues.apache.org/jira/browse/PARQUET-2040
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> PME low-level spec supports using the same encryption key for all columns, 
> which is useful in a number of scenarios. However, this feature is not 
> exposed yet in the high-level API, because its misuse can break the NIST 
> limit on the number of AES GCM operations with one key. We will develop 
> limit-enforcing code and provide an API for uniform table encryption.
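The limit-enforcing part can be sketched as a simple per-key operation counter (a hypothetical illustration; the class name, method names, and the idea of a configurable threshold are assumptions, not the parquet-mr implementation — NIST SP 800-38D is what bounds the number of GCM invocations per key):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: cap the number of AES GCM operations performed with
// a single key, forcing a key rotation (or failure) once the cap is reached.
public class GcmOperationLimiter {
    private final AtomicLong operations = new AtomicLong();
    private final long maxOperations;

    public GcmOperationLimiter(long maxOperations) {
        this.maxOperations = maxOperations;
    }

    // Returns true if one more encryption with this key is allowed.
    public boolean tryAcquire() {
        return operations.incrementAndGet() <= maxOperations;
    }
}
```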





[jira] [Updated] (PARQUET-2076) Improve Travis CI build Performance

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2076:
--
Fix Version/s: 1.12.3

> Improve Travis CI build Performance
> ---
>
> Key: PARQUET-2076
> URL: https://issues.apache.org/jira/browse/PARQUET-2076
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Chen Zhang
>Priority: Trivial
> Fix For: 1.12.3
>
>
> According to [Common Build Problems - Travis CI 
> (travis-ci.com)|https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received],
>  we should carefully use travis_wait, as it may make the build unstable and 
> extend the build time.





[jira] [Updated] (PARQUET-2107) Travis failures

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2107:
--
Fix Version/s: 1.12.3

> Travis failures
> ---
>
> Key: PARQUET-2107
> URL: https://issues.apache.org/jira/browse/PARQUET-2107
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.3
>
>
> There have been Travis failures in our PRs for a while. See e.g. 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286





[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2106:
--
Fix Version/s: 1.12.3

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects – It's taking up to *16%* of total amount of 
> allocations in our benchmarks, putting substantial pressure on a Garbage 
> Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We're proposing to adjust lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot path of every Parquet 
> write, thereby causing substantial churn amplification.
>  
>  
>  
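An allocation-free unsigned lexicographical comparison over the backing arrays might look like the following (a sketch under the stated goal, not the actual parquet-mr patch):

```java
// Sketch: compare two byte ranges lexicographically as unsigned values,
// without wrapping them in ByteBuffers, so no per-comparison heap allocation.
public class UnsignedLexCompare {
    public static int compare(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            // Mask to compare bytes as unsigned values (0..255)
            int x = a[aOff + i] & 0xFF;
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen;  // on a common prefix, the shorter range sorts first
    }

    public static void main(String[] args) {
        // negative: {1, 2} sorts before {1, 3}
        System.out.println(compare(new byte[]{1, 2}, 0, 2, new byte[]{1, 3}, 0, 2));
    }
}
```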





[jira] [Updated] (PARQUET-2105) Refactor the test code of creating the test file

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2105:
--
Fix Version/s: 1.12.3

> Refactor the test code of creating the test file 
> -
>
> Key: PARQUET-2105
> URL: https://issues.apache.org/jira/browse/PARQUET-2105
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> In the tests, there are many places that need to create a test parquet file 
> with different settings. Currently, each test just writes its own file-creation 
> code. It would be better to have a test file builder for that. 





[jira] [Updated] (PARQUET-2112) Fix typo in MessageColumnIO

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2112:
--
Fix Version/s: 1.12.3
   (was: 1.13.0)

> Fix typo in MessageColumnIO
> ---
>
> Key: PARQUET-2112
> URL: https://issues.apache.org/jira/browse/PARQUET-2112
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.3
>
>
> Typo in the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes'.





[jira] [Updated] (PARQUET-2128) Bump Thrift to 0.16.0

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2128:
--
Fix Version/s: 1.12.3

> Bump Thrift to 0.16.0
> -
>
> Key: PARQUET-2128
> URL: https://issues.apache.org/jira/browse/PARQUET-2128
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
> Fix For: 1.12.3
>
>
> Thrift 0.16.0 has been released 
> https://github.com/apache/thrift/releases/tag/v0.16.0





[jira] [Updated] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2120:
--
Fix Version/s: 1.12.3

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Willi Raschkowski
>Priority: Minor
> Fix For: 1.12.3
>
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not 
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" 
> is null
>   at 
> org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet  
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS   _ 1 46.00 B0   "a" / "a"
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
> 
>  type  encodings count avg size   nulls   min / max
> col  BINARYS _ R 200   0.34 B 0   "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no 
> dictionary encoding. But for files that mix pages with and without dictionary 
> encoding (like above), the command will fail before getting to pages that 
> actually have dictionaries.
> The problem is that [this 
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
>  assumes {{readDictionaryPage}} always returns a page and doesn't handle when 
> it does not, i.e. when it returns {{null}}.





[jira] [Updated] (PARQUET-2129) Add uncompressedSize to "meta" output

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2129:
--
Fix Version/s: 1.12.3

> Add uncompressedSize to "meta" output
> -
>
> Key: PARQUET-2129
> URL: https://issues.apache.org/jira/browse/PARQUET-2129
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
> Fix For: 1.12.3
>
>
> The `uncompressedSize` is currently not printed in the output of the parquet 
> meta command. This PR adds the uncompressedSize into the output. 
> This was also reported by Deepak Gangwar. 





[jira] [Updated] (PARQUET-2121) Remove descriptions for the removed modules

2022-05-19 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2121:
--
Fix Version/s: 1.12.3

> Remove descriptions for the removed modules
> ---
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
> Fix For: 1.12.3
>
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions 
> still remain in some documents. They should be removed since their existence 
> is misleading.





[jira] [Updated] (PARQUET-2136) File writer construction with encryptor

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2136:
--
Fix Version/s: 1.12.3

> File writer construction with encryptor
> ---
>
> Key: PARQUET-2136
> URL: https://issues.apache.org/jira/browse/PARQUET-2136
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.3
>
>
> Currently, a file writer object can be constructed with encryption 
> properties. We need an additional constructor that can accept an encryptor 
> instead, in order to support lazy materialization of parquet file writers.





[jira] [Updated] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2144:
--
Fix Version/s: 1.12.3

> Fix ColumnIndexBuilder for notIn predicate
> --
>
> Key: PARQUET-2144
> URL: https://issues.apache.org/jira/browse/PARQUET-2144
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 1.12.3
>
>
> Column Index is not built correctly for notIn predicate. Need to fix the bug.





[jira] [Updated] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-05-18 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2127:
--
Fix Version/s: 1.12.3

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
> Fix For: 1.12.3
>
>
> The embedded jackson-databind:2.11.4 has a security risk: possible DoS when 
> using JDK serialization to serialize JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]); upgrading to 
> 2.13.1 fixes this.





[jira] [Created] (PARQUET-2148) Enable uniform decryption with plaintext footer

2022-05-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2148:
-

 Summary: Enable uniform decryption with plaintext footer
 Key: PARQUET-2148
 URL: https://issues.apache.org/jira/browse/PARQUET-2148
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky
 Fix For: 1.12.3


Currently, uniform decryption is not enabled in the plaintext footer mode - for 
no good reason. The column metadata is available; we just need to decrypt and 
use it.





[jira] [Created] (PARQUET-2145) Release 1.12.3

2022-05-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2145:
-

 Summary: Release 1.12.3
 Key: PARQUET-2145
 URL: https://issues.apache.org/jira/browse/PARQUET-2145
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Gidon Gershinsky
 Fix For: 1.12.3








[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-04-24 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526997#comment-17526997
 ] 

Gidon Gershinsky commented on PARQUET-2098:
---

[~theosib-amazon] I got ~half of this done (code; not the unit tests yet). But in 
the meantime, it became unclear whether we need this functionality (in the 
upcoming release). Do you have a use case for it?

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently the BlockCipher interface has methods that don't let the caller 
> specify a length/offset. In some use cases, like Presto, it is necessary to 
> pass in a byte array where the data to be encrypted occupies only part of the 
> array. So we need to add a new method like the one below for decrypt. Similar 
> methods might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);
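The proposed overload can be sketched as follows (the interface name and the default-method delegation are illustrative assumptions, not the final parquet-mr API); the point is to let callers decrypt a sub-range without first copying it out of a larger buffer:

```java
import java.util.Arrays;

// Hypothetical sketch of the extended decryptor interface.
public interface BlockDecryptorSketch {
    // Existing style of method: the whole array is the ciphertext.
    byte[] decrypt(byte[] ciphertext, byte[] aad);

    // Proposed overload: decrypt a sub-range of the input array. This naive
    // default copies the range and delegates; a real implementation would
    // operate on the range directly to avoid the copy.
    default byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] aad) {
        byte[] slice = Arrays.copyOfRange(ciphertext, cipherTextOffset, cipherTextOffset + cipherTextLength);
        return decrypt(slice, aad);
    }
}
```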





[jira] [Created] (PARQUET-2136) File writer construction with encryptor

2022-04-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2136:
-

 Summary: File writer construction with encryptor
 Key: PARQUET-2136
 URL: https://issues.apache.org/jira/browse/PARQUET-2136
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Currently, a file writer object can be constructed with encryption properties. 
We need an additional constructor that can accept an encryptor instead, in 
order to support lazy materialization of parquet file writers.





[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-01-27 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483575#comment-17483575
 ] 

Gidon Gershinsky commented on PARQUET-2098:
---

sure, I can take this one

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently the BlockCipher interface has methods that don't let the caller 
> specify a length/offset. In some use cases, like Presto, it is necessary to 
> pass in a byte array where the data to be encrypted occupies only part of the 
> array. So we need to add a new method like the one below for decrypt. Similar 
> methods might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);





[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-24 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448596#comment-17448596
 ] 

Gidon Gershinsky commented on PARQUET-2103:
---

[~gszadovszky] thanks for pointing me in the right direction. We can check that a 
file is encrypted, and then skip printing its column metadata - this solves the 
problem at hand. We will still be able to print the file-wide metadata (as 
opposed to the per-column metadata, which is encrypted with column-specific 
keys). I'll start working on a patch. 
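The approach described above can be sketched as a simple guard around the debug print (names are hypothetical; the actual patch works on ParquetMetadata and the file decryptor): when the file is encrypted, print only the file-wide metadata and omit the per-column metadata that would require column keys.

```java
// Hypothetical sketch of guarding the pretty-print for encrypted files.
public class MetadataLogSketch {
    public static String toDebugString(boolean encrypted, String fileWideMeta, String columnMeta) {
        if (encrypted) {
            // Per-column metadata may be encrypted with column-specific keys;
            // print only the file-wide metadata to avoid the crypto exception.
            return fileWideMeta + " (column metadata omitted: file is encrypted)";
        }
        return fileWideMeta + "\n" + columnMeta;
    }
}
```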

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> 

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-24 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Description: 
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*in encrypted files with plaintext footer*_ 

triggers an exception:

 
{{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
Null File Decryptor     }}

{{    at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]}}
{{    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]}}
{{    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]}}
{{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    ... 23 more}}

  was:
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*for unencrypted files*_ 


[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-22 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447372#comment-17447372
 ] 

Gidon Gershinsky commented on PARQUET-2103:
---

[~gszadovszky] [~sha...@uber.com] I would appreciate your advice on the 
solution options (here or at the sync call). This appears to be a print with a 
blind reflection loop that calls all nested classes / methods of an object. 
Since v1.12.0, there is an EncryptedColumnChunkMetaData class inside 
ColumnChunkMetaData. Creating its instance and calling the "decrypt" method is 
not a good idea, for either unencrypted or encrypted files.
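A minimal sketch of one mitigation direction for the issue above: wrap the reflective pretty-print so that a getter that throws (as EncryptedColumnChunkMetaData.decryptIfNeeded does when no file decryptor is set) degrades to a placeholder instead of failing the read path. The class and method names here are illustrative only, not parquet-mr's actual fix.

```java
import java.util.function.Supplier;

public class SafeDebugPrint {
    // Returns the serialized metadata, or a short placeholder if any
    // runtime exception escapes the reflection-driven serialization.
    static String safePrettyPrint(Supplier<String> serializer) {
        try {
            return serializer.get();
        } catch (RuntimeException e) {
            return "<metadata not printable: " + e.getMessage() + ">";
        }
    }

    public static void main(String[] args) {
        // Simulates ParquetMetadata.toPrettyJSON hitting an encrypted
        // column chunk without a decryptor.
        String out = safePrettyPrint(() -> {
            throw new IllegalStateException("[id]. Null File Decryptor");
        });
        System.out.println(out);
    }
}
```

The same guard could sit directly around the `LOG.debug(ParquetMetadata.toPrettyJSON(...))` call, so debug logging never turns a readable file into a failure.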

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*for unencrypted files*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> 

[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-16 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2103:
--
Description: 
In debug mode, this code 
{{if (LOG.isDebugEnabled()) {}}
{{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
{{}}}
called in 
{{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}

 

_*for unencrypted files*_ 

triggers an exception:

 
{{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
Null File Decryptor     }}

{{    at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]}}
{{    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]}}
{{    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]}}
{{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]}}
{{    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]}}
{{    ... 23 more}}

  was:
In debug mode, this code 
if (LOG.isDebugEnabled()) \{
  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
}
called in 
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()

 

_*for unencrypted files*_ 

triggers an exception:

 
Caused by: 

[jira] [Created] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-16 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2103:
-

 Summary: crypto exception in print toPrettyJSON
 Key: PARQUET-2103
 URL: https://issues.apache.org/jira/browse/PARQUET-2103
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.12.2, 1.12.1, 1.12.0
Reporter: Gidon Gershinsky


In debug mode, this code 
if (LOG.isDebugEnabled()) \{
  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
}
called in 
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()

 

_*for unencrypted files*_ 

triggers an exception:

 
Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null 
File Decryptor     at 
org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    at 
org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]
    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059)
 ~[parquet-jackson-1.12.0jar:1.12.0]
    at 
org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
 ~[parquet-hadoop-1.12.0jar:1.12.0]
    ... 23 more



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-28 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421323#comment-17421323
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

Oh, sorry, done.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-28 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421193#comment-17421193
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

Hi [~gszadovszky] , I've prepared a short writeup on this alternative solution, 
with a discussion of the tradeoffs. After writing it, my feeling is that the 
trade-off is not in favor of this alternative option; but [here it 
goes|https://docs.google.com/document/d/1zr6-4em8C8DGi-D3jGosQe2gvJKluat-8uUbS0y7F-0/edit?usp=sharing],
 just to cover all bases. Will appreciate your opinion on this.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-14 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2080:
--
Description: 
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet implementation 
PARQUET-2089

  was:
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet (Arrow rep).


> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-14 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky reassigned PARQUET-2080:
-

Assignee: Gidon Gershinsky  (was: Gabor Szadovszky)

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-13 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-2080:
--
Description: 
Due to PARQUET-2078 RowGroup.file_offset is not reliable.

This field is also wrongly calculated in the C++ oss parquet (Arrow rep).

  was:Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall 
deprecate the field and add suggestions how to calculate the value.


> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet (Arrow rep).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-13 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414075#comment-17414075
 ] 

Gidon Gershinsky commented on PARQUET-2080:
---

[~gszadovszky] yes, I'll take it. There might be a different solution (also 
format-related) that bypasses the need to calculate such a parameter in any 
implementation, so it can be fully deprecated. I'll get back with the details 
and we'll discuss the trade-offs.

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate 
> the field and add suggestions how to calculate the value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-05 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393982#comment-17393982
 ] 

Gidon Gershinsky commented on PARQUET-2071:
---

A very useful tool, I'll be glad to review the pr.

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data to encrypted form at the page 
> level, without decoding to records and rewriting. This would speed up the 
> process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1908) CLONE - [C++] Update cpp crypto package to match signed-off specification

2021-08-03 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1908.
---
Resolution: Fixed

PR merged in May 2019

> CLONE - [C++] Update cpp crypto package to match signed-off specification
> -
>
> Key: PARQUET-1908
> URL: https://issues.apache.org/jira/browse/PARQUET-1908
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Akshay
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-5.0.0
>
>
> An initial version of crypto package is merged. This Jira updates the crypto 
> code to 
>  # conform to the signed-off specification (wire protocol updates, signature 
> tag creation, AAD support, etc.)
>  # improve performance by extending cipher lifecycle to file writing/reading 
> - instead of creating cipher on each encrypt/decrypt operation  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2053) Pluggable key material store

2021-05-25 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2053:
-

 Summary: Pluggable key material store
 Key: PARQUET-2053
 URL: https://issues.apache.org/jira/browse/PARQUET-2053
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Encryption key material can be stored either inside Parquet files, or outside 
(configurable). For outside storage, Parquet already has a pluggable interface 
for custom implementations, {{FileKeyMaterialStore}}, but no mechanism to load 
them (currently, one implementation is packaged in parquet-mr, and always 
loaded when outside storage is configured). We will provide a way to load 
custom implementations of the {{FileKeyMaterialStore}}.
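A hedged sketch of the loading mechanism described above, using reflective instantiation by class name. The interface shape, class names, and loading entry point are illustrative stand-ins, not the actual `FileKeyMaterialStore` API or the mechanism parquet-mr will ship.

```java
public class StoreLoader {
    // Stand-in for the FileKeyMaterialStore interface.
    public interface KeyMaterialStore {
        String name();
    }

    // Stand-in for the one implementation currently packaged in parquet-mr.
    public static class DefaultStore implements KeyMaterialStore {
        public String name() { return "default"; }
    }

    // Loads a custom store implementation from a configured class name,
    // the general pattern a pluggable mechanism would use.
    static KeyMaterialStore load(String className) throws Exception {
        Class<?> cls = Class.forName(className);
        return (KeyMaterialStore) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        KeyMaterialStore store = load("StoreLoader$DefaultStore");
        System.out.println(store.name()); // default
    }
}
```

In practice the class name would come from a Hadoop configuration property, mirroring how parquet-mr already loads custom crypto factories.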



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files

2021-05-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1230:
--
  Component/s: parquet-mr
Affects Version/s: 1.12.0

> CLI tools for encrypted files
> -
>
> Key: PARQUET-1230
> URL: https://issues.apache.org/jira/browse/PARQUET-1230
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files

2021-05-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1230:
--
Parent: (was: PARQUET-1178)
Issue Type: New Feature  (was: Sub-task)

> CLI tools for encrypted files
> -
>
> Key: PARQUET-1230
> URL: https://issues.apache.org/jira/browse/PARQUET-1230
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2040) Uniform encryption

2021-04-29 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2040:
-

 Summary: Uniform encryption
 Key: PARQUET-2040
 URL: https://issues.apache.org/jira/browse/PARQUET-2040
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


PME low-level spec supports using the same encryption key for all columns, 
which is useful in a number of scenarios. However, this feature is not yet 
exposed in the high-level API, because its misuse could exceed the NIST limit 
on the number of AES GCM operations with one key. We will develop 
limit-enforcing code and provide an API for uniform table encryption.
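The limit-enforcing idea can be sketched as a per-key operation counter: NIST SP 800-38D bounds the number of random-IV AES-GCM invocations per key at 2^32. Class and method names below are illustrative, not parquet-mr's actual implementation.

```java
public class GcmKeyUsage {
    // NIST SP 800-38D: at most 2^32 GCM invocations per key with random IVs.
    static final long MAX_OPS = 1L << 32;

    private long ops = 0;

    // Call before each encrypt operation; throws once the key is exhausted,
    // forcing the caller to rotate to a fresh key.
    void recordOperation() {
        if (ops >= MAX_OPS) {
            throw new IllegalStateException(
                "AES GCM operation limit reached for this key; rotate the key");
        }
        ops++;
    }

    long operations() { return ops; }

    public static void main(String[] args) {
        GcmKeyUsage usage = new GcmKeyUsage();
        usage.recordOperation();
        System.out.println("ops recorded: " + usage.operations());
    }
}
```

With one key shared across all columns of a table, such a counter is what lets a high-level API expose uniform encryption safely.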



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2033) Make "null decryptor" exception more informative

2021-04-20 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2033:
-

 Summary: Make "null decryptor" exception more informative
 Key: PARQUET-2033
 URL: https://issues.apache.org/jira/browse/PARQUET-2033
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Forgetting to pass decryption properties when reading an encrypted column in 
files with a plaintext footer results in a "null decryptor" exception thrown in 
the ColumnChunkMetaData class. The exception text should be updated to point to 
the likely reason.





[jira] [Created] (PARQUET-2014) Local key wrapping with rotation

2021-04-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-2014:
-

 Summary: Local key wrapping with rotation
 Key: PARQUET-2014
 URL: https://issues.apache.org/jira/browse/PARQUET-2014
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


parquet-mr 1.12.0 has experimental support for local wrapping of encryption 
keys that doesn't handle master key versions or key rotation. This Jira will 
add these capabilities.
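A possible shape for versioned local wrapping is sketched below. It is purely illustrative: the byte layout, class and method names, and the 128-bit GCM tag are assumptions, not the parquet-mr key material format. The point is that the master-key version travels with the wrapped key, so rotation can re-wrap under a newer version without rewriting data.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Map;

public class VersionedWrap {
    // Wrap a data encryption key (DEK) under a specific master-key version;
    // the version number is stored with the ciphertext so readers know which
    // master key to fetch for unwrapping.
    static byte[] wrap(byte[] dek, byte[] masterKey, int version) throws Exception {
        byte[] nonce = new byte[12];
        new SecureRandom().nextBytes(nonce);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(masterKey, "AES"),
               new GCMParameterSpec(128, nonce));
        byte[] ct = c.doFinal(dek);
        // Illustrative layout: [version:1][nonce:12][ciphertext]
        byte[] out = new byte[1 + 12 + ct.length];
        out[0] = (byte) version;
        System.arraycopy(nonce, 0, out, 1, 12);
        System.arraycopy(ct, 0, out, 13, ct.length);
        return out;
    }

    static byte[] unwrap(byte[] material, Map<Integer, byte[]> masterKeys) throws Exception {
        int version = material[0];
        byte[] nonce = Arrays.copyOfRange(material, 1, 13);
        byte[] ct = Arrays.copyOfRange(material, 13, material.length);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(masterKeys.get(version), "AES"),
               new GCMParameterSpec(128, nonce));
        return c.doFinal(ct);
    }

    public static void main(String[] args) throws Exception {
        byte[] masterV1 = new byte[16];            // stand-in for a KMS master key
        byte[] dek = new byte[16];
        new SecureRandom().nextBytes(dek);
        byte[] material = wrap(dek, masterV1, 1);  // version 1 recorded in material
        byte[] recovered = unwrap(material, Map.of(1, masterV1));
        System.out.println(Arrays.equals(dek, recovered));  // true
    }
}
```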





[jira] [Resolved] (PARQUET-1613) Key rotation tool

2021-04-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1613.
---
Resolution: Done

Handled by PR 615.

> Key rotation tool
> -
>
> Key: PARQUET-1613
> URL: https://issues.apache.org/jira/browse/PARQUET-1613
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gidon Gershinsky
>Assignee: Maya Anderson
>Priority: Major
>
> Rotates the master key, for both single and double wrappers.
> For the latter, enables support for a single KMS call per column, in readers 
> of any data sets.





[jira] [Resolved] (PARQUET-1612) Double wrapped key manager

2021-04-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1612.
---
Resolution: Done

Handled by PR 615.

> Double wrapped key manager
> --
>
> Key: PARQUET-1612
> URL: https://issues.apache.org/jira/browse/PARQUET-1612
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> To minimize interaction with KMS, this manager will wrap the encryption keys 
> twice.  Might be combined with key rotation for further optimization.





[jira] [Resolved] (PARQUET-1178) Parquet modular encryption

2021-03-26 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1178.
---
Resolution: Done

Released. Thanks to all who've contributed to this new Parquet capability!

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage, while enabling efficient analytics 
> on the data via reader-side extraction / authentication / decryption of the 
> data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.





[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2021-03-26 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--
Fix Version/s: 1.12.0

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage, while enabling efficient analytics 
> on the data via reader-side extraction / authentication / decryption of the 
> data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.





[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe

2021-03-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299054#comment-17299054
 ] 

Gidon Gershinsky commented on PARQUET-1997:
---

I recall we talked about that with Tham, but I forgot the details... [~thamha], 
do you remember what the return value is for?

> [C++] AesEncryptor and AesDecryptor primitives are unsafe
> -
>
> Key: PARQUET-1997
> URL: https://issues.apache.org/jira/browse/PARQUET-1997
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the 
> output buffer but without the output buffer length. The caller is required to 
> guess the expected output length. The functions also return the written 
> output length, but at this point it's too late: data may have been written 
> out of bounds.





[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe

2021-03-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299041#comment-17299041
 ] 

Gidon Gershinsky commented on PARQUET-1997:
---

[~apitrou] This point is addressed by the _int 
AesEncryptor::CiphertextSizeDelta()_ function - the caller uses it to allocate 
the output buffer. This is not part of the public Parquet API; the caller is the 
Parquet code itself.
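As an aside, the same allocate-then-encrypt pattern exists in the standard Java crypto API, where the caller queries the cipher for the exact output size instead of guessing it. A minimal sketch (plain javax.crypto usage, not Parquet code):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmBufferSizing {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] nonce = new byte[12];
        new SecureRandom().nextBytes(nonce);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));

        byte[] plaintext = "page bytes".getBytes(StandardCharsets.UTF_8);
        // Ask the cipher for the exact ciphertext size instead of guessing it:
        byte[] output = new byte[cipher.getOutputSize(plaintext.length)];
        int written = cipher.doFinal(plaintext, 0, plaintext.length, output, 0);

        // GCM appends a 16-byte authentication tag to the plaintext.
        System.out.println(written == plaintext.length + 16);  // true
    }
}
```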

> [C++] AesEncryptor and AesDecryptor primitives are unsafe
> -
>
> Key: PARQUET-1997
> URL: https://issues.apache.org/jira/browse/PARQUET-1997
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the 
> output buffer but without the output buffer length. The caller is required to 
> guess the expected output length. The functions also return the written 
> output length, but at this point it's too late: data may have been written 
> out of bounds.





[jira] [Commented] (PARQUET-1992) Cannot build from tarball because of git submodules

2021-03-02 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293509#comment-17293509
 ] 

Gidon Gershinsky commented on PARQUET-1992:
---

This contribution was added by [~mayaa]; she knows the subject better than 
I do. Maya, could you address the comments and the question in this Jira?

> Cannot build from tarball because of git submodules
> ---
>
> Key: PARQUET-1992
> URL: https://issues.apache.org/jira/browse/PARQUET-1992
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Priority: Blocker
>
> Because we use git submodules (to get test parquet files) a simple "mvn clean 
> install" fails from the unpacked tarball due to "not a git repository".
> I think we have two options to solve this situation:
> * Include all the required files (even those only used for testing) in the 
> tarball and somehow avoid the git submodule update when executed in a non-git 
> environment
> * Make the downloading of the parquet files and the related tests optional so 
> it won't fail the build from the tarball





[jira] [Created] (PARQUET-1989) Deep verification of encrypted files

2021-02-28 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1989:
-

 Summary: Deep verification of encrypted files
 Key: PARQUET-1989
 URL: https://issues.apache.org/jira/browse/PARQUET-1989
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cli
Reporter: Gidon Gershinsky
Assignee: Maya Anderson
 Fix For: 1.13.0


A tool that verifies the encryption of Parquet files in a given folder. It 
analyzes the footer, and then every module (page headers, pages, column indexes, 
bloom filters), making sure they are encrypted (in the relevant columns), and 
potentially checking the encryption keys.

We'll start with a design doc, open for discussion.





[jira] [Updated] (PARQUET-1939) Fix RemoteKmsClient API ambiguity

2020-11-01 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1939:
--
Summary: Fix RemoteKmsClient API ambiguity  (was: RemoteKmsClient API 
confusion)

> Fix RemoteKmsClient API ambiguity
> -
>
> Key: PARQUET-1939
> URL: https://issues.apache.org/jira/browse/PARQUET-1939
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Users complain that the RemoteKmsClient name can be confusing, since this 
> class covers both local and in-server (remote) key wrapping, while the class 
> actually supports only remote KMS servers. To remove the ambiguity and to make 
> the API simpler, we will rename this class to LocalWrapKmsClient; it will be 
> used only in the rare situations where in-server wrapping is not supported. In 
> all other situations, the basic KmsClient interface will be used directly.





[jira] [Created] (PARQUET-1940) Make KeyEncryptionKey length configurable

2020-11-01 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1940:
-

 Summary: Make KeyEncryptionKey length configurable
 Key: PARQUET-1940
 URL: https://issues.apache.org/jira/browse/PARQUET-1940
 Project: Parquet
  Issue Type: Improvement
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


The KEK length is hardcoded to 128 bits. It should be configurable to any value 
allowed by AES (128, 192 or 256 bits).
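A configurable KEK generator could look like the following sketch; the helper name is hypothetical and only the standard javax.crypto API is used:

```java
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class KekLength {
    // Hypothetical helper: generate a key-encryption key of a configurable,
    // AES-legal length, rejecting anything outside 128/192/256 bits.
    static SecretKey newKek(int bits) throws Exception {
        if (bits != 128 && bits != 192 && bits != 256) {
            throw new IllegalArgumentException(
                "AES key length must be 128, 192 or 256 bits, got " + bits);
        }
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(bits);
        return kg.generateKey();
    }

    public static void main(String[] args) throws Exception {
        // A 256-bit key is 32 bytes of key material.
        System.out.println(newKek(256).getEncoded().length);  // 32
    }
}
```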





[jira] [Created] (PARQUET-1939) RemoteKmsClient API confusion

2020-11-01 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1939:
-

 Summary: RemoteKmsClient API confusion
 Key: PARQUET-1939
 URL: https://issues.apache.org/jira/browse/PARQUET-1939
 Project: Parquet
  Issue Type: Improvement
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Users complain that the RemoteKmsClient name can be confusing, since this class 
covers both local and in-server (remote) key wrapping, while the class actually 
supports only remote KMS servers. To remove the ambiguity and to make the 
API simpler, we will rename this class to LocalWrapKmsClient; it will be used 
only in the rare situations where in-server wrapping is not supported. In all 
other situations, the basic KmsClient interface will be used directly.





[jira] [Created] (PARQUET-1938) Option to get KMS details from key material (in key rotation)

2020-11-01 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1938:
-

 Summary: Option to get KMS details from key material (in key 
rotation)
 Key: PARQUET-1938
 URL: https://issues.apache.org/jira/browse/PARQUET-1938
 Project: Parquet
  Issue Type: Improvement
Reporter: Gidon Gershinsky
Assignee: Maya Anderson


Currently, key rotation uses explicit parameters to get the KMS details. 
Instead, it can extract these details from the key material files, which is 
more convenient for the user. The explicit parameters (if provided) will still 
override these values.





[jira] [Resolved] (PARQUET-1934) Dictionary page is not decrypted in predicate pushdown path

2020-11-01 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1934.
---
Resolution: Not A Problem

Jira opened by mistake. No problems with dictionary decryption.

> Dictionary page is not decrypted in predicate pushdown path
> ---
>
> Key: PARQUET-1934
> URL: https://issues.apache.org/jira/browse/PARQUET-1934
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Predicate pushdown, based on dictionary pages, uses page parsing code that 
> doesn't yet support decryption. We will add a few lines to decrypt the 
> dictionary page header and page (for encrypted columns).





[jira] [Created] (PARQUET-1934) Dictionary page is not decrypted in predicate pushdown path

2020-10-22 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1934:
-

 Summary: Dictionary page is not decrypted in predicate pushdown 
path
 Key: PARQUET-1934
 URL: https://issues.apache.org/jira/browse/PARQUET-1934
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Predicate pushdown, based on dictionary pages, uses page parsing code that 
doesn't yet support decryption. We will add a few lines to decrypt the 
dictionary page header and page (for encrypted columns).





[jira] [Updated] (PARQUET-1373) Encryption key management tools

2020-09-24 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1373:
--
Fix Version/s: (was: encryption-feature-branch)
   1.12.0
Affects Version/s: 1.12.0

> Encryption key management tools 
> 
>
> Key: PARQUET-1373
> URL: https://issues.apache.org/jira/browse/PARQUET-1373
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> Parquet Modular Encryption 
> ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides 
> an API that accepts keys, arbitrary key metadata and key retrieval callbacks, 
> which allows basically any key management policy to be implemented on top of 
> it. This Jira will add tools that implement a set of best-practice elements 
> for key management. This is not an end-to-end key management solution, but 
> rather a set of components that might simplify the design and development of 
> an end-to-end solution.
> This tool set is one of many possible. There is no goal to create a single or 
> “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory 
> interface ([PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) 
> that allows different implementations of encryption key management to be 
> plugged in.





[jira] [Updated] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption

2020-09-24 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1854:
--
  Component/s: parquet-mr
Fix Version/s: (was: encryption-feature-branch)
   1.12.0
Affects Version/s: 1.12.0

> Properties-Driven Interface to Parquet Encryption
> -
>
> Key: PARQUET-1854
> URL: https://issues.apache.org/jira/browse/PARQUET-1854
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: 1.12.0
>
>
> A high-level interface to the Parquet encryption layer, based on configuration 
> properties (table properties, Hadoop configuration, writer/reader options, 
> etc.), will simplify the activation and configuration of data encryption.





[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2020-09-24 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--
  Component/s: parquet-mr
Affects Version/s: 1.12.0

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage, while enabling efficient analytics 
> on the data via reader-side extraction / authentication / decryption of the 
> data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.





[jira] [Updated] (PARQUET-1376) Data obfuscation layer for encryption

2020-08-04 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1376:
--
Description: 
Data obfuscation in sensitive columns - for users without access to column 
encryption keys.
 # Implement on top of [basic Parquet 
encryption|https://github.com/apache/parquet-format/blob/encryption/Encryption.md]
 
 # Built-in support for multiple masking mechanisms, with different trade-offs 
between data utility, leakage, and size/throughput overhead
 # Provide an interface for plugging in a custom masking mechanism
 # Enable storing multiple masked versions of the same column in a file
 # Provide readers with explicit list of column’s masked versions in a file
 # Enable readers to select a masked version of a column
 # Stretch: Implement tools for analysis of file data privacy properties and 
information leakage
 # Stretch: Leverage privacy analysis tools for tuning file data anonymity
 # Optional: Support aggregated obfuscation

  was:
Anonymity layer for hidden columns
 # Different data masking options
 ** per-cell
 ** aggregated (average, etc)
 # Reader notification on data access status
 # Providing readers with a choice of masking options (if available)


> Data obfuscation layer for encryption
> -
>
> Key: PARQUET-1376
> URL: https://issues.apache.org/jira/browse/PARQUET-1376
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Data obfuscation in sensitive columns - for users without access to column 
> encryption keys.
>  # Implement on top of [basic Parquet 
> encryption|https://github.com/apache/parquet-format/blob/encryption/Encryption.md]
>  
>  # Built-in support for multiple masking mechanisms, with different trade-offs 
> between data utility, leakage, and size/throughput overhead
>  # Provide an interface for plugging in a custom masking mechanism
>  # Enable storing multiple masked versions of the same column in a file
>  # Provide readers with explicit list of column’s masked versions in a file
>  # Enable readers to select a masked version of a column
>  # Stretch: Implement tools for analysis of file data privacy properties and 
> information leakage
>  # Stretch: Leverage privacy analysis tools for tuning file data anonymity
>  # Optional: Support aggregated obfuscation





[jira] [Updated] (PARQUET-1891) Encryption-related light fixes

2020-07-29 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1891:
--
Description: 
hadoop/readme.md, travis.yaml, vault sample fixes
 
Summary: Encryption-related light fixes  (was: hadoop/readme.md and 
travis.yaml fixes)

> Encryption-related light fixes
> --
>
> Key: PARQUET-1891
> URL: https://issues.apache.org/jira/browse/PARQUET-1891
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> hadoop/readme.md, travis.yaml, vault sample fixes
>  





[jira] [Created] (PARQUET-1892) CRC comment modification in Thrift

2020-07-29 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1892:
-

 Summary: CRC comment modification in Thrift
 Key: PARQUET-1892
 URL: https://issues.apache.org/jira/browse/PARQUET-1892
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Mention that CRC is calculated after compression and encryption





[jira] [Created] (PARQUET-1891) hadoop/readme.md and travis.yaml fixes

2020-07-29 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1891:
-

 Summary: hadoop/readme.md and travis.yaml fixes
 Key: PARQUET-1891
 URL: https://issues.apache.org/jira/browse/PARQUET-1891
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky








[jira] [Resolved] (PARQUET-1884) Merge encryption branch into master

2020-07-29 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1884.
---
Resolution: Done

> Merge encryption branch into master
> ---
>
> Key: PARQUET-1884
> URL: https://issues.apache.org/jira/browse/PARQUET-1884
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>






[jira] [Created] (PARQUET-1884) Merge encryption branch into master

2020-07-12 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1884:
-

 Summary: Merge encryption branch into master
 Key: PARQUET-1884
 URL: https://issues.apache.org/jira/browse/PARQUET-1884
 Project: Parquet
  Issue Type: Sub-task
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky








[jira] [Resolved] (PARQUET-1568) High level interface to Parquet encryption

2020-07-12 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1568.
---
Resolution: Duplicate

> High level interface to Parquet encryption
> --
>
> Key: PARQUET-1568
> URL: https://issues.apache.org/jira/browse/PARQUET-1568
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Per discussion at the community sync, we're working on the design of a 
> high-level interface to Parquet encryption. A draft doc will be published 
> soon that covers the current proposals from Xinli (cryptodata interface), 
> Ryan (table properties) and Gidon (key tools, Hadoop config), and starts to 
> draft a unified/streamlined design based on them.





[jira] [Resolved] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption

2020-07-12 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky resolved PARQUET-1854.
---
Fix Version/s: encryption-feature-branch
   Resolution: Done

> Properties-Driven Interface to Parquet Encryption
> -
>
> Key: PARQUET-1854
> URL: https://issues.apache.org/jira/browse/PARQUET-1854
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: encryption-feature-branch
>
>
> A high-level interface to the Parquet encryption layer, based on configuration 
> properties (table properties, Hadoop configuration, writer/reader options, 
> etc.), will simplify the activation and configuration of data encryption.





[jira] [Updated] (PARQUET-1373) Encryption key management tools

2020-05-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1373:
--
Description: 
Parquet Modular Encryption 
([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an 
API that accepts keys, arbitrary key metadata and key retrieval callbacks, 
which allows basically any key management policy to be implemented on top of it. 
This Jira will add tools that implement a set of best-practice elements for key 
management. This is not an end-to-end key management solution, but rather a set 
of components that might simplify the design and development of an end-to-end 
solution.

This tool set is one of many possible. There is no goal to create a single or 
“standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory 
interface ([PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) 
that allows different implementations of encryption key management to be 
plugged in.


  was:
Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, 
arbitrary key metadata and key retrieval callbacks - which allows to implement 
basically any key management policy on top of it. This Jira will add tools that 
implement a set of best practice elements for key management. This is not an 
end-to-end key management, but rather a set of components that might simplify 
design and development of an end-to-end solution.

For example, the tools will cover
 * modification of key metadata inside existing Parquet files.
 * support for re-keying that doesn't require modification of Parquet files.

 

Parquet will not mandate a use of these tools. Users will be able to continue 
working with the basic API, to create any custom key management solution that 
addresses their security requirements. If helps, they can also utilize some or 
all of these tools.


> Encryption key management tools 
> 
>
> Key: PARQUET-1373
> URL: https://issues.apache.org/jira/browse/PARQUET-1373
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Parquet Modular Encryption 
> ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides 
> an API that accepts keys, arbitrary key metadata and key retrieval callbacks, 
> which allows basically any key management policy to be implemented on top of 
> it. This Jira will add tools that implement a set of best-practice elements 
> for key management. This is not an end-to-end key management solution, but 
> rather a set of components that might simplify the design and development of 
> an end-to-end solution.
> This tool set is one of many possible. There is no goal to create a single or 
> “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory 
> interface ([PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) 
> that allows different implementations of encryption key management to be 
> plugged in.





[jira] [Updated] (PARQUET-1373) Encryption key management tools

2020-05-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1373:
--
Description: 
Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, 
arbitrary key metadata and key retrieval callbacks, which allows basically any 
key management policy to be implemented on top of it. This Jira will add tools 
that implement a set of best-practice elements for key management. This is not 
an end-to-end key management solution, but rather a set of components that 
might simplify the design and development of an end-to-end solution.

For example, the tools will cover
 * modification of key metadata inside existing Parquet files.
 * support for re-keying that doesn't require modification of Parquet files.

 

Parquet will not mandate the use of these tools. Users will be able to continue 
working with the basic API, to create any custom key management solution that 
addresses their security requirements. If it helps, they can also utilize some 
or all of these tools.

  was:
Parquet Modular Encryption 
([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an 
API that accepts keys, arbitrary key metadata and key retrieval callbacks - 
which allows to implement basically any key management policy on top of it. 
This Jira will add tools that implement a set of best practice elements for key 
management. This is not an end-to-end key management, but rather a set of 
components that might simplify design and development of an end-to-end solution.

For example, the tools will cover
 * modification of key metadata inside existing Parquet files.
 * support for re-keying that doesn't require modification of Parquet files.

 

Parquet will not mandate a use of these tools. Users will be able to continue 
working with the basic API, to create any custom key management solution that 
addresses their security requirements. If helps, they can also utilize some or 
all of these tools.


> Encryption key management tools 
> 
>
> Key: PARQUET-1373
> URL: https://issues.apache.org/jira/browse/PARQUET-1373
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, 
> arbitrary key metadata and key retrieval callbacks, which allows basically 
> any key management policy to be implemented on top of it. This Jira will add 
> tools that implement a set of best-practice elements for key management. This 
> is not an end-to-end key management solution, but rather a set of components 
> that might simplify the design and development of an end-to-end solution.
> For example, the tools will cover
>  * modification of key metadata inside existing Parquet files.
>  * support for re-keying that doesn't require modification of Parquet files.
>  
> Parquet will not mandate the use of these tools. Users will be able to 
> continue working with the basic API, to create any custom key management 
> solution that addresses their security requirements. If it helps, they can 
> also utilize some or all of these tools.





[jira] [Created] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption

2020-05-04 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created PARQUET-1854:
-

 Summary: Properties-Driven Interface to Parquet Encryption
 Key: PARQUET-1854
 URL: https://issues.apache.org/jira/browse/PARQUET-1854
 Project: Parquet
  Issue Type: New Feature
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


A high-level interface to the Parquet encryption layer, based on configuration 
properties (table properties, Hadoop configuration, writer/reader options, 
etc.), will simplify the activation and configuration of data encryption.
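To make the idea concrete, here is a sketch of what such a properties-driven activation could look like, expressed as a dictionary of Hadoop-style configuration options. The property names and the `PropertiesDrivenCryptoFactory` class correspond to what later parquet-mr releases ship, but they are reproduced from memory here and should be verified against your parquet-mr version; `com.example.MyKmsClient` and the key IDs are placeholders.

```python
# Hypothetical Hadoop/Spark configuration activating Parquet encryption
# purely through properties -- no code changes in the writer application.
# Property names assumed from later parquet-mr releases; verify before use.
encryption_conf = {
    # Plugs the high-level, properties-driven factory into the writer path.
    "parquet.crypto.factory.class":
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory",
    # User-supplied KMS client (placeholder class name).
    "parquet.encryption.kms.client.class": "com.example.MyKmsClient",
    # Master key ID used to protect the footer.
    "parquet.encryption.footer.key": "footerKeyId",
    # Per-column master key assignment: "keyID:col,col;keyID:col".
    "parquet.encryption.column.keys": "keyA:ssn,credit_card;keyB:address",
}

# These options would typically be set on the Hadoop Configuration or as
# Spark session conf entries, e.g. spark.conf.set(name, value).
for name, value in encryption_conf.items():
    print(f"{name}={value}")
```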



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1178) Parquet modular encryption

2020-05-03 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098549#comment-17098549
 ] 

Gidon Gershinsky edited comment on PARQUET-1178 at 5/3/20, 7:52 PM:


hard to say what's best at this point. Here's the arrow/parquet-cpp encryption 
[sample|https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc].


was (Author: gershinsky):
hard to say what's best at this point. Here's the arrow/parquet-cpp encryption 
[sample|[https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc]].

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> keeping data fully encrypted in storage while enabling efficient analytics 
> on the data, via reader-side extraction / authentication / decryption of the 
> data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1178) Parquet modular encryption

2020-05-03 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098549#comment-17098549
 ] 

Gidon Gershinsky commented on PARQUET-1178:
---

hard to say what's best at this point. Here's the arrow/parquet-cpp encryption 
[sample|https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc].

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> keeping data fully encrypted in storage while enabling efficient analytics 
> on the data, via reader-side extraction / authentication / decryption of the 
> data subsets required by columnar projection and predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different security 
> and performance requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1568) High level interface to Parquet encryption

2020-05-03 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky reassigned PARQUET-1568:
-

Assignee: Gidon Gershinsky

> High level interface to Parquet encryption
> --
>
> Key: PARQUET-1568
> URL: https://issues.apache.org/jira/browse/PARQUET-1568
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Per discussion at the community sync, we're working on the design of a 
> high-level interface to Parquet encryption. A draft doc will be published 
> soon that covers the current proposals from Xinli (cryptodata interface), 
> Ryan (table properties) and Gidon (key tools, hadoop config), and starts 
> to draft a unified/streamlined design based on them. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

