[jira] [Resolved] (PARQUET-2364) Encrypt all columns option
[ https://issues.apache.org/jira/browse/PARQUET-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky resolved PARQUET-2364.
---------------------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed

> Encrypt all columns option
> --------------------------
>
>                 Key: PARQUET-2364
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2364
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>             Fix For: 1.14.0
>
> The column encryption mode currently encrypts only the explicitly specified
> columns. Other columns stay unencrypted. This Jira will add an option to
> encrypt (and tamper-proof) the other columns with the default footer key.
> Decryption / reading is not affected. The current readers will be able to
> decrypt the new files, as long as they have access to the required keys.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
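The key-selection rule described in the issue can be modeled in a short sketch. This is not parquet-mr code; the class and method names below are invented for illustration. It shows which key protects each column once the "encrypt all columns" option is on: explicitly configured columns keep their own keys, and every remaining column falls back to the footer key instead of staying in plaintext.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the "encrypt all columns" key-selection rule
// (hypothetical names, not actual parquet-mr internals).
class CompleteColumnEncryptionModel {

    /**
     * Returns the key label protecting each schema column.
     * Columns with an explicit key keep it; with encryptAllColumns on,
     * all other columns are encrypted with the footer key instead of
     * being written in plaintext.
     */
    public static Map<String, String> keyPerColumn(
            String[] schemaColumns,
            Map<String, String> explicitColumnKeys,
            String footerKey,
            boolean encryptAllColumns) {
        Map<String, String> result = new HashMap<>();
        for (String column : schemaColumns) {
            if (explicitColumnKeys.containsKey(column)) {
                result.put(column, explicitColumnKeys.get(column));
            } else if (encryptAllColumns) {
                result.put(column, footerKey);   // new behavior (PARQUET-2364)
            } else {
                result.put(column, "PLAINTEXT"); // previous behavior
            }
        }
        return result;
    }
}
```

Note that reading stays symmetric: a reader holding the footer key can decrypt the fallback columns with no configuration change, which is why the issue states that current readers are unaffected.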
[jira] [Resolved] (PARQUET-2370) Crypto factory activation of "all column encryption" mode
[ https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky resolved PARQUET-2370.
---------------------------------------
    Resolution: Fixed

> Crypto factory activation of "all column encryption" mode
> ---------------------------------------------------------
>
>                 Key: PARQUET-2370
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2370
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Major
>             Fix For: 1.14.0
>
> Enable the crypto factory to activate the "encrypt all columns" option
> (https://issues.apache.org/jira/browse/PARQUET-2364). Add a unit test.
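For PropertiesDrivenCryptoFactory users, activation would amount to one more Hadoop property alongside the existing encryption settings. The fragment below is illustrative only: the key values are dummies, and the name of the complete-columns property is an assumption here, not confirmed by this thread; check the parquet-mr documentation for the release you use.

```properties
# Existing properties-driven encryption setup (values are illustrative).
parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
parquet.encryption.kms.client.class=org.apache.parquet.crypto.keytools.mocks.InMemoryKMS
parquet.encryption.footer.key=keyz
parquet.encryption.column.keys=key1a:square_int_column

# Assumed property name for the "encrypt all columns" mode added by
# PARQUET-2364 / PARQUET-2370: columns without an explicit key above
# would then be encrypted with the footer key instead of left in plaintext.
parquet.encryption.complete.columns=true
```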
[jira] [Updated] (PARQUET-2370) Crypto factory activation of "all column encryption" mode
[ https://issues.apache.org/jira/browse/PARQUET-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky updated PARQUET-2370:
--------------------------------------
    Fix Version/s: 1.14.0
[jira] [Created] (PARQUET-2370) Crypto factory activation of "all column encryption" mode
Gidon Gershinsky created PARQUET-2370:
--------------------------------------

             Summary: Crypto factory activation of "all column encryption" mode
                 Key: PARQUET-2370
                 URL: https://issues.apache.org/jira/browse/PARQUET-2370
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky
[jira] [Created] (PARQUET-2364) Encrypt all columns option
Gidon Gershinsky created PARQUET-2364:
--------------------------------------

             Summary: Encrypt all columns option
                 Key: PARQUET-2364
                 URL: https://issues.apache.org/jira/browse/PARQUET-2364
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky
[jira] [Commented] (PARQUET-2223) Parquet Data Masking for Column Encryption
[ https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733538#comment-17733538 ]

Gidon Gershinsky commented on PARQUET-2223:
-------------------------------------------

Yep, I also think so. I'll have a look at the current version of the design document.

> Parquet Data Masking for Column Encryption
> ------------------------------------------
>
>                 Key: PARQUET-2223
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2223
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Jiashen Zhang
>            Priority: Major
>
> h1. Background
> h2. What is Data Masking?
> Data masking is a technique used to protect sensitive data by replacing it
> with modified or obscured values. The purpose of data masking is to ensure
> that sensitive information, such as Personally Identifiable Information (PII),
> remains hidden from unauthorized users while allowing authorized users to
> perform their tasks.
> Here are a few key points about data masking:
> * Protection of Sensitive Data: Data masking helps to safeguard sensitive
> data, such as Social Security numbers, credit card numbers, names, addresses,
> and other personally identifiable information. By applying masking
> techniques, the original values are replaced with fictional or transformed
> data that retains the format and structure but removes any identifiable
> information.
> * Controlled Access: Data masking enables controlled access to sensitive
> data. Authorized users, typically with appropriate permissions, can access
> the unmasked or original data, while unauthorized users or users without the
> necessary permissions will only see the masked data.
> * Various Masking Techniques: There are different masking techniques
> available, depending on the specific data privacy requirements and use cases.
> Some commonly used techniques include:
> ** Nullification: Replacing original data with NULL values.
> ** Randomization: Replacing sensitive data with randomly generated values.
> ** Substitution: Replacing sensitive data with fictional but realistic
> values.
> ** Hashing: Transforming sensitive data into irreversible hashed values.
> ** Redaction: Removing or masking specific parts of sensitive data while
> retaining other non-sensitive information.
> * Compliance and Data Privacy: Data masking is often employed to comply with
> data protection regulations and maintain data privacy. By masking sensitive
> data, we can reduce the risk of data breaches and unauthorized access while
> still allowing legitimate users to perform their tasks.
> * Maintaining Data Consistency: Data masking techniques aim to maintain data
> consistency and integrity by ensuring that masked data retains the original
> data's format, structure, and relationships. This allows applications and
> processes that rely on the data to continue functioning correctly.
> h2. Why do we need it?
> Data masking serves several important purposes and provides numerous
> benefits. Here are some reasons why we need data masking:
> * Data Privacy and Compliance: Data masking helps us comply with data
> privacy regulations such as the General Data Protection Regulation (GDPR) and
> the Health Insurance Portability and Accountability Act (HIPAA). These
> regulations require us to protect sensitive data and ensure that it is only
> accessible to authorized individuals. Data masking enables us to comply with
> these regulations by de-identifying sensitive data.
> * Minimize Data Exposure: By masking sensitive data, we can reduce the risk
> of data breaches and unauthorized access. If a security breach occurs, the
> exposed data will be meaningless to unauthorized users due to the masking.
> This helps protect individuals' privacy and prevents misuse of sensitive
> information.
> * Secure Testing and Development Environments: Data masking is particularly
> useful in creating secure testing and development environments. By masking
> sensitive data, we can use realistic but fictional data for testing,
> analysis, and development activities without exposing real personal or
> sensitive information.
> * Enhanced Data Sharing: Data masking allows us to share data with external
> parties, such as partners or third-party vendors, while protecting sensitive
> information. Masked data can be shared with confidence, as the original
> sensitive values are replaced with transformed or fictional data.
> * Employee Privacy: Data masking helps protect employee privacy by
> obfuscating sensitive employee information, such as social security numbers
> or salary details, in databases or HR systems. This safeguards employees'
> personal data from unauthorized access or internal misuse.
> * Insider Threat Mitigation: Data masking reduces the risk posed by insider
> threats, where authorized individuals intentionally or unintentionally
> misuse the sensitive data they can access.
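Three of the masking techniques listed in the issue can be sketched in a few lines each. This is a minimal illustration, not parquet-mr code; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketches of three masking techniques from the list above
// (illustrative names, not part of parquet-mr).
class MaskingExamples {

    // Nullification: replace the original value with null.
    public static String nullify(String value) {
        return null;
    }

    // Hashing: transform the value into an irreversible SHA-256 digest,
    // rendered as lowercase hex.
    public static String hash(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // Redaction: mask everything except the last four characters.
    public static String redact(String value) {
        if (value.length() <= 4) {
            return value;
        }
        return "*".repeat(value.length() - 4) + value.substring(value.length() - 4);
    }
}
```

For example, `redact("123-45-6789")` yields `"*******6789"`, keeping the format recognizable while hiding the identifying digits, which is the consistency property the issue calls out.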
[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys
[ https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719294#comment-17719294 ]

Gidon Gershinsky commented on PARQUET-2193:
-------------------------------------------

[~Nageswaran] A couple of updates on this. We should be able to skip this verification for encrypted files; a pull request has been sent to parquet-mr. Also, I've tried the new Spark 3.4.0 (as is, no modifications) with the scala test above - no exception was thrown. Probably the updated Spark code bypasses the problematic parquet read path. Could you check if Spark 3.4.0 works for your use case?

> Encrypting only one field in nested field prevents reading of other fields in
> nested field without keys
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-2193
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2193
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Vignesh Nageswaran
>            Priority: Major
>
> Hi Team,
> While exploring parquet encryption, it was found that if a field in a nested
> column is encrypted, and I want to read this parquet directory from other
> applications that do not have the encryption keys, I cannot read the
> remaining fields of the nested column without keys.
> Example:
> {code:java}
> case class nestedItem(ic: Int = 0, sic: Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column: Double, partitionCol: Int, nestedCol: nestedItem)
> {code}
> In the case class {{SquareItem}}, the {{nestedCol}} field is a nested field, and I
> want to encrypt the field {{ic}} within it.
> I also want the footer to be non-encrypted, so that the encrypted parquet
> file can be used by legacy applications.
> Encryption is successful; however, when I query the parquet file using Spark
> 3.3.0 without any parquet encryption configuration set up, I cannot read the
> non-encrypted field {{sic}} of {{nestedCol}}. I was expecting that only the
> {{ic}} field of {{nestedCol}} would not be queryable.
>
> Reproducer:
> Spark 3.3.0, using spark-shell. Downloaded the file
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
> and added it to the spark-jars folder.
> Code to create encrypted data:
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list", "key1a: BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic: Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column: Double, partitionCol: Int, nestedCol: nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, scala.math.pow(i, 2), partitionCol, nestedItem(i, i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys", "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer", true).option("parquet.encryption.footer.key", "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, trying to access the non-encrypted nested field from a
> new spark-shell:
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show()
> {code}
> Since nestedCol.sic is not encrypted, I was expecting results, but I get the
> below error:
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [square_int_column]. Null File Decryptor
>   at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> {code}
[jira] [Created] (PARQUET-2297) Encrypted files should not be checked for delta encoding problem
Gidon Gershinsky created PARQUET-2297:
--------------------------------------

             Summary: Encrypted files should not be checked for delta encoding problem
                 Key: PARQUET-2297
                 URL: https://issues.apache.org/jira/browse/PARQUET-2297
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: 1.13.0
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky
             Fix For: 1.14.0, 1.13.1

The delta encoding problem (https://issues.apache.org/jira/browse/PARQUET-246) was fixed in writers as of parquet-mr 1.8. That fix also added a `checkDeltaByteArrayProblem` method in readers, which runs over all columns and checks for this problem in older files. The check now triggers an unrelated exception when reading encrypted files, in the following situation: trying to read an unencrypted column without having keys for the encrypted columns (see https://issues.apache.org/jira/browse/PARQUET-2193). This happens in Spark with nested columns (files with only regular columns are fine).

Possible solution: don't call the `checkDeltaByteArrayProblem` method for encrypted files, because such files can only be written with parquet-mr 1.12 and newer, where the delta encoding problem is already fixed.
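The proposed solution amounts to a simple guard before the per-column scan. The sketch below uses invented names, not the actual parquet-mr internals; it only captures the reasoning: an encrypted file cannot predate the delta-encoding fix, so the legacy check can be skipped entirely for it.

```java
// Sketch of the proposed guard (hypothetical names, not parquet-mr code):
// encrypted files are produced by parquet-mr >= 1.12, where the delta
// encoding bug (PARQUET-246, fixed in 1.8) cannot occur, so the legacy
// check is unnecessary for them - and skipping it avoids touching
// encrypted column metadata without keys.
class DeltaCheckGuard {

    public static boolean shouldCheckDeltaByteArrayProblem(boolean fileIsEncrypted,
                                                           String writerVersion) {
        if (fileIsEncrypted) {
            return false; // cannot predate the 1.8 fix
        }
        // Only unencrypted files written before parquet-mr 1.8 may carry the bug.
        return isOlderThan(writerVersion, 1, 8);
    }

    // Compares a "major.minor" version string against a threshold.
    static boolean isOlderThan(String version, int major, int minor) {
        String[] parts = version.split("\\.");
        int maj = Integer.parseInt(parts[0]);
        int min = Integer.parseInt(parts[1]);
        return maj < major || (maj == major && min < minor);
    }
}
```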
[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys
[ https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718795#comment-17718795 ]

Gidon Gershinsky commented on PARQUET-2193:
-------------------------------------------

Yep, sorry about the delay. This turned out to be more challenging than I hoped; a fix at the encryption code level would require some changes in the format specification. A rather big deal, and likely unjustified in this case. The immediate trigger is the `checkDeltaByteArrayProblem` verification, added 8 years ago to detect encoding irregularities in older files. For some reason this check is done only on files with nested columns, and not on files with regular columns (at least in Spark). Maybe the right thing today is to remove that verification; I'll check with the community.
[jira] [Assigned] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky reassigned PARQUET-2103:
-----------------------------------------
    Assignee: Gidon Gershinsky

> crypto exception in print toPrettyJSON
> --------------------------------------
>
>                 Key: PARQUET-2103
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2103
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>            Reporter: Gidon Gershinsky
>            Assignee: Gidon Gershinsky
>            Priority: Minor
>
> In debug mode, this code
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
> }
> {code}
> called in {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
> _*in encrypted files with plaintext footer*_ triggers an exception:
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor
>   at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>   at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>   at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
>   at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
>   at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
>   at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) ~[parquet-jackson-1.12.0.jar:1.12.0]
>   at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516) ~[parquet-jackson-1.12.0.jar:1.12.0]
> {code}
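The failure mode above suggests an inexpensive defensive pattern: never let a debug-only pretty-print crash the read path, and fall back to a short placeholder when serialization hits encrypted column metadata without a decryptor. The sketch below illustrates that pattern with invented names; it is not the actual parquet-mr fix.

```java
// Defensive pattern sketch (hypothetical names, not the parquet-mr change):
// a debug log statement should degrade gracefully when JSON serialization
// of the metadata throws, e.g. ParquetCryptoRuntimeException for
// encrypted columns with no file decryptor available.
class SafeDebugPrint {

    // Stand-in for ParquetMetadata.toPrettyJSON(...) in this sketch.
    interface Printer {
        String toPrettyJson();
    }

    public static String safePrettyJson(Printer metadata) {
        try {
            return metadata.toPrettyJson();
        } catch (RuntimeException e) { // e.g. "Null File Decryptor"
            return "<metadata not printable: " + e.getMessage() + ">";
        }
    }
}
```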
[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky updated PARQUET-2103:
--------------------------------------
    Affects Version/s: 1.12.3
[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gidon Gershinsky updated PARQUET-2103:
--------------------------------------
    Priority: Minor  (was: Major)
[jira] [Created] (PARQUET-2208) Add details to nested column encryption config doc and exception text
Gidon Gershinsky created PARQUET-2208: - Summary: Add details to nested column encryption config doc and exception text Key: PARQUET-2208 URL: https://issues.apache.org/jira/browse/PARQUET-2208 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.3 Reporter: Gidon Gershinsky Parquet columnar encryption requires an explicit full path for each column to be encrypted. If a partial path is configured, the thrown exception is not informative enough and doesn't help much in correcting the parameters. The goal is to make the exception print something like: _Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted column [rider] not in file schema column list: [foo] , [rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]_ -- This message was sent by Atlassian Jira (v8.20.10#820010)
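The improved message above can be sketched as plain string/list logic. This is a hypothetical illustration of the proposed exception text, not parquet-mr's actual implementation; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the improved error message proposed in PARQUET-2208:
// when a configured encryption path is not found in the schema, list all
// schema column paths so the user can correct the configuration.
public class EncryptedColumnCheck {
    public static String missingColumnMessage(String configuredPath, List<String> schemaPaths) {
        if (schemaPaths.contains(configuredPath)) {
            return null; // the configured path is a valid full column path
        }
        // Format every known schema path, mirroring the message in the Jira description.
        String columns = schemaPaths.stream()
            .map(p -> "[" + p + "]")
            .collect(Collectors.joining(" , "));
        return "Encrypted column [" + configuredPath + "] not in file schema column list: " + columns;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("foo", "rider.list.element.foo",
            "rider.list.element.bar", "ts", "uuid");
        // A partial path like "rider" now yields an actionable message.
        System.out.println(missingColumnMessage("rider", schema));
    }
}
```

The point of the sketch is that the exception should enumerate the valid full paths, so a user who configured a partial path (e.g. `rider`) can see the leaf paths (e.g. `rider.list.element.foo`) to use instead.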
[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys
[ https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614917#comment-17614917 ] Gidon Gershinsky commented on PARQUET-2193: --- Welcome. From the sound of it, this might require each file to be processed by one thread only (instead of reading a single file with multiple threads); this should be ok in typical use cases, where one thread/executor reads multiple files anyway. But I'll have a deeper look at this.

> Encrypting only one field in nested field prevents reading of other fields in
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Vignesh Nageswaran
> Priority: Major
>
> Hi Team,
> While exploring parquet encryption, we found that if a field in a nested
> column is encrypted, and I want to read this parquet directory from other
> applications that do not have the encryption keys to decrypt it, I cannot
> read the remaining fields of the nested column without keys.
> Example:
> {code:java}
> case class nestedItem(ic: Int = 0, sic: Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column: Double, partitionCol: Int, nestedCol: nestedItem)
> {code}
> In the case class SquareItem, nestedCol is a nested field, and I want to
> encrypt the field ic within it.
> I also want the footer to be non-encrypted, so that the encrypted parquet
> file can be used by legacy applications.
> Encryption is successful. However, when I query the parquet file using Spark
> 3.3.0 without any parquet encryption configuration set up, I cannot read the
> non-encrypted field sic of nestedCol. I was expecting that only the ic field
> of nestedCol would not be queryable.
> Reproducer:
> Spark 3.3.0, using spark-shell.
> Downloaded the file [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar] and added it to the spark jars folder.
> Code to create encrypted data:
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list", "key1a: BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic: Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column: Double, partitionCol: Int, nestedCol: nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, scala.math.pow(i,2), partitionCol, nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys", "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer", true).option("parquet.encryption.footer.key", "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, trying to access the non-encrypted nested field, in a new spark-shell:
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show()
> {code}
> Since nestedCol.sic is not encrypted, I was expecting results, but I get the below error:
> Caused by:
org.apache.parquet.crypto.ParquetCryptoRuntimeException: > [square_int_column]. Null File Decryptor > at > org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) > at > org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348) > at > org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:375) > at >
[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys
[ https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610868#comment-17610868 ] Gidon Gershinsky commented on PARQUET-2193: --- Hmm, looks like this method runs over all columns, projected and not projected: org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191) Please check if setting "parquet.split.files" to "false" solves this problem.

> Encrypting only one field in nested field prevents reading of other fields in
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Vignesh Nageswaran
> Priority: Major
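The comment about checkDeltaByteArrayProblem iterating over all columns suggests one fix direction: only touch the metadata of projected columns. This is a hedged, hypothetical sketch of that idea in plain Java; the method and types are invented for illustration and are not parquet-mr's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: restrict a per-column metadata check to the projected
// columns, so encrypted columns outside the projection are never decrypted
// (avoiding "Null File Decryptor" when the reader has no keys for them).
public class ProjectionAwareCheck {
    public static List<String> columnsToInspect(List<String> allColumns, Set<String> projected) {
        List<String> result = new ArrayList<>();
        for (String column : allColumns) {
            if (projected.contains(column)) {
                result.add(column); // safe: the query actually reads this column
            }
            // non-projected columns (possibly encrypted) are skipped entirely
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("square_int_column", "nestedCol.sic", "nestedCol.ic");
        Set<String> projected = new HashSet<>(Arrays.asList("nestedCol.sic"));
        // Only the plaintext, projected column is inspected.
        System.out.println(columnsToInspect(all, projected));
    }
}
```

With this shape of check, the reproducer's query of `nestedCol.sic` would never force access to the encrypted `square_int_column` metadata.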
[jira] [Commented] (PARQUET-2194) parquet.encryption.plaintext.footer parameter being true, code expects parquet.encryption.footer.key
[ https://issues.apache.org/jira/browse/PARQUET-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610855#comment-17610855 ] Gidon Gershinsky commented on PARQUET-2194: --- The footer key is required also in the plaintext footer mode - it is used to sign the footer: https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#class-propertiesdrivencryptofactory

> parquet.encryption.plaintext.footer parameter being true, code expects
> parquet.encryption.footer.key
>
> Key: PARQUET-2194
> URL: https://issues.apache.org/jira/browse/PARQUET-2194
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Vignesh Nageswaran
> Priority: Major
>
> Hi Team,
> I want the footer of my parquet file to be non-encrypted, so I set
> _parquet.encryption.plaintext.footer_ to {_}true{_}. But when I tried to run
> my code, parquet-mr expects a value for the property
> _parquet.encryption.footer.key_.
> Reproducer:
> Spark 3.3.0
> Download the [file|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar] and place it in the spark jars directory.
> Using spark-shell:
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list", "key1a: BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic: Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column: Double, partitionCol: Int, nestedCol: nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, scala.math.pow(i,2), partitionCol, nestedItem(i,i
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys", "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer", true).parquet(encryptedParquetPath)
> {code}
> I get the below error; my expectation is that if I set the footer to be
> plaintext, no key should be needed for the footer.
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Undefined footer key
> at org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:88)
> at org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
> at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:155)
> at org.apache.spark.sql.execution.datasources.BaseDynamicPartitionDataWriter.renewCurrentWriter(FileFormatDataWriter.scala:298)
> at org.apache.spark.sql.execution.datasources.DynamicPartitionDataSingleWriter.write(FileFormatDataWriter.scala:365)
> at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85)
> at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92)
> at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:331) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:338) > ... 9 more > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
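The behavior discussed in this thread - a footer key is mandatory even in plaintext footer mode, because it signs (rather than encrypts) the footer - can be captured as a small validation sketch. This is a hypothetical illustration of the check; the class and strings are invented, not parquet-mr's actual code.

```java
// Hypothetical sketch of the property validation behind "Undefined footer key":
// even with parquet.encryption.plaintext.footer=true, a footer key must be
// configured, because it is used to sign the plaintext footer for tamper-proofing.
public class FooterKeyCheck {
    public static String validate(boolean plaintextFooter, String footerKeyId) {
        if (footerKeyId == null || footerKeyId.isEmpty()) {
            // Signing the footer still needs a key, so this fails regardless
            // of the plaintext-footer setting.
            return "Undefined footer key";
        }
        return plaintextFooter ? "footer will be plaintext and signed" : "footer will be encrypted";
    }

    public static void main(String[] args) {
        System.out.println(validate(true, null));    // reproduces the reported failure
        System.out.println(validate(true, "keyz"));  // fixed by setting parquet.encryption.footer.key
    }
}
```

In the reproducer above, adding `.option("parquet.encryption.footer.key", "keyz")` to the write resolves the error while keeping the footer plaintext.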
[jira] [Created] (PARQUET-2197) Document uniform encryption
Gidon Gershinsky created PARQUET-2197: - Summary: Document uniform encryption Key: PARQUET-2197 URL: https://issues.apache.org/jira/browse/PARQUET-2197 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.3 Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Document the hadoop parameter for uniform encryption -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605098#comment-17605098 ] Gidon Gershinsky commented on PARQUET-1711: --- [~emkornfield] what do you think about these 3 alternatives? > [parquet-protobuf] stack overflow when work with well known json type > - > > Key: PARQUET-1711 > URL: https://issues.apache.org/jira/browse/PARQUET-1711 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: Lawrence He >Priority: Major > > Writing the following protobuf message as a parquet file is not possible: > {code:java} > syntax = "proto3"; > import "google/protobuf/struct.proto"; > package test; > option java_outer_classname = "CustomMessage"; > message TestMessage { > map data = 1; > } {code} > Protobuf introduced "well known json types" such as > [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue] > to work around json schema conversion. > However, writing the above message traps the parquet writer in an infinite loop, due > to the "general type" support in protobuf. The current implementation keeps > referencing the 6 possible types defined in protobuf (null, bool, number, string, > struct, list) and enters an infinite loop when referencing "struct". 
> {code:java}
> java.lang.StackOverflowError
> at java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418)
> at java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410)
> at java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
> at java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
> at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
> at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
> at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
> at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
> at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
> at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
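The recursion in the trace above (convertFields → addField → convertFields on a self-referential type like google.protobuf.Struct) can be terminated by bounding the conversion depth. The following is a hedged, simplified model of that idea, not parquet-mr's actual converter; the class, method, and bound are invented for illustration.

```java
// Hypothetical sketch: bound the schema-conversion recursion so a
// self-referential protobuf type terminates instead of overflowing the stack.
// Models google.protobuf.Value: 6 variants, one of which ("struct") recurses.
public class DepthLimitedConversion {
    // Returns the number of fields "emitted" for one level of the recursive
    // type, truncating the recursive branch at maxDepth.
    public static int convertValue(int depth, int maxDepth) {
        if (depth >= maxDepth) {
            return 0; // truncate instead of recursing forever
        }
        int fields = 5; // null, bool, number, string, list: non-recursive variants
        fields += 1 + convertValue(depth + 1, maxDepth); // "struct" recurses one level deeper
        return fields;
    }

    public static void main(String[] args) {
        // Bounded recursion terminates; unbounded would be a StackOverflowError.
        System.out.println(convertValue(0, 3));
    }
}
```

The linked pull requests explore variants of this direction (configurable recursion depth) among other alternatives.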
[jira] [Comment Edited] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602127#comment-17602127 ] Gidon Gershinsky edited comment on PARQUET-1711 at 9/9/22 5:45 AM: --- Hi to all on this Jira. Looks like we have a number of alternative solutions to this problem today:
[https://github.com/apache/parquet-mr/pull/995]
[https://github.com/apache/parquet-mr/pull/445]
[https://github.com/apache/parquet-mr/pull/988]
Can you take a look and provide your opinion on them?

was (Author: gershinsky): Hi to all on this Jira. Looks like we have a number of alternative solutions to this problem today,

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.10.1
> Reporter: Lawrence He
> Priority: Major

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602127#comment-17602127 ] Gidon Gershinsky commented on PARQUET-1711: --- Hi to all on this Jira. Looks like we have a number of alternative solutions to this problem today,

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.10.1
> Reporter: Lawrence He
> Priority: Major

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2040) Uniform encryption
[ https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2040. --- Resolution: Fixed > Uniform encryption > -- > > Key: PARQUET-2040 > URL: https://issues.apache.org/jira/browse/PARQUET-2040 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > > The PME low-level spec supports using the same encryption key for all columns, > which is useful in a number of scenarios. However, this feature is not yet > exposed in the high-level API, because its misuse can break the NIST limit on > the number of AES GCM operations with one key. We will develop limit-enforcing > code and provide an API for uniform table encryption. -- This message was sent by Atlassian Jira (v8.20.10#820010)
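The "limit-enforcing code" idea described in PARQUET-2040 can be sketched as a per-key operation budget: count AES-GCM encryption operations and refuse (or rotate keys) before a configured bound is exceeded. NIST SP 800-38D motivates bounding the number of GCM invocations per key; the class below is a hypothetical illustration with an arbitrary bound, not parquet-mr's actual implementation or the spec's exact figure.

```java
// Hypothetical sketch of per-key AES-GCM operation budgeting for uniform
// (single-key) encryption: once the budget is exhausted, the caller must
// rotate to a fresh key instead of continuing with the same one.
public class GcmOperationBudget {
    private final long maxOperations;
    private long used = 0;

    public GcmOperationBudget(long maxOperations) {
        this.maxOperations = maxOperations;
    }

    // Returns true if one more encryption with this key is allowed.
    public boolean tryAcquire() {
        if (used >= maxOperations) {
            return false; // budget exhausted: rotate the key
        }
        used++;
        return true;
    }

    public static void main(String[] args) {
        GcmOperationBudget budget = new GcmOperationBudget(3); // illustrative bound
        int allowed = 0;
        for (int i = 0; i < 5; i++) {
            if (budget.tryAcquire()) allowed++;
        }
        System.out.println(allowed + " of 5 operations allowed");
    }
}
```

Enforcing such a budget is what makes it safe to expose single-key (uniform) encryption in a high-level API.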
[jira] [Resolved] (PARQUET-2136) File writer construction with encryptor
[ https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2136. --- Resolution: Fixed > File writer construction with encryptor > --- > > Key: PARQUET-2136 > URL: https://issues.apache.org/jira/browse/PARQUET-2136 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > > Currently, a file writer object can be constructed with encryption > properties. We need an additional constructor, that can accept an encryptor > instead, in order to support lazy materialization of parquet file writers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2120. --- Resolution: Fixed

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.12.2
> Reporter: Willi Raschkowski
> Priority: Minor
> Fix For: 1.12.3
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
> at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
> at org.apache.parquet.cli.Main.run(Main.java:155)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.parquet.cli.Main.main(Main.java:185)
>
> $ parquet meta a-b-c.snappy.parquet
> ...
> Row group 0:  count: 1  46.00 B records  start: 4  total: 46 B
>
> type   encodings  count  avg size  nulls  min / max
> col    BINARY  S _        1  46.00 B   0      "a" / "a"
>
> Row group 1:  count: 200  0.34 B records  start: 50  total: 69 B
>
> type   encodings  count  avg size  nulls  min / max
> col    BINARY  S _ R    200  0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no
> dictionary encoding. But for files that mix pages with and without dictionary
> encoding (like above), the command will fail before getting to pages that
> actually have dictionaries. 
> The problem is that [this > line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76] > assumes {{readDictionaryPage}} always returns a page and doesn't handle when > it does not, i.e. when it returns {{null}}. -- This message was sent by Atlassian Jira (v8.20.7#820007)
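The null-handling fix described above can be sketched independently of parquet-cli. The reader interface and method names below are stand-ins invented for illustration; only the behavior (treating a null dictionary page as "no dictionary" instead of dereferencing it) reflects the reported fix.

```java
// Hypothetical sketch of the PARQUET-2120 fix: handle a null return from the
// dictionary-page read instead of calling a method on it (the NPE above).
interface PageReader {
    // Stand-in for a dictionary-page read; returns null when the column chunk
    // has no dictionary page (e.g. the page fell back to plain encoding).
    String readDictionaryPage(String column);
}

public class ShowDictionarySketch {
    public static String describe(PageReader reader, String column) {
        String page = reader.readDictionaryPage(column);
        if (page == null) {
            // Previously this fell through to page.getEncoding() and threw an NPE.
            return column + ": no dictionary page";
        }
        return column + ": dictionary " + page;
    }

    public static void main(String[] args) {
        // Simulate a file where only "col2" has a dictionary page.
        PageReader reader = column -> column.equals("col2") ? "PLAIN_DICTIONARY" : null;
        System.out.println(describe(reader, "col"));
        System.out.println(describe(reader, "col2"));
    }
}
```

With the guard in place, a mixed file (some pages dictionary-encoded, some not) can be fully inspected instead of failing on the first non-dictionary page.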
[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556828#comment-17556828 ] Gidon Gershinsky commented on PARQUET-2120: --- [~shangxinli] and the Parquet community, can you assign this Jira to [~rshkv]?

> parquet-cli dictionary command fails on pages without dictionary encoding
> -
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.12.2
> Reporter: Willi Raschkowski
> Priority: Minor
> Fix For: 1.12.3

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (PARQUET-2148) Enable uniform decryption with plaintext footer
[ https://issues.apache.org/jira/browse/PARQUET-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2148. --- Resolution: Fixed > Enable uniform decryption with plaintext footer > --- > > Key: PARQUET-2148 > URL: https://issues.apache.org/jira/browse/PARQUET-2148 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > > Currently, uniform decryption is not enabled in the plaintext footer mode - > for no good reason. Column metadata is available, we just need to decrypt and > use it. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate
[ https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2144. --- Resolution: Fixed > Fix ColumnIndexBuilder for notIn predicate > -- > > Key: PARQUET-2144 > URL: https://issues.apache.org/jira/browse/PARQUET-2144 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Huaxin Gao >Priority: Major > Fix For: 1.12.3 > > > Column Index is not built correctly for notIn predicate. Need to fix the bug. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (PARQUET-2145) Release 1.12.3
[ https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-2145. --- Resolution: Fixed > Release 1.12.3 > -- > > Key: PARQUET-2145 > URL: https://issues.apache.org/jira/browse/PARQUET-2145 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (PARQUET-2145) Release 1.12.3
[ https://issues.apache.org/jira/browse/PARQUET-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556825#comment-17556825 ] Gidon Gershinsky commented on PARQUET-2145: --- This version is already released: [https://parquet.incubator.apache.org/blog/2022/05/26/1.12.3/]. Let's indeed close this Jira. > Release 1.12.3 > -- > > Key: PARQUET-2145 > URL: https://issues.apache.org/jira/browse/PARQUET-2145 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553425#comment-17553425 ] Gidon Gershinsky commented on PARQUET-2117: --- [~sha...@uber.com] Could you add [~prakharjain09] to the Parquet contributors? > Add rowPosition API in parquet record readers > - > > Key: PARQUET-2117 > URL: https://issues.apache.org/jira/browse/PARQUET-2117 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Prakhar Jain >Priority: Major > Fix For: 1.12.3 > > > Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read > a parquet file in columnar fashion or record-by-record. > It would be great to extend them to also support a rowPosition API which can > tell the position of the current record in the parquet file. > The rowPosition can be used as a unique row identifier to mark a row. This > can be useful to create an index (e.g. B+ tree) over a parquet file/parquet > table (e.g. Spark/Hive). > There are multiple projects in the parquet ecosystem which can benefit from > such functionality: > # Apache Iceberg needs this functionality. It has this implementation > already as it relies on low-level parquet APIs - > [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171], > > [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37] > # Apache Spark can use this functionality - SPARK-37980 -- This message was sent by Atlassian Jira (v8.20.7#820007)
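The rowPosition idea above boils down to simple arithmetic: a record's absolute position is the number of rows in all earlier row groups plus its index within the current group. A hypothetical sketch (the class and method names are made up for illustration, not the parquet-mr API):

```java
// Illustrative sketch of the rowPosition concept, not parquet-mr code:
// the absolute position of a record = rows in earlier row groups + its
// index within the current group, giving a unique per-file row id.
public class RowPositionDemo {

    // Returns the absolute position of the first record of each row group.
    static long[] firstPositions(int[] rowGroupCounts) {
        long[] positions = new long[rowGroupCounts.length];
        long rowsBefore = 0;
        for (int g = 0; g < rowGroupCounts.length; g++) {
            positions[g] = rowsBefore;        // rowPosition of record 0 in group g
            rowsBefore += rowGroupCounts[g];  // advance past this group's rows
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical row-group sizes; a real reader would take them from the footer.
        long[] p = firstPositions(new int[]{1, 200, 50});
        System.out.println(java.util.Arrays.toString(p));  // [0, 1, 201]
    }
}
```

Because the position is computed while scanning, exposing it costs nothing extra, which is why it makes a cheap unique row identifier for index structures like the B+ tree mentioned above.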
[jira] [Updated] (PARQUET-2101) Fix wrong descriptions about the default block size
[ https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2101: -- Fix Version/s: 1.12.3 > Fix wrong descriptions about the default block size > --- > > Key: PARQUET-2101 > URL: https://issues.apache.org/jira/browse/PARQUET-2101 > Project: Parquet > Issue Type: Bug > Components: parquet-avro, parquet-mr, parquet-protobuf >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Trivial > Fix For: 1.12.3 > > > https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90 > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240 > https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80 > These javadocs say the default block size is 50 MB but it's actually 128MB. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2081) Encryption translation tool - Parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2081: -- Fix Version/s: 1.12.3 (was: 1.13.0) > Encryption translation tool - Parquet-hadoop > > > Key: PARQUET-2081 > URL: https://issues.apache.org/jira/browse/PARQUET-2081 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.3 > > > This implements the core part of the Encryption translation tool in > parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to > integrate with key tools for encryption properties. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2102) Typo in ColumnIndexBase toString
[ https://issues.apache.org/jira/browse/PARQUET-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2102: -- Fix Version/s: 1.12.3 > Typo in ColumnIndexBase toString > > > Key: PARQUET-2102 > URL: https://issues.apache.org/jira/browse/PARQUET-2102 > Project: Parquet > Issue Type: Bug >Reporter: Ryan Rupp >Assignee: Ryan Rupp >Priority: Trivial > Fix For: 1.12.3 > > > Trivial thing but noticed [here|https://github.com/trinodb/trino/issues/9890] > since ColumnIndexBase.toString() was used in a wrapped exception message - > "boundary" has a typo (boudary). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2040) Uniform encryption
[ https://issues.apache.org/jira/browse/PARQUET-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2040: -- Fix Version/s: 1.12.3 > Uniform encryption > -- > > Key: PARQUET-2040 > URL: https://issues.apache.org/jira/browse/PARQUET-2040 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > > PME low-level spec supports using the same encryption key for all columns, > which is useful in a number of scenarios. However, this feature is not > exposed yet in the high-level API, because its misuse can break the NIST > limit on the number of AES GCM operations with one key. We will develop a > limit-enforcing code and provide an API for uniform table encryption. -- This message was sent by Atlassian Jira (v8.20.7#820007)
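The limit-enforcing code mentioned above can be sketched in a few lines. This is a hedged illustration only: the class name, the tiny demo bound, and the rotation policy are assumptions, and the actual parquet-mr implementation and the concrete NIST bound it enforces may differ.

```java
// Hypothetical sketch of limit enforcement for uniform encryption, NOT the
// parquet-mr implementation: count AES-GCM operations per key and fail once
// a configured maximum is exceeded, so one shared key cannot silently pass
// the NIST limit on operations per key.
public class KeyUsageLimiter {
    private final long maxOperations;
    private long operations = 0;

    public KeyUsageLimiter(long maxOperations) {
        this.maxOperations = maxOperations;
    }

    // Call once per encrypt operation; throws when the key is exhausted.
    public void recordOperation() {
        if (++operations > maxOperations) {
            throw new IllegalStateException("key exhausted after " + maxOperations
                + " operations: rotate the encryption key");
        }
    }

    public static void main(String[] args) {
        KeyUsageLimiter limiter = new KeyUsageLimiter(3);  // tiny limit for the demo
        int completed = 0;
        try {
            for (int i = 0; i < 5; i++) {
                limiter.recordOperation();
                completed++;
            }
        } catch (IllegalStateException e) {
            System.out.println("stopped after " + completed + " operations");
        }
    }
}
```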
[jira] [Updated] (PARQUET-2076) Improve Travis CI build Performance
[ https://issues.apache.org/jira/browse/PARQUET-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2076: -- Fix Version/s: 1.12.3 > Improve Travis CI build Performance > --- > > Key: PARQUET-2076 > URL: https://issues.apache.org/jira/browse/PARQUET-2076 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Chen Zhang >Priority: Trivial > Fix For: 1.12.3 > > > According to [Common Build Problems - Travis CI > (travis-ci.com)|https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received], > we should carefully use travis_wait, as it may make the build unstable and > extend the build time. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2107) Travis failures
[ https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2107: -- Fix Version/s: 1.12.3 > Travis failures > --- > > Key: PARQUET-2107 > URL: https://issues.apache.org/jira/browse/PARQUET-2107 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.12.3 > > > There have been Travis failures in our PRs for a while. See e.g. > https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or > https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
[ https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2106: -- Fix Version/s: 1.12.3 > BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path > --- > > Key: PARQUET-2106 > URL: https://issues.apache.org/jira/browse/PARQUET-2106 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Major > Fix For: 1.12.3 > > Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, > profile_48449_alloc_1638494450_sort_by.html > > > *Background* > While writing out large Parquet tables using Spark, we've noticed that > BinaryComparator is the source of substantial churn of extremely short-lived > `HeapByteBuffer` objects – It's taking up to *16%* of total amount of > allocations in our benchmarks, putting substantial pressure on a Garbage > Collector: > !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521! > [^profile_48449_alloc_1638494450_sort_by.html] > > *Proposal* > We're proposing to adjust lexicographical comparison (at least) to avoid > doing any allocations, since this code lies on the hot-path of every Parquet > write, therefore causing substantial churn amplification. > > > -- This message was sent by Atlassian Jira (v8.20.7#820007)
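The allocation-free adjustment proposed above boils down to comparing the backing byte arrays in place instead of wrapping each one in a fresh {{HeapByteBuffer}} per call. A minimal sketch of that idea (names are illustrative; parquet-mr's actual comparator also has to handle ByteBuffer-backed values and multiple Binary implementations):

```java
// Sketch of an allocation-free unsigned lexicographic comparison of two byte
// arrays, the kind of change proposed for BinaryComparator: compare bytes in
// place, with zero allocations on the hot path.
public class LexCompareDemo {

    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            // mask to compare as unsigned 0..255, matching byte-wise ordering
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;  // shorter prefix sorts first
    }

    public static void main(String[] args) {
        System.out.println(compareUnsigned(new byte[]{1, 2}, new byte[]{1, 3}) < 0);          // true
        System.out.println(compareUnsigned(new byte[]{(byte) 0xFF}, new byte[]{1}) > 0);      // true: unsigned order
    }
}
```

The unsigned masking matters: a signed byte comparison would order {{0xFF}} before {{0x01}}, which disagrees with the byte-wise ordering a wrapped ByteBuffer comparison produces.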
[jira] [Updated] (PARQUET-2105) Refactor the test code of creating the test file
[ https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2105: -- Fix Version/s: 1.12.3 > Refactor the test code of creating the test file > - > > Key: PARQUET-2105 > URL: https://issues.apache.org/jira/browse/PARQUET-2105 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.3 > > > In the tests, there are many places that need to create a test parquet file > with different settings. Currently, each test just writes its own file-creation code. > It would be better to have a test file builder for that. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2112) Fix typo in MessageColumnIO
[ https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2112: -- Fix Version/s: 1.12.3 (was: 1.13.0) > Fix typo in MessageColumnIO > --- > > Key: PARQUET-2112 > URL: https://issues.apache.org/jira/browse/PARQUET-2112 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.3 > > > Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes' -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2128) Bump Thrift to 0.16.0
[ https://issues.apache.org/jira/browse/PARQUET-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2128: -- Fix Version/s: 1.12.3 > Bump Thrift to 0.16.0 > - > > Key: PARQUET-2128 > URL: https://issues.apache.org/jira/browse/PARQUET-2128 > Project: Parquet > Issue Type: Improvement >Reporter: Vinoo Ganesh >Assignee: Vinoo Ganesh >Priority: Minor > Fix For: 1.12.3 > > > Thrift 0.16.0 has been released > https://github.com/apache/thrift/releases/tag/v0.16.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2120: -- Fix Version/s: 1.12.3 > parquet-cli dictionary command fails on pages without dictionary encoding > - > > Key: PARQUET-2120 > URL: https://issues.apache.org/jira/browse/PARQUET-2120 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.12.2 >Reporter: Willi Raschkowski >Priority: Minor > Fix For: 1.12.3 > > > parquet-cli's {{dictionary}} command fails with an NPE if a page does not > have dictionary encoding: > {code} > $ parquet dictionary --column col a-b-c.snappy.parquet > Unknown error > java.lang.NullPointerException: Cannot invoke > "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" > is null > at > org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78) > at org.apache.parquet.cli.Main.run(Main.java:155) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.parquet.cli.Main.main(Main.java:185) > $ parquet meta a-b-c.snappy.parquet > ... > Row group 0: count: 1 46.00 B records start: 4 total: 46 B > > type encodings count avg size nulls min / max > col BINARY S _ 1 46.00 B 0 "a" / "a" > Row group 1: count: 200 0.34 B records start: 50 total: 69 B > > type encodings count avg size nulls min / max > col BINARY S _ R 200 0.34 B 0 "b" / "c" > {code} > (Note the missing {{R}} / dictionary encoding on that first page.) > Someone familiar with Parquet might guess from the NPE that there's no > dictionary encoding. But for files that mix pages with and without dictionary > encoding (like above), the command will fail before getting to pages that > actually have dictionaries. 
> The problem is that [this > line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76] > assumes {{readDictionaryPage}} always returns a page and doesn't handle when > it does not, i.e. when it returns {{null}}. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2129) Add uncompressedSize to "meta" output
[ https://issues.apache.org/jira/browse/PARQUET-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2129: -- Fix Version/s: 1.12.3 > Add uncompressedSize to "meta" output > - > > Key: PARQUET-2129 > URL: https://issues.apache.org/jira/browse/PARQUET-2129 > Project: Parquet > Issue Type: Improvement >Reporter: Vinoo Ganesh >Assignee: Vinoo Ganesh >Priority: Minor > Fix For: 1.12.3 > > > The `uncompressedSize` is currently not printed in the output of the parquet > meta command. This PR adds the uncompressedSize into the output. > This was also reported by Deepak Gangwar. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2121: -- Fix Version/s: 1.12.3 > Remove descriptions for the removed modules > --- > > Key: PARQUET-2121 > URL: https://issues.apache.org/jira/browse/PARQUET-2121 > Project: Parquet > Issue Type: Improvement >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Minor > Fix For: 1.12.3 > > > PARQUET-2020 removed some deprecated modules, but the related descriptions > still remain in some documents. They should be removed since their existence > is misleading. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2136) File writer construction with encryptor
[ https://issues.apache.org/jira/browse/PARQUET-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2136: -- Fix Version/s: 1.12.3 > File writer construction with encryptor > --- > > Key: PARQUET-2136 > URL: https://issues.apache.org/jira/browse/PARQUET-2136 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.3 > > > Currently, a file writer object can be constructed with encryption > properties. We need an additional constructor, that can accept an encryptor > instead, in order to support lazy materialization of parquet file writers. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2144) Fix ColumnIndexBuilder for notIn predicate
[ https://issues.apache.org/jira/browse/PARQUET-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2144: -- Fix Version/s: 1.12.3 > Fix ColumnIndexBuilder for notIn predicate > -- > > Key: PARQUET-2144 > URL: https://issues.apache.org/jira/browse/PARQUET-2144 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Huaxin Gao >Priority: Major > Fix For: 1.12.3 > > > Column Index is not built correctly for notIn predicate. Need to fix the bug. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar
[ https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2127: -- Fix Version/s: 1.12.3 > Security risk in latest parquet-jackson-1.12.2.jar > -- > > Key: PARQUET-2127 > URL: https://issues.apache.org/jira/browse/PARQUET-2127 > Project: Parquet > Issue Type: Improvement >Reporter: phoebe chen >Priority: Major > Fix For: 1.12.3 > > > The embedded jackson-databind 2.11.4 has a security risk of possible DoS when using JDK > serialization to serialize JsonNode > ([https://github.com/FasterXML/jackson-databind/issues/3328] ); upgrading to > 2.13.1 fixes this. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (PARQUET-2148) Enable uniform decryption with plaintext footer
Gidon Gershinsky created PARQUET-2148: - Summary: Enable uniform decryption with plaintext footer Key: PARQUET-2148 URL: https://issues.apache.org/jira/browse/PARQUET-2148 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Fix For: 1.12.3 Currently, uniform decryption is not enabled in the plaintext footer mode - for no good reason. Column metadata is available, we just need to decrypt and use it. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (PARQUET-2145) Release 1.12.3
Gidon Gershinsky created PARQUET-2145: - Summary: Release 1.12.3 Key: PARQUET-2145 URL: https://issues.apache.org/jira/browse/PARQUET-2145 Project: Parquet Issue Type: Task Components: parquet-mr Reporter: Gidon Gershinsky Fix For: 1.12.3 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher
[ https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526997#comment-17526997 ] Gidon Gershinsky commented on PARQUET-2098: --- [~theosib-amazon] I got ~half of this (code; not the unit tests yet). But in the meantime, it became unclear if we need this functionality (in the upcoming release). Do you have a use case for it? > Add more methods into interface of BlockCipher > -- > > Key: PARQUET-2098 > URL: https://issues.apache.org/jira/browse/PARQUET-2098 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > Currently the BlockCipher interface has methods that don't let the caller specify > length/offset. In some use cases like Presto, it is needed to pass in a byte > array where the data to be encrypted occupies only part of the array. So > we need to add new methods like the one below for decrypt. Similar > methods might be needed for encrypt. > byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, > byte[] aad); -- This message was sent by Atlassian Jira (v8.20.7#820007)
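The value of an offset/length decrypt variant like the one proposed above can be shown with the JDK's own crypto API, whose {{Cipher.doFinal(input, offset, len)}} overload decrypts a region of a larger array without copying it first. This is a self-contained sketch of the access pattern, not the parquet-mr BlockCipher implementation; the class name and buffer layout are made up for illustration:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Illustrative sketch only: decrypt AES-GCM ciphertext that occupies part of
// a larger buffer, using the (input, offset, len) overload so no intermediate
// copy of the ciphertext region is needed.
public class OffsetDecryptDemo {

    static boolean roundTrip(String text) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] plaintext = text.getBytes(StandardCharsets.UTF_8);
        byte[] ciphertext = enc.doFinal(plaintext);

        // Embed the ciphertext in a larger array, as a reader such as Presto
        // might hold a whole column chunk in one buffer.
        int offset = 7;
        byte[] buffer = new byte[offset + ciphertext.length + 5];
        System.arraycopy(ciphertext, 0, buffer, offset, ciphertext.length);

        // Decrypt straight from the enclosing buffer via (input, offset, len).
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] decrypted = dec.doFinal(buffer, offset, ciphertext.length);
        return Arrays.equals(decrypted, plaintext);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("partially occupied buffer"));  // prints "true"
    }
}
```

Without such an overload, the caller must first copy the occupied region into a fresh array, which is exactly the overhead the proposed BlockCipher methods avoid.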
[jira] [Created] (PARQUET-2136) File writer construction with encryptor
Gidon Gershinsky created PARQUET-2136: - Summary: File writer construction with encryptor Key: PARQUET-2136 URL: https://issues.apache.org/jira/browse/PARQUET-2136 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.2 Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Currently, a file writer object can be constructed with encryption properties. We need an additional constructor, that can accept an encryptor instead, in order to support lazy materialization of parquet file writers. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher
[ https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483575#comment-17483575 ] Gidon Gershinsky commented on PARQUET-2098: --- Sure, I can take this one. > Add more methods into interface of BlockCipher > -- > > Key: PARQUET-2098 > URL: https://issues.apache.org/jira/browse/PARQUET-2098 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > Currently the BlockCipher interface has methods that don't let the caller specify > length/offset. In some use cases like Presto, it is needed to pass in a byte > array where the data to be encrypted occupies only part of the array. So > we need to add new methods like the one below for decrypt. Similar > methods might be needed for encrypt. > byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, > byte[] aad); -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448596#comment-17448596 ] Gidon Gershinsky commented on PARQUET-2103: --- [~gszadovszky] thanks for pointing in the right direction. We can check that a file is encrypted, and then skip printing its column metadata - this solves the problem at hand. We will still be able to print the file-wide metadata (as opposed to the per-column metadata, which is encrypted with column-specific keys). I'll start working on a patch. > crypto exception in print toPrettyJSON > -- > > Key: PARQUET-2103 > URL: https://issues.apache.org/jira/browse/PARQUET-2103 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0, 1.12.1, 1.12.2 >Reporter: Gidon Gershinsky >Priority: Major > > In debug mode, this code > {{if (LOG.isDebugEnabled()) {}} > {{ LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}} > {{}}} > called in > {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}} > > _*in encrypted files with plaintext footer*_ > triggers an exception: > > {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor }} > {{ at > org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at > org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:?]}} > {{ at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:?]}} > {{ at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:?]}} > {{ at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at >
[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2103: -- Description: In debug mode, this code {{if (LOG.isDebugEnabled()) {}} {{ LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}} {{}}} called in {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}} _*in encrypted files with plaintext footer*_ triggers an exception: {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor }} {{ at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0jar:1.12.0]}} {{ at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0jar:1.12.0]}} {{ at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]}} {{ at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]}} {{ at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]}} {{ at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) 
~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at 
shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059) ~[parquet-jackson-1.12.0jar:1.12.0]}} {{ at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68) ~[parquet-hadoop-1.12.0jar:1.12.0]}} {{ ... 23 more}} was: In debug mode, this code {{if (LOG.isDebugEnabled()) {}} {{ LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}} {{}}} called in {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}} _*for unencrypted files*_
[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447372#comment-17447372 ] Gidon Gershinsky commented on PARQUET-2103: --- [~gszadovszky] [~sha...@uber.com] will appreciate getting your advice on the solution options (here or at the sync call). This seems to be a print with a blind reflection loop that calls all nested classes / methods in an object. Since v1.12.0, there is an EncryptedColumnChunkMetaData class inside the ColumnChunkMetaData. Creating its instance and calling the "decrypt" method is not a good idea, for either unencrypted or encrypted files. > crypto exception in print toPrettyJSON > -- > > Key: PARQUET-2103 > URL: https://issues.apache.org/jira/browse/PARQUET-2103 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0, 1.12.1, 1.12.2 >Reporter: Gidon Gershinsky >Priority: Major > > In debug mode, this code > {{if (LOG.isDebugEnabled()) {}} > {{ LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}} > {{}}} > called in > {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}} > > _*for unencrypted files*_ > triggers an exception: > > {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor }} > {{ at > org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at > org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) > ~[parquet-hadoop-1.12.0jar:1.12.0]}} > {{ at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:?]}} > {{ at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:?]}} > {{ at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:?]}} > {{ at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > 
shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at > shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) > ~[parquet-jackson-1.12.0jar:1.12.0]}} > {{ at >
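The fix direction discussed above can be sketched as a guard that makes "decrypt if needed" a no-op for plain column chunks, so a reflection-driven printer never hits the "Null File Decryptor" path on unencrypted files. This is a simplified illustration with hypothetical class shapes, not the actual parquet-mr code:

```java
// Hypothetical, simplified sketch (not the real parquet-mr classes): the bug
// pattern is a decrypt guard that throws even when no decryption is needed.
public class DecryptGuardSketch {
    static class ColumnChunkMeta {
        final boolean encrypted;
        final Object fileDecryptor;   // null for unencrypted files
        boolean decrypted = false;
        ColumnChunkMeta(boolean encrypted, Object fileDecryptor) {
            this.encrypted = encrypted;
            this.fileDecryptor = fileDecryptor;
        }
        // Buggy variant: throws for plain chunks because it checks the decryptor first.
        void decryptIfNeededBuggy() {
            if (fileDecryptor == null) throw new IllegalStateException("Null File Decryptor");
            decrypted = true;
        }
        // Guarded variant: a plain (unencrypted) chunk is a no-op; only an
        // encrypted chunk without a decryptor is a genuine error.
        void decryptIfNeededFixed() {
            if (!encrypted) return;
            if (fileDecryptor == null) throw new IllegalStateException("Null File Decryptor");
            decrypted = true;
        }
    }
    public static boolean safeOnPlainFile() {
        ColumnChunkMeta plain = new ColumnChunkMeta(false, null);
        plain.decryptIfNeededFixed();  // must not throw for unencrypted data
        return true;
    }
    public static void main(String[] args) {
        System.out.println("plain chunk handled: " + safeOnPlainFile());
    }
}
```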
[jira] [Updated] (PARQUET-2103) crypto exception in print toPrettyJSON
[ https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2103: -- Description: In debug mode, this code

  if (LOG.isDebugEnabled()) {
    LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
  }

called in org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata() _*for unencrypted files*_ triggers an exception:

Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor
 at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
 at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
 at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
 at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 ... 23 more

was: In debug mode, this code

  if (LOG.isDebugEnabled()) {
    LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
  }

called in org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata() _*for unencrypted files*_ triggers an exception: Caused by:
[jira] [Created] (PARQUET-2103) crypto exception in print toPrettyJSON
Gidon Gershinsky created PARQUET-2103: - Summary: crypto exception in print toPrettyJSON Key: PARQUET-2103 URL: https://issues.apache.org/jira/browse/PARQUET-2103 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.12.2, 1.12.1, 1.12.0 Reporter: Gidon Gershinsky

In debug mode, this code

  if (LOG.isDebugEnabled()) {
    LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
  }

called in org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata() _*for unencrypted files*_ triggers an exception:

Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor
 at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
 at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
 at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
 at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1059) ~[parquet-jackson-1.12.0.jar:1.12.0]
 at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68) ~[parquet-hadoop-1.12.0.jar:1.12.0]
 ... 23 more

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421323#comment-17421323 ] Gidon Gershinsky commented on PARQUET-2080: --- Oh, sorry, done. > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gidon Gershinsky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet implementation > PARQUET-2089 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421193#comment-17421193 ] Gidon Gershinsky commented on PARQUET-2080: --- Hi [~gszadovszky] , I've prepared a short writeup on this alternative solution, with a discussion of the tradeoffs. After writing it, my feeling is that the trade-off is not in favor of this alternative option; but [here it goes|https://docs.google.com/document/d/1zr6-4em8C8DGi-D3jGosQe2gvJKluat-8uUbS0y7F-0/edit?usp=sharing], just to cover all bases. Will appreciate your opinion on this. > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gidon Gershinsky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet implementation > PARQUET-2089 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2080: -- Description: Due to PARQUET-2078 RowGroup.file_offset is not reliable. This field is also wrongly calculated in the C++ oss parquet implementation PARQUET-2089 was: Due to PARQUET-2078 RowGroup.file_offset is not reliable. This field is also wrongly calculated in the C++ oss parquet (Arrow rep). > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet implementation > PARQUET-2089 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky reassigned PARQUET-2080: - Assignee: Gidon Gershinsky (was: Gabor Szadovszky) > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gidon Gershinsky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet implementation > PARQUET-2089 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-2080: -- Description: Due to PARQUET-2078 RowGroup.file_offset is not reliable. This field is also wrongly calculated in the C++ oss parquet (Arrow rep). was:Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate the field and add suggestions how to calculate the value. > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. > This field is also wrongly calculated in the C++ oss parquet (Arrow rep). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset
[ https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414075#comment-17414075 ] Gidon Gershinsky commented on PARQUET-2080: --- [~gszadovszky] yes, I'll take it. There might be a different solution (also format-related) that bypasses the need to calculate such a parameter in any implementation, so it can be fully deprecated. I'll get back with the details and we'll discuss the trade-offs. > Deprecate RowGroup.file_offset > -- > > Key: PARQUET-2080 > URL: https://issues.apache.org/jira/browse/PARQUET-2080 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate > the field and add suggestions on how to calculate the value. -- This message was sent by Atlassian Jira (v8.3.4#803005)
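The "calculate instead of trust" idea behind deprecating RowGroup.file_offset can be sketched as deriving the row-group start from the first column chunk's page offsets. The class and field names below are illustrative, not the actual parquet-format Thrift types:

```java
// Hedged sketch: a row group starts where its first column chunk starts, and
// the chunk's dictionary page (when present) precedes its data pages.
public class RowGroupOffsetSketch {
    static class ColumnChunk {
        final long dataPageOffset;
        final Long dictionaryPageOffset;  // null when the chunk has no dictionary
        ColumnChunk(long dataPageOffset, Long dictionaryPageOffset) {
            this.dataPageOffset = dataPageOffset;
            this.dictionaryPageOffset = dictionaryPageOffset;
        }
    }
    // Derive the row-group start offset from the first column chunk's metadata,
    // rather than trusting a separately stored (and possibly wrong) field.
    public static long rowGroupStart(ColumnChunk firstColumn) {
        if (firstColumn.dictionaryPageOffset != null) {
            return Math.min(firstColumn.dictionaryPageOffset, firstColumn.dataPageOffset);
        }
        return firstColumn.dataPageOffset;
    }
    public static void main(String[] args) {
        System.out.println(rowGroupStart(new ColumnChunk(1200L, 1000L)));  // 1000
        System.out.println(rowGroupStart(new ColumnChunk(4L, null)));      // 4
    }
}
```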
[jira] [Commented] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393982#comment-17393982 ] Gidon Gershinsky commented on PARQUET-2071: --- A very useful tool, I'll be glad to review the PR. > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to encrypted state, we could develop a tool > like TransCompression that translates the data at the page level to encrypted state, > without reading into records and rewriting. This will speed up the process > significantly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1908) CLONE - [C++] Update cpp crypto package to match signed-off specification
[ https://issues.apache.org/jira/browse/PARQUET-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1908. --- Resolution: Fixed PR merged in May 2019 > CLONE - [C++] Update cpp crypto package to match signed-off specification > - > > Key: PARQUET-1908 > URL: https://issues.apache.org/jira/browse/PARQUET-1908 > Project: Parquet > Issue Type: Sub-task >Reporter: Akshay >Assignee: Gidon Gershinsky >Priority: Major > Labels: pull-request-available > Fix For: cpp-5.0.0 > > > An initial version of the crypto package is merged. This Jira updates the crypto > code to > # conform to the signed-off specification (wire protocol updates, signature tag > creation, AAD support, etc.) > # improve performance by extending the cipher lifecycle to file writing/reading > - instead of creating a cipher on each encrypt/decrypt operation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2053) Pluggable key material store
Gidon Gershinsky created PARQUET-2053: - Summary: Pluggable key material store Key: PARQUET-2053 URL: https://issues.apache.org/jira/browse/PARQUET-2053 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Encryption key material can be stored either inside Parquet files, or outside (configurable). For outside storage, Parquet already has a pluggable interface for custom implementations, {{FileKeyMaterialStore}}, but no mechanism to load them (currently, one implementation is packaged in parquet-mr, and is always loaded when outside storage is configured). We will provide a way to load custom implementations of {{FileKeyMaterialStore}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
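The loading mechanism this Jira proposes usually comes down to reflection over a configured class name. A minimal sketch of that pattern, with an illustrative interface standing in for the real FileKeyMaterialStore API:

```java
// Hedged sketch of reflection-based plugin loading; the KeyMaterialStore
// interface here is an illustrative stand-in, not the parquet-mr API.
import java.util.HashMap;
import java.util.Map;

public class PluginLoaderSketch {
    public interface KeyMaterialStore {
        void put(String fileId, String keyMaterial);
        String get(String fileId);
    }
    // A stand-in "custom implementation" a user would ship on the classpath.
    public static class InMemoryStore implements KeyMaterialStore {
        private final Map<String, String> store = new HashMap<>();
        public void put(String fileId, String keyMaterial) { store.put(fileId, keyMaterial); }
        public String get(String fileId) { return store.get(fileId); }
    }
    // Load an implementation by fully qualified class name, as a Hadoop-style
    // configuration property would supply it.
    public static KeyMaterialStore load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (KeyMaterialStore) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot load key material store: " + className, e);
        }
    }
    public static void main(String[] args) {
        KeyMaterialStore store = load(InMemoryStore.class.getName());
        store.put("part-0.parquet", "wrapped-key-material");
        System.out.println(store.get("part-0.parquet"));
    }
}
```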
[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files
[ https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1230: -- Component/s: parquet-mr Affects Version/s: 1.12.0 > CLI tools for encrypted files > - > > Key: PARQUET-1230 > URL: https://issues.apache.org/jira/browse/PARQUET-1230 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1230) CLI tools for encrypted files
[ https://issues.apache.org/jira/browse/PARQUET-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1230: -- Parent: (was: PARQUET-1178) Issue Type: New Feature (was: Sub-task) > CLI tools for encrypted files > - > > Key: PARQUET-1230 > URL: https://issues.apache.org/jira/browse/PARQUET-1230 > Project: Parquet > Issue Type: New Feature >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2040) Uniform encryption
Gidon Gershinsky created PARQUET-2040: - Summary: Uniform encryption Key: PARQUET-2040 URL: https://issues.apache.org/jira/browse/PARQUET-2040 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.0 Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky The PME low-level spec supports using the same encryption key for all columns, which is useful in a number of scenarios. However, this feature is not yet exposed in the high-level API, because its misuse can break the NIST limit on the number of AES GCM operations with one key. We will develop limit-enforcing code and provide an API for uniform table encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005)
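The limit-enforcing idea can be sketched as a per-key counter that fails before the NIST bound is crossed. The class shape is illustrative; NIST SP 800-38D caps random-IV GCM invocations with one key at 2^32:

```java
// Hedged sketch of per-key GCM operation counting for uniform (single-key)
// encryption; the real enforcement lives in parquet-mr, not in this class.
public class GcmOpLimiterSketch {
    // NIST SP 800-38D bound on random-IV GCM invocations per key.
    static final long MAX_GCM_OPS_PER_KEY = 1L << 32;
    private long ops = 0;
    // Call before every encrypt operation performed with this key.
    public void countOp() {
        if (ops >= MAX_GCM_OPS_PER_KEY) {
            throw new IllegalStateException(
                "GCM operation limit reached for this key; rotate the key");
        }
        ops++;
    }
    public long ops() { return ops; }
    public static void main(String[] args) {
        GcmOpLimiterSketch limiter = new GcmOpLimiterSketch();
        for (int i = 0; i < 1000; i++) limiter.countOp();
        System.out.println("ops counted: " + limiter.ops());
    }
}
```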
[jira] [Created] (PARQUET-2033) Make "null decryptor" exception more informative
Gidon Gershinsky created PARQUET-2033: - Summary: Make "null decryptor" exception more informative Key: PARQUET-2033 URL: https://issues.apache.org/jira/browse/PARQUET-2033 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.0 Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Forgetting to pass decryption properties when reading an encrypted column in files with a plaintext footer results in a "null decryptor" exception thrown in the ColumnChunkMetaData class. The exception text should be updated to point to the likely reason. -- This message was sent by Atlassian Jira (v8.3.4#803005)
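A sketch of the kind of message the Jira asks for; the wording and helper are illustrative assumptions, not the text that was merged:

```java
// Hypothetical sketch: build a "null decryptor" error that names the likely
// cause, instead of the bare "Null File Decryptor" string.
public class NullDecryptorMessageSketch {
    public static RuntimeException nullDecryptorError(String columnPath) {
        return new RuntimeException("[" + columnPath + "] Null File Decryptor: "
            + "the column is encrypted, but no decryption properties were passed to the reader. "
            + "Pass file decryption properties when reading plaintext-footer encrypted files.");
    }
    public static void main(String[] args) {
        System.out.println(nullDecryptorError("ssn").getMessage());
    }
}
```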
[jira] [Created] (PARQUET-2014) Local key wrapping with rotation
Gidon Gershinsky created PARQUET-2014: - Summary: Local key wrapping with rotation Key: PARQUET-2014 URL: https://issues.apache.org/jira/browse/PARQUET-2014 Project: Parquet Issue Type: New Feature Components: parquet-mr Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky parquet-mr-1.12.0 has experimental support for local wrapping of encryption keys, which doesn't handle master key versions and key rotation. This Jira will add these capabilities. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1613) Key rotation tool
[ https://issues.apache.org/jira/browse/PARQUET-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1613. --- Resolution: Done handled by pr 615 > Key rotation tool > - > > Key: PARQUET-1613 > URL: https://issues.apache.org/jira/browse/PARQUET-1613 > Project: Parquet > Issue Type: Sub-task >Reporter: Gidon Gershinsky >Assignee: Maya Anderson >Priority: Major > > Rotates the master key, for both single and double wrappers. > For the latter, enables support for a single KMS call per column, in readers > of any data sets. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1612) Double wrapped key manager
[ https://issues.apache.org/jira/browse/PARQUET-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1612. --- Resolution: Done handled by pr 615 > Double wrapped key manager > -- > > Key: PARQUET-1612 > URL: https://issues.apache.org/jira/browse/PARQUET-1612 > Project: Parquet > Issue Type: Sub-task >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > To minimize interaction with KMS, this manager will wrap the encryption keys > twice. Might be combined with key rotation for further optimization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1178. --- Resolution: Done Released. Thanks to all who've contributed to this new Parquet capability! > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.0 > > > A mechanism for modular encryption and decryption of Parquet files. Allows > keeping data fully encrypted in storage, while enabling efficient analytics > on the data, via reader-side extraction / authentication / decryption of data > subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different > columns with different keys. > Supports a number of encryption algorithms, to account for different security > and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1178: -- Fix Version/s: 1.12.0 > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.0 > > > A mechanism for modular encryption and decryption of Parquet files. Allows > keeping data fully encrypted in storage, while enabling efficient analytics > on the data, via reader-side extraction / authentication / decryption of data > subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different > columns with different keys. > Supports a number of encryption algorithms, to account for different security > and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe
[ https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299054#comment-17299054 ] Gidon Gershinsky commented on PARQUET-1997: --- I recall we talked about that with Tham, but I forgot the details.. [~thamha], do you remember what the return value is for? > [C++] AesEncryptor and AesDecryptor primitives are unsafe > - > > Key: PARQUET-1997 > URL: https://issues.apache.org/jira/browse/PARQUET-1997 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Antoine Pitrou >Priority: Major > > {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the > output buffer but without the output buffer length. The caller is required to > guess the expected output length. The functions also return the written > output length, but at this point it's too late: data may have been written > out of bounds. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1997) [C++] AesEncryptor and AesDecryptor primitives are unsafe
[ https://issues.apache.org/jira/browse/PARQUET-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299041#comment-17299041 ] Gidon Gershinsky commented on PARQUET-1997: --- [~apitrou] This point is addressed by the _int AesEncryptor::CiphertextSizeDelta()_ function - the caller uses it to allocate the output buffer. This is not part of the public Parquet API; the caller is the Parquet code itself. > [C++] AesEncryptor and AesDecryptor primitives are unsafe > - > > Key: PARQUET-1997 > URL: https://issues.apache.org/jira/browse/PARQUET-1997 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Antoine Pitrou >Priority: Major > > {{AesEncryptor::Encrypt}}, {{AesDecryptor::Decrypt}} take a pointer to the > output buffer but without the output buffer length. The caller is required to > guess the expected output length. The functions also return the written > output length, but at this point it's too late: data may have been written > out of bounds. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1992) Cannot build from tarball because of git submodules
[ https://issues.apache.org/jira/browse/PARQUET-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293509#comment-17293509 ] Gidon Gershinsky commented on PARQUET-1992: --- This contribution was added by [~mayaa]; she knows the subject better than I do. Maya, could you address the comments and the question in this jira? > Cannot build from tarball because of git submodules > --- > > Key: PARQUET-1992 > URL: https://issues.apache.org/jira/browse/PARQUET-1992 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Priority: Blocker > > Because we use git submodules (to get test parquet files), a simple "mvn clean > install" fails from the unpacked tarball due to "not a git repository". > I think we would have 2 options to solve this situation: > * Include all the required files (even those only for testing) in the tarball and > somehow avoid the git submodule update when executed in a non-git > environment > * Make the downloading of the parquet files and the related tests optional so > it won't fail the build from the tarball -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1989) Deep verification of encrypted files
Gidon Gershinsky created PARQUET-1989: - Summary: Deep verification of encrypted files Key: PARQUET-1989 URL: https://issues.apache.org/jira/browse/PARQUET-1989 Project: Parquet Issue Type: New Feature Components: parquet-cli Reporter: Gidon Gershinsky Assignee: Maya Anderson Fix For: 1.13.0 A tool that verifies encryption of parquet files in a given folder. It analyzes the footer, and then every module (page headers, pages, column indexes, bloom filters) - making sure they are encrypted (in the relevant columns). It can potentially also check the encryption keys. We'll start with a design doc, open for discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
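One check such a verification tool can start with: a Parquet file written in encrypted-footer mode ends with the magic bytes "PARE" instead of the plaintext "PAR1" (plaintext-footer encrypted files still end with "PAR1", so this only detects the encrypted-footer mode). A sketch over an in-memory byte array; a real tool would read the last 4 bytes of the file:

```java
// Hedged sketch: classify a file's footer mode by its trailing magic bytes.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FooterMagicSketch {
    public static String footerMagic(byte[] file) {
        if (file.length < 4) throw new IllegalArgumentException("not a Parquet file");
        return new String(Arrays.copyOfRange(file, file.length - 4, file.length),
                          StandardCharsets.US_ASCII);
    }
    // True only for encrypted-footer files; plaintext-footer encryption
    // keeps the regular PAR1 magic and needs deeper checks.
    public static boolean hasEncryptedFooter(byte[] file) {
        return "PARE".equals(footerMagic(file));
    }
    public static void main(String[] args) {
        byte[] plain = "....PAR1".getBytes(StandardCharsets.US_ASCII);
        byte[] encrypted = "....PARE".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasEncryptedFooter(plain));      // false
        System.out.println(hasEncryptedFooter(encrypted));  // true
    }
}
```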
[jira] [Updated] (PARQUET-1939) Fix RemoteKmsClient API ambiguity
[ https://issues.apache.org/jira/browse/PARQUET-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1939: -- Summary: Fix RemoteKmsClient API ambiguity (was: RemoteKmsClient API confusion) > Fix RemoteKmsClient API ambiguity > - > > Key: PARQUET-1939 > URL: https://issues.apache.org/jira/browse/PARQUET-1939 > Project: Parquet > Issue Type: Improvement >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Users complain that the RemoteKmsClient name can be confusing, since this class > covers both local and in-server (remote) key wrapping. Still, this class > supports only remote KMS servers. But to remove any ambiguity, and to make > the API simpler, we will rename this class to LocalWrapKmsClient; it will be > used only in rare situations where in-server wrapping is not supported. In > all other situations, the basic KmsClient interface will be used directly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1940) Make KeyEncryptionKey length configurable
Gidon Gershinsky created PARQUET-1940: - Summary: Make KeyEncryptionKey length configurable Key: PARQUET-1940 URL: https://issues.apache.org/jira/browse/PARQUET-1940 Project: Parquet Issue Type: Improvement Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky KEK length is hardcoded to 128 bits. It should be configurable, to any value allowed by AES (128, 192 or 256 bits). -- This message was sent by Atlassian Jira (v8.3.4#803005)
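The configurable KEK length maps directly onto the key sizes AES itself permits. A minimal stdlib sketch of generating a KEK at each allowed length (the loop variable stands in for whatever configuration property parquet-mr ends up exposing - the property name is not specified in this issue):

```java
import javax.crypto.KeyGenerator;

public class KekLengthDemo {
    public static void main(String[] args) throws Exception {
        // AES accepts exactly these key lengths -- the values PARQUET-1940
        // proposes to make selectable for the key-encryption key (KEK).
        int[] allowedBits = {128, 192, 256};
        for (int bits : allowedBits) {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(bits); // the configurable KEK length
            int keyBytes = kg.generateKey().getEncoded().length;
            System.out.println(bits + "-bit KEK -> " + keyBytes + " bytes");
        }
    }
}
```

Note that 256-bit AES keys work out of the box on current JDKs (the unlimited-strength policy has been the default since 8u161); on older runtimes they may require the policy files.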
[jira] [Created] (PARQUET-1939) RemoteKmsClient API confusion
Gidon Gershinsky created PARQUET-1939: - Summary: RemoteKmsClient API confusion Key: PARQUET-1939 URL: https://issues.apache.org/jira/browse/PARQUET-1939 Project: Parquet Issue Type: Improvement Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Users complain that the RemoteKmsClient name can be confusing, since this class covers both local and in-server (remote) key wrapping. Still, this class supports only remote KMS servers. But to remove any ambiguity, and to make the API simpler, we will rename this class to LocalWrapKmsClient; it will be used only in rare situations where in-server wrapping is not supported. In all other situations, the basic KmsClient interface will be used directly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1938) Option to get KMS details from key material (in key rotation)
Gidon Gershinsky created PARQUET-1938: - Summary: Option to get KMS details from key material (in key rotation) Key: PARQUET-1938 URL: https://issues.apache.org/jira/browse/PARQUET-1938 Project: Parquet Issue Type: Improvement Reporter: Gidon Gershinsky Assignee: Maya Anderson Currently, key rotation uses explicit parameters to get the KMS details. Instead, it can extract these details from the key material files - this is more convenient for users. Still, the explicit parameters (if provided) will override these values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1934) Dictionary page is not decrypted in predicate pushdown path
[ https://issues.apache.org/jira/browse/PARQUET-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1934. --- Resolution: Not A Problem Jira opened by mistake. No problems with dictionary decryption. > Dictionary page is not decrypted in predicate pushdown path > --- > > Key: PARQUET-1934 > URL: https://issues.apache.org/jira/browse/PARQUET-1934 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Predicate pushdown, based on dictionary pages, uses a page parsing code that > doesn't support decryption yet. Will add a few lines to decrypt the > dictionary page header and page (for encrypted columns). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1934) Dictionary page is not decrypted in predicate pushdown path
Gidon Gershinsky created PARQUET-1934: - Summary: Dictionary page is not decrypted in predicate pushdown path Key: PARQUET-1934 URL: https://issues.apache.org/jira/browse/PARQUET-1934 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.12.0 Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Predicate pushdown, based on dictionary pages, uses a page parsing code that doesn't support decryption yet. Will add a few lines to decrypt the dictionary page header and page (for encrypted columns). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1373) Encryption key management tools
[ https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1373: -- Fix Version/s: (was: encryption-feature-branch) 1.12.0 Affects Version/s: 1.12.0 > Encryption key management tools > > > Key: PARQUET-1373 > URL: https://issues.apache.org/jira/browse/PARQUET-1373 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.0 > > > Parquet Modular Encryption > ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides > an API that accepts keys, arbitrary key metadata and key retrieval callbacks > - which allows to implement basically any key management policy on top of it. > This Jira will add tools that implement a set of best practice elements for > key management. This is not an end-to-end key management, but rather a set of > components that might simplify design and development of an end-to-end > solution. > This tool set is one of many possible. There is no goal to create a single or > “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory > interface [(PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) > that allows to plug in different implementations of encryption key management. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption
[ https://issues.apache.org/jira/browse/PARQUET-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1854: -- Component/s: parquet-mr Fix Version/s: (was: encryption-feature-branch) 1.12.0 Affects Version/s: 1.12.0 > Properties-Driven Interface to Parquet Encryption > - > > Key: PARQUET-1854 > URL: https://issues.apache.org/jira/browse/PARQUET-1854 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: 1.12.0 > > > A high-level interface to Parquet encryption layer, based on configuration > properties (table properties, Hadoop configuration, writer/reader options, > etc) - will simplify the activation and configuration of data encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005)
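In practice, the properties-driven activation described in PARQUET-1854 boils down to a handful of Hadoop/table properties. A sketch of such a configuration (property names as I understand the parquet-mr 1.12 PropertiesDrivenCryptoFactory; the KMS client class, key ids, base64 key values, and column names below are placeholders for illustration):

```properties
# Plug in the properties-driven crypto factory
parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
# KMS client implementation -- placeholder class, supplied by the user
parquet.encryption.kms.client.class=com.example.MyKmsClient
# Demo/test master keys (keyId:base64Value) -- placeholder values, for mock KMS runs only
parquet.encryption.key.list=keyF:AAECAwQFBgcICQoLDA0ODw==,keyA:AAECAAECAAECAAECAAECAA==
# Which key encrypts which columns, and the footer key
parquet.encryption.column.keys=keyA:ssn,credit_card
parquet.encryption.footer.key=keyF
```

The point of the design is visible here: encryption is activated entirely through configuration, with no changes to the application's read/write code.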
[jira] [Updated] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1178: -- Component/s: parquet-mr Affects Version/s: 1.12.0 > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > A mechanism for modular encryption and decryption of Parquet files. Allows to > keep data fully encrypted in the storage - while enabling efficient analytics > on the data, via reader-side extraction / authentication / decryption of data > subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different > columns with different keys. > Supports a number of encryption algorithms, to account for different security > and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1376) Data obfuscation layer for encryption
[ https://issues.apache.org/jira/browse/PARQUET-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1376: -- Description: Data obfuscation in sensitive columns - for users without access to column encryption keys. # Implement on top of [basic Parquet encryption|https://github.com/apache/parquet-format/blob/encryption/Encryption.md] # Built-in support for multiple masking mechanisms, with different trade-off between data utility, leakage, and size/throughput overhead # Provide interface for plug-in custom masking mechanism # Enable storing multiple masked versions of the same column in a file # Provide readers with explicit list of column’s masked versions in a file # Enable readers to select a masked version of a column # Stretch: Implement tools for analysis of file data privacy properties and information leakage # Stretch: Leverage privacy analysis tools for tuning file data anonymity # Optional: Support aggregated obfuscation was: Anonymity layer for hidden columns # Different data masking options ** per-cell ** aggregated (average, etc) # Reader notification on data access status # Providing readers with a choice of masking options (if available) > Data obfuscation layer for encryption > - > > Key: PARQUET-1376 > URL: https://issues.apache.org/jira/browse/PARQUET-1376 > Project: Parquet > Issue Type: New Feature >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Data obfuscation in sensitive columns - for users without access to column > encryption keys. 
> # Implement on top of [basic Parquet > encryption|https://github.com/apache/parquet-format/blob/encryption/Encryption.md] > > # Built-in support for multiple masking mechanisms, with different trade-off > between data utility, leakage, and size/throughput overhead > # Provide interface for plug-in custom masking mechanism > # Enable storing multiple masked versions of the same column in a file > # Provide readers with explicit list of column’s masked versions in a file > # Enable readers to select a masked version of a column > # Stretch: Implement tools for analysis of file data privacy properties and > information leakage > # Stretch: Leverage privacy analysis tools for tuning file data anonymity > # Optional: Support aggregated obfuscation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1891) Encryption-related light fixes
[ https://issues.apache.org/jira/browse/PARQUET-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1891: -- Description: hadoop/readme.md, travis.yaml, vault sample fixes Summary: Encryption-related light fixes (was: hadoop/readme.md and travis.yaml fixes) > Encryption-related light fixes > -- > > Key: PARQUET-1891 > URL: https://issues.apache.org/jira/browse/PARQUET-1891 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > hadoop/readme.md, travis.yaml, vault sample fixes > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1892) CRC comment modification in Thrift
Gidon Gershinsky created PARQUET-1892: - Summary: CRC comment modification in Thrift Key: PARQUET-1892 URL: https://issues.apache.org/jira/browse/PARQUET-1892 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky Mention that CRC is calculated after compression and encryption -- This message was sent by Atlassian Jira (v8.3.4#803005)
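The ordering this comment clarifies - the CRC covers the bytes as they are stored, i.e. after compression (and, for encrypted pages, after encryption) - can be illustrated with the JDK's own Deflater and CRC32 (the encryption step is elided; this is a sketch of the ordering, not of Parquet's page-writing code):

```java
import java.util.zip.CRC32;
import java.util.zip.Deflater;

public class PageCrcDemo {
    public static void main(String[] args) {
        byte[] page = "repetitive page data repetitive page data".getBytes();

        // 1. Compress the page bytes ...
        Deflater deflater = new Deflater();
        deflater.setInput(page);
        deflater.finish();
        byte[] buf = new byte[page.length + 64];
        int compressedLen = deflater.deflate(buf);
        deflater.end();

        // 2. (for an encrypted column, encryption of the compressed
        //     bytes would happen here)

        // 3. ... then checksum the bytes exactly as they will sit on disk.
        CRC32 crc = new CRC32();
        crc.update(buf, 0, compressedLen);
        System.out.println("crc32 of stored bytes: " + crc.getValue());
    }
}
```

Computing the checksum last means a reader can verify page integrity without first decompressing or decrypting anything.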
[jira] [Created] (PARQUET-1891) hadoop/readme.md and travis.yaml fixes
Gidon Gershinsky created PARQUET-1891: - Summary: hadoop/readme.md and travis.yaml fixes Key: PARQUET-1891 URL: https://issues.apache.org/jira/browse/PARQUET-1891 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1884) Merge encryption branch into master
[ https://issues.apache.org/jira/browse/PARQUET-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1884. --- Resolution: Done > Merge encryption branch into master > --- > > Key: PARQUET-1884 > URL: https://issues.apache.org/jira/browse/PARQUET-1884 > Project: Parquet > Issue Type: Sub-task > Components: parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1884) Merge encryption branch into master
Gidon Gershinsky created PARQUET-1884: - Summary: Merge encryption branch into master Key: PARQUET-1884 URL: https://issues.apache.org/jira/browse/PARQUET-1884 Project: Parquet Issue Type: Sub-task Components: parquet-mr Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1568) High level interface to Parquet encryption
[ https://issues.apache.org/jira/browse/PARQUET-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1568. --- Resolution: Duplicate > High level interface to Parquet encryption > -- > > Key: PARQUET-1568 > URL: https://issues.apache.org/jira/browse/PARQUET-1568 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp, parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Per discussion at the community sync, we're working on a > design of a high level interface to Parquet encryption. A draft doc will be > published soon, that covers the current proposals from Xinli (cryptodata > interface), Ryan (table properties) and Gidon (key tools, hadoop config) - > and starts to draft a unified/streamlined design based on them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption
[ https://issues.apache.org/jira/browse/PARQUET-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky resolved PARQUET-1854. --- Fix Version/s: encryption-feature-branch Resolution: Done > Properties-Driven Interface to Parquet Encryption > - > > Key: PARQUET-1854 > URL: https://issues.apache.org/jira/browse/PARQUET-1854 > Project: Parquet > Issue Type: New Feature >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > Fix For: encryption-feature-branch > > > A high-level interface to Parquet encryption layer, based on configuration > properties (table properties, Hadoop configuration, writer/reader options, > etc) - will simplify the activation and configuration of data encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1373) Encryption key management tools
[ https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1373: -- Description: Parquet Modular Encryption ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an API that accepts keys, arbitrary key metadata and key retrieval callbacks - which allows to implement basically any key management policy on top of it. This Jira will add tools that implement a set of best practice elements for key management. This is not an end-to-end key management, but rather a set of components that might simplify design and development of an end-to-end solution. This tool set is one of many possible. There is no goal to create a single or “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory interface [(PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) that allows to plug in different implementations of encryption key management. was: Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, arbitrary key metadata and key retrieval callbacks - which allows to implement basically any key management policy on top of it. This Jira will add tools that implement a set of best practice elements for key management. This is not an end-to-end key management, but rather a set of components that might simplify design and development of an end-to-end solution. For example, the tools will cover * modification of key metadata inside existing Parquet files. * support for re-keying that doesn't require modification of Parquet files. Parquet will not mandate a use of these tools. Users will be able to continue working with the basic API, to create any custom key management solution that addresses their security requirements. If helps, they can also utilize some or all of these tools. 
> Encryption key management tools > > > Key: PARQUET-1373 > URL: https://issues.apache.org/jira/browse/PARQUET-1373 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Parquet Modular Encryption > ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides > an API that accepts keys, arbitrary key metadata and key retrieval callbacks > - which allows to implement basically any key management policy on top of it. > This Jira will add tools that implement a set of best practice elements for > key management. This is not an end-to-end key management, but rather a set of > components that might simplify design and development of an end-to-end > solution. > This tool set is one of many possible. There is no goal to create a single or > “standard” toolkit for Parquet encryption keys. Parquet has a Crypto Factory > interface [(PARQUET-1817|https://issues.apache.org/jira/browse/PARQUET-1817]) > that allows to plug in different implementations of encryption key management. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1373) Encryption key management tools
[ https://issues.apache.org/jira/browse/PARQUET-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky updated PARQUET-1373: -- Description: Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, arbitrary key metadata and key retrieval callbacks - which allows to implement basically any key management policy on top of it. This Jira will add tools that implement a set of best practice elements for key management. This is not an end-to-end key management, but rather a set of components that might simplify design and development of an end-to-end solution. For example, the tools will cover * modification of key metadata inside existing Parquet files. * support for re-keying that doesn't require modification of Parquet files. Parquet will not mandate the use of these tools. Users will be able to continue working with the basic API, to create any custom key management solution that addresses their security requirements. If it helps, they can also utilize some or all of these tools. was: Parquet Modular Encryption ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an API that accepts keys, arbitrary key metadata and key retrieval callbacks - which allows to implement basically any key management policy on top of it. This Jira will add tools that implement a set of best practice elements for key management. This is not an end-to-end key management, but rather a set of components that might simplify design and development of an end-to-end solution. For example, the tools will cover * modification of key metadata inside existing Parquet files. * support for re-keying that doesn't require modification of Parquet files. Parquet will not mandate a use of these tools. Users will be able to continue working with the basic API, to create any custom key management solution that addresses their security requirements. If helps, they can also utilize some or all of these tools. 
> Encryption key management tools > > > Key: PARQUET-1373 > URL: https://issues.apache.org/jira/browse/PARQUET-1373 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Parquet Modular Encryption (PARQUET-1178) provides an API that accepts keys, > arbitrary key metadata and key retrieval callbacks - which allows to > implement basically any key management policy on top of it. This Jira will > add tools that implement a set of best practice elements for key management. > This is not an end-to-end key management, but rather a set of components that > might simplify design and development of an end-to-end solution. > For example, the tools will cover > * modification of key metadata inside existing Parquet files. > * support for re-keying that doesn't require modification of Parquet files. > > Parquet will not mandate the use of these tools. Users will be able to continue > working with the basic API, to create any custom key management solution that > addresses their security requirements. If it helps, they can also utilize some > or all of these tools. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1854) Properties-Driven Interface to Parquet Encryption
Gidon Gershinsky created PARQUET-1854: - Summary: Properties-Driven Interface to Parquet Encryption Key: PARQUET-1854 URL: https://issues.apache.org/jira/browse/PARQUET-1854 Project: Parquet Issue Type: New Feature Reporter: Gidon Gershinsky Assignee: Gidon Gershinsky A high-level interface to Parquet encryption layer, based on configuration properties (table properties, Hadoop configuration, writer/reader options, etc) - will simplify the activation and configuration of data encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098549#comment-17098549 ] Gidon Gershinsky edited comment on PARQUET-1178 at 5/3/20, 7:52 PM: hard to say what's best at this point. Here's the arrow/parquet-cpp encryption [sample|https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc]. was (Author: gershinsky): hard to say what's best at this point. Here's the arrow/parquet-cpp encryption [sample|[https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc]]. > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > A mechanism for modular encryption and decryption of Parquet files. Allows to > keep data fully encrypted in the storage - while enabling efficient analytics > on the data, via reader-side extraction / authentication / decryption of data > subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different > columns with different keys. > Supports a number of encryption algorithms, to account for different security > and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1178) Parquet modular encryption
[ https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098549#comment-17098549 ] Gidon Gershinsky commented on PARQUET-1178: --- hard to say what's best at this point. Here's the arrow/parquet-cpp encryption [sample|https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/examples/parquet/low-level-api/encryption-reader-writer-all-crypto-options.cc]. > Parquet modular encryption > -- > > Key: PARQUET-1178 > URL: https://issues.apache.org/jira/browse/PARQUET-1178 > Project: Parquet > Issue Type: New Feature >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > A mechanism for modular encryption and decryption of Parquet files. Allows to > keep data fully encrypted in the storage - while enabling efficient analytics > on the data, via reader-side extraction / authentication / decryption of data > subsets required by columnar projection and predicate push-down. > Enables fine-grained access control to column data by encrypting different > columns with different keys. > Supports a number of encryption algorithms, to account for different security > and performance requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1568) High level interface to Parquet encryption
[ https://issues.apache.org/jira/browse/PARQUET-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gidon Gershinsky reassigned PARQUET-1568: - Assignee: Gidon Gershinsky > High level interface to Parquet encryption > -- > > Key: PARQUET-1568 > URL: https://issues.apache.org/jira/browse/PARQUET-1568 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp, parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Per discussion at the community sync, we're working on a > design of a high level interface to Parquet encryption. A draft doc will be > published soon, that covers the current proposals from Xinli (cryptodata > interface), Ryan (table properties) and Gidon (key tools, hadoop config) - > and starts to draft a unified/streamlined design based on them. -- This message was sent by Atlassian Jira (v8.3.4#803005)