[jira] [Comment Edited] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5
[ https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715304#comment-17715304 ] Xinli Shang edited comment on PARQUET-2276 at 4/22/23 4:36 PM: --- [~a2l] Did you try Hadoop 2.9.x? I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2. Parquet is widely used by so many companies, and a breaking change like this means a lot to the industry. We should have made it clear when taking breaking changes like this. [~a2l] Do you think you can work on it? was (Author: sha...@uber.com): [~a2l]Did you try Hadoop 2.9.x? I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2. Parquet is widely used by so many companies, and a breaking change like this means a lot to the industry. We should have made it clear when taking breaking changes like this. [~a2l] Do you think you can work on it? > ParquetReader reads do not work with Hadoop version 2.8.5 > - > > Key: PARQUET-2276 > URL: https://issues.apache.org/jira/browse/PARQUET-2276 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Atul Mohan >Priority: Major > > {{ParquetReader.read() fails with the following exception on parquet-mr > version 1.13.0 when using hadoop version 2.8.5:}} > {code:java} > java.lang.NoSuchMethodError: 'boolean > org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' > at > org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74) > > at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) > at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787) > > at > org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) > at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) > org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) > {code} > > > > From an initial investigation, it looks like HadoopStreams has started using > [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74] > but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop > 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5
[ https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715304#comment-17715304 ] Xinli Shang commented on PARQUET-2276: -- [~Aufderhar] Did you try Hadoop 2.9.x? I agree with [~gszadovszky]. Let's find a way to add back support for Hadoop 2. Parquet is widely used by so many companies, and a breaking change like this means a lot to the industry. We should have made it clear when taking breaking changes like this. [~a2l] Do you think you can work on it? > ParquetReader reads do not work with Hadoop version 2.8.5 > - > > Key: PARQUET-2276 > URL: https://issues.apache.org/jira/browse/PARQUET-2276 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Atul Mohan >Priority: Major > > {{ParquetReader.read() fails with the following exception on parquet-mr > version 1.13.0 when using hadoop version 2.8.5:}} > {code:java} > java.lang.NoSuchMethodError: 'boolean > org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' > at > org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74) > > at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) > at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787) > > at > org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) > at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) > org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) > {code} > > > > From an initial investigation, it looks like HadoopStreams has started using > [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74] > but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop > 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
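[Editor's note] A minimal sketch of one way Hadoop 2.x support could be restored, for illustration only (this is not the merged fix): compile against the newer API but catch the NoSuchMethodError that Hadoop 2.8.x raises at the hasCapability call site and fall back to a wrapped-stream type check. The capability string is the one HadoopStreams probes.

{code:java}
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

// Sketch: probe for ByteBuffer reads without hard-depending on the
// Hadoop 2.9+ StreamCapabilities API.
public final class ByteBufferReadableCheck {
  private ByteBufferReadableCheck() {}

  public static boolean isByteBufferReadable(FSDataInputStream stream) {
    try {
      // hasCapability exists only in Hadoop 2.9+; on Hadoop 2.8.x this
      // call site throws NoSuchMethodError at runtime.
      return stream.hasCapability("in:readbytebuffer");
    } catch (NoSuchMethodError e) {
      // Hadoop 2.8.x fallback: inspect the wrapped stream type directly.
      return stream.getWrappedStream() instanceof ByteBufferReadable;
    }
  }
}
{code}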
[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701789#comment-17701789 ] Xinli Shang commented on PARQUET-1690: -- It was quite a long time ago; I don't remember. Yeah, it would be great to start a new PR. > Integer Overflow of BinaryStatistics#isSmallerThan() > > > Key: PARQUET-1690 > URL: https://issues.apache.org/jira/browse/PARQUET-1690 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Labels: pull-request-available > > "(min.length() + max.length()) < size" didn't handle integer overflow > [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L103] -- This message was sent by Atlassian Jira (v8.20.10#820010)
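[Editor's note] For reference, a hedged sketch of the overflow-safe comparison, assuming the min/max Binary fields and hasNonNullValue() of BinaryStatistics as in the linked source; widening to long before adding is the essential fix.

{code:java}
// Inside BinaryStatistics (sketch): min and max are
// org.apache.parquet.io.api.Binary values.
public boolean isSmallerThan(long size) {
  // min.length() + max.length() can overflow int when both values are
  // large; summing as longs keeps the comparison correct.
  return !hasNonNullValue()
      || ((long) min.length() + (long) max.length()) < size;
}
{code}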
[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th
[ https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680705#comment-17680705 ] Xinli Shang commented on PARQUET-2233: -- [~Jiashen Zhang] Please have a look and we can discuss if there are still blocking issues. > Parquet Travis CI jobs to be turned off February 15th > - > > Key: PARQUET-2233 > URL: https://issues.apache.org/jira/browse/PARQUET-2233 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > > Greetings Parquet PMC, > Infrastructure has reached out to you regarding the Travis CI Open Source > policy changes, and the resulting need for Apache projects to migrate away > from using Travis. > So far, we have received no response from your PMC. > On February 15th, we will begin the final phase of this migration, turning > off Travis builds in order to bring our Travis usage down to 0. > We have found the following repositories mention or make use of .travis.yml > files: > * parquet-mr.git > * parquet-cpp.git > You must immediately move to migrate your builds from Travis. If you do not, > you will soon be unable to do builds that now rely on Travis. > Many projects have moved to using GitHub Actions, and migrating to GHA is > quite straightforward. Other projects use Jenkins providing ARM support, with > nodes using the arm label > If you are unsure how to proceed, I would be happy to explain your next steps. > Please at least respond to acknowledge the need to migrate away from Travis, > and to tell us your current plans. > Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th
[ https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680363#comment-17680363 ] Xinli Shang commented on PARQUET-2233: -- Were you able to log in? > Parquet Travis CI jobs to be turned off February 15th > - > > Key: PARQUET-2233 > URL: https://issues.apache.org/jira/browse/PARQUET-2233 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > > Greetings Parquet PMC, > Infrastructure has reached out to you regarding the Travis CI Open Source > policy changes, and the resulting need for Apache projects to migrate away > from using Travis. > So far, we have received no response from your PMC. > On February 15th, we will begin the final phase of this migration, turning > off Travis builds in order to bring our Travis usage down to 0. > We have found the following repositories mention or make use of .travis.yml > files: > * parquet-mr.git > * parquet-cpp.git > You must immediately move to migrate your builds from Travis. If you do not, > you will soon be unable to do builds that now rely on Travis. > Many projects have moved to using GitHub Actions, and migrating to GHA is > quite straightforward. Other projects use Jenkins providing ARM support, with > nodes using the arm label > If you are unsure how to proceed, I would be happy to explain your next steps. > Please at least respond to acknowledge the need to migrate away from Travis, > and to tell us your current plans. > Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th
[ https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680345#comment-17680345 ] Xinli Shang edited comment on PARQUET-2233 at 1/24/23 8:19 PM: --- In this issue, we are going to migrate parquet-mr.git and parquet-format.git. The hard deadline is 2/15/2023. More information can be found at https://cwiki.apache.org/confluence/display/INFRA/Travis+Migrations. We will see if we can migrate to GitHub Actions. was (Author: sha...@uber.com): In this Issue, we are going to migrate parquet-mr.git and parquet-format.git. The hard deadline is 2/15/2023 > Parquet Travis CI jobs to be turned off February 15th > - > > Key: PARQUET-2233 > URL: https://issues.apache.org/jira/browse/PARQUET-2233 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > > Greetings Parquet PMC, > Infrastructure has reached out to you regarding the Travis CI Open Source > policy changes, and the resulting need for Apache projects to migrate away > from using Travis. > So far, we have received no response from your PMC. > On February 15th, we will begin the final phase of this migration, turning > off Travis builds in order to bring our Travis usage down to 0. > We have found the following repositories mention or make use of .travis.yml > files: > * parquet-mr.git > * parquet-cpp.git > You must immediately move to migrate your builds from Travis. If you do not, > you will soon be unable to do builds that now rely on Travis. > Many projects have moved to using GitHub Actions, and migrating to GHA is > quite straightforward. Other projects use Jenkins providing ARM support, with > nodes using the arm label > If you are unsure how to proceed, I would be happy to explain your next steps. > Please at least respond to acknowledge the need to migrate away from Travis, > and to tell us your current plans. > Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th
[ https://issues.apache.org/jira/browse/PARQUET-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680345#comment-17680345 ] Xinli Shang commented on PARQUET-2233: -- In this Issue, we are going to migrate parquet-mr.git and parquet-format.git. The hard deadline is 2/15/2023 > Parquet Travis CI jobs to be turned off February 15th > - > > Key: PARQUET-2233 > URL: https://issues.apache.org/jira/browse/PARQUET-2233 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > > Greetings Parquet PMC, > Infrastructure has reached out to you regarding the Travis CI Open Source > policy changes, and the resulting need for Apache projects to migrate away > from using Travis. > So far, we have received no response from your PMC. > On February 15th, we will begin the final phase of this migration, turning > off Travis builds in order to bring our Travis usage down to 0. > We have found the following repositories mention or make use of .travis.yml > files: > * parquet-mr.git > * parquet-cpp.git > You must immediately move to migrate your builds from Travis. If you do not, > you will soon be unable to do builds that now rely on Travis. > Many projects have moved to using GitHub Actions, and migrating to GHA is > quite straightforward. Other projects use Jenkins providing ARM support, with > nodes using the arm label > If you are unsure how to proceed, I would be happy to explain your next steps. > Please at least respond to acknowledge the need to migrate away from Travis, > and to tell us your current plans. > Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (PARQUET-2233) Parquet Travis CI jobs to be turned off February 15th
Xinli Shang created PARQUET-2233: Summary: Parquet Travis CI jobs to be turned off February 15th Key: PARQUET-2233 URL: https://issues.apache.org/jira/browse/PARQUET-2233 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Xinli Shang Greetings Parquet PMC, Infrastructure has reached out to you regarding the Travis CI Open Source policy changes, and the resulting need for Apache projects to migrate away from using Travis. So far, we have received no response from your PMC. On February 15th, we will begin the final phase of this migration, turning off Travis builds in order to bring our Travis usage down to 0. We have found the following repositories mention or make use of .travis.yml files: * parquet-mr.git * parquet-cpp.git You must immediately move to migrate your builds from Travis. If you do not, you will soon be unable to do builds that now rely on Travis. Many projects have moved to using GitHub Actions, and migrating to GHA is quite straightforward. Other projects use Jenkins providing ARM support, with nodes using the arm label If you are unsure how to proceed, I would be happy to explain your next steps. Please at least respond to acknowledge the need to migrate away from Travis, and to tell us your current plans. Thank you! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (PARQUET-2183) Fix statistics issue of Column Encryptor
Xinli Shang created PARQUET-2183: Summary: Fix statistics issue of Column Encryptor Key: PARQUET-2183 URL: https://issues.apache.org/jira/browse/PARQUET-2183 Project: Parquet Issue Type: Improvement Reporter: Xinli Shang Assignee: Xinli Shang There is an issue where column statistics are missing if that column is re-encrypted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519686#comment-17519686 ] Xinli Shang commented on PARQUET-1681: -- [~theosib-amazon] It seems different. > Avro's isElementType() change breaks the reading of some parquet(1.8.1) files > - > > Key: PARQUET-1681 > URL: https://issues.apache.org/jira/browse/PARQUET-1681 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.10.0, 1.9.1, 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Critical > > When using the Avro schema below to write a parquet(1.8.1) file and then reading it > back using parquet 1.10.1 without passing any schema, the read throws > an exception "XXX is not a group". Reading through parquet 1.8.1 is fine. > { > "name": "phones", > "type": [ > "null", > { > "type": "array", > "items": { > "type": "record", > "name": "phones_items", > "fields": [ > > { "name": "phone_number", > "type": [ "null", > "string" ], "default": null > } > ] > } > } > ], > "default": null > } > The code to read is as below: > val reader = > AvroParquetReader._builder_[SomeRecordType](parquetPath).withConf(*new* > Configuration).build() > reader.read() > PARQUET-651 changed the method isElementType() to rely on Avro's > checkReaderWriterCompatibility() to check compatibility. However, > checkReaderWriterCompatibility() considers the Parquet schema and the > Avro schema (converted from the file schema) as not compatible (the name in the avro > schema is ‘phones_items’, but the name is ‘array’ in the Parquet schema, hence > not compatible). It therefore returns false, which causes the “phone_number” field in > the above schema to be treated as a group type, which is not true. The > exception is then thrown at .asGroupType(). > I didn’t try whether writing via parquet 1.10.1 would reproduce the same problem. > But it could, because the translation of the Avro schema to the Parquet schema has > not changed (didn’t verify yet). > I hesitate to revert PARQUET-651 because it solved several problems. I would > like to hear the community's thoughts on it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
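[Editor's note] One commonly used workaround is to pass the expected Avro schema explicitly so the reader does not have to infer the list element type from the Parquet schema. A hedged sketch, where writerSchema and parquetPath are placeholders supplied by the caller:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadWithExplicitSchema {
  public static GenericRecord readFirst(String parquetPath, Schema writerSchema)
      throws java.io.IOException {
    Configuration conf = new Configuration();
    // Supply the Avro schema used at write time so isElementType() does
    // not have to guess the list structure.
    AvroReadSupport.setAvroReadSchema(conf, writerSchema);
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path(parquetPath))
            .withConf(conf)
            .build()) {
      return reader.read();
    }
  }
}
{code}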
[jira] [Commented] (PARQUET-1595) Parquet proto writer de-nest Protobuf wrapper classes
[ https://issues.apache.org/jira/browse/PARQUET-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509500#comment-17509500 ] Xinli Shang commented on PARQUET-1595: -- Is it a typo for Int32Value -> int64? > Parquet proto writer de-nest Protobuf wrapper classes > - > > Key: PARQUET-1595 > URL: https://issues.apache.org/jira/browse/PARQUET-1595 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Ying Xu >Priority: Major > > Existing Parquet protobuf writer support preserves the structure of any > Protobuf Message objects. This works well in most cases. However, when > dealing with [Protobuf wrapper > messages|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto], > users may prefer directly writing the de-nested value into the Parquet > files, for ease of querying them directly (in query engine such as > Hive/Presto). > Proposal: > * Implement a control flag, e.g., enableDenestingWrappers, to control > whether or not to denest Protobuf wrapper classes. > * When this flag is set to true, write the Protobuf wrapper classes as > single primitive fields, based on the type of the wrapped *value* field. > > ||Protobuf Type||Parquet Type|| > |BoolValue|boolean| > |BytesValue|binary| > |DoubleValue|double| > |FloatValue|float| > |Int32Value|int64 (32-bit, signed)| > |Int64Value|int64 (64-bit, signed)| > |StringValue|binary (string)| > |UInt32Value|int64 (32-bit, unsigned)| > |UInt64Value|int64 (64-bit, unsigned)| > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (PARQUET-2116) Cell Level Encryption
[ https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2116: - External issue URL: https://docs.google.com/document/d/1PUonl9i_fVlRhUmqEmWBQJ8zesX7mlvnu3ubemT11rk/edit#heading=h.kkuoyw5u0ywe (was: https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#) > Cell Level Encryption > -- > > Key: PARQUET-2116 > URL: https://issues.apache.org/jira/browse/PARQUET-2116 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > Cell level encryption can do finer-grained encryption than modular > encryption (PARQUET-1178) or file encryption. The idea is that only some fields > inside a column are encrypted, based on a filter expression. For example, in a > table with columns a, b, c.x, c.y, d, we can encrypt columns a and c.x where d == > 5 and c.y > 0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar
[ https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494321#comment-17494321 ] Xinli Shang edited comment on PARQUET-2127 at 2/18/22, 2:23 AM: Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? I will be happy to review and merge. was (Author: sha...@uber.com): Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? I will be happy to review and merge.. > Security risk in latest parquet-jackson-1.12.2.jar > -- > > Key: PARQUET-2127 > URL: https://issues.apache.org/jira/browse/PARQUET-2127 > Project: Parquet > Issue Type: Improvement >Reporter: phoebe chen >Priority: Major > > Embed jackson-databind:2.11.4 has security risk of Possible DoS if using JDK > serialization to serialize JsonNode > ([https://github.com/FasterXML/jackson-databind/issues/3328] ), upgrade to > 2.13.1 can fix this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar
[ https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494321#comment-17494321 ] Xinli Shang commented on PARQUET-2127: -- Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? I will be happy to review and merge.. > Security risk in latest parquet-jackson-1.12.2.jar > -- > > Key: PARQUET-2127 > URL: https://issues.apache.org/jira/browse/PARQUET-2127 > Project: Parquet > Issue Type: Improvement >Reporter: phoebe chen >Priority: Major > > Embed jackson-databind:2.11.4 has security risk of Possible DoS if using JDK > serialization to serialize JsonNode > ([https://github.com/FasterXML/jackson-databind/issues/3328] ), upgrade to > 2.13.1 can fix this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492099#comment-17492099 ] Xinli Shang edited comment on PARQUET-2122 at 2/14/22, 4:56 PM: [~junjie] Do you know why? was (Author: sha...@uber.com): [~junjie]Do you know why? > Adding Bloom filter to small Parquet file bloats in size X1700 > -- > > Key: PARQUET-2122 > URL: https://issues.apache.org/jira/browse/PARQUET-2122 > Project: Parquet > Issue Type: Bug > Components: parquet-cli, parquet-mr >Affects Versions: 1.13.0 >Reporter: Ze'ev Maor >Priority: Critical > Attachments: data.csv, data_index_bloom.parquet > > > Converting a small, 14 rows/1 string column csv file to Parquet without bloom > filter yields a 600B file, adding '.withBloomFilterEnabled(true)' to > ParquetWriter then yields a 1049197B file. > It isn't clear what the extra space is used by. > Attached csv and bloated Parquet files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492099#comment-17492099 ] Xinli Shang commented on PARQUET-2122: -- [~junjie]Do you know why? > Adding Bloom filter to small Parquet file bloats in size X1700 > -- > > Key: PARQUET-2122 > URL: https://issues.apache.org/jira/browse/PARQUET-2122 > Project: Parquet > Issue Type: Bug > Components: parquet-cli, parquet-mr >Affects Versions: 1.13.0 >Reporter: Ze'ev Maor >Priority: Critical > Attachments: data.csv, data_index_bloom.parquet > > > Converting a small, 14 rows/1 string column csv file to Parquet without bloom > filter yields a 600B file, adding '.withBloomFilterEnabled(true)' to > ParquetWriter then yields a 1049197B file. > It isn't clear what the extra space is used by. > Attached csv and bloated Parquet files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
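[Editor's note] The reported size (1049197B ≈ 1 MiB plus overhead) is consistent with the writer allocating the maximum Bloom filter size when no cardinality hint is given. A hedged sketch of sizing the filter for the actual data; the column name `name` and the 14-value NDV mirror the attached csv and are illustrative, not a confirmed fix:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BloomSizedWriter {
  public static ParquetWriter<Group> open(String path) throws java.io.IOException {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message csv { required binary name (UTF8); }");
    return ExampleParquetWriter.builder(new Path(path))
        .withType(schema)
        .withBloomFilterEnabled(true)
        // Hint the expected number of distinct values so the filter is
        // sized for the data instead of the default maximum (~1 MiB).
        .withBloomFilterNDV("name", 14)
        .build();
  }
}
{code}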
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485949#comment-17485949 ] Xinli Shang commented on PARQUET-2117: -- Thanks for opening this Jira! Look forward to the PR. > Add rowPosition API in parquet record readers > - > > Key: PARQUET-2117 > URL: https://issues.apache.org/jira/browse/PARQUET-2117 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Prakhar Jain >Priority: Major > Fix For: 1.13.0 > > > Currently the parquet-mr RecordReader/ParquetFileReader exposes API’s to read > parquet file in columnar fashion or record-by-record. > It will be great to extend them to also support rowPosition API which can > tell the position of the current record in the parquet file. > The rowPosition can be used as a unique row identifier to mark a row. This > can be useful to create an index (e.g. B+ tree) over a parquet file/parquet > table (e.g. Spark/Hive). > There are multiple projects in the parquet eco-system which can benefit from > such a functionality: > # Apache Iceberg needs this functionality. It has this implementation > already as it relies on low level parquet APIs - > [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171], > > [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37] > # Apache Spark can use this functionality - SPARK-37980 -- This message was sent by Atlassian Jira (v8.20.1#820001)
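[Editor's note] For context, the position this ticket wants exposed can already be derived from footer metadata; a conceptual sketch, where the caller is assumed to track the current row group index and the record offset within it:

{code:java}
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public final class RowPositions {
  private RowPositions() {}

  // Absolute row position = rows in all preceding row groups + offset
  // within the current row group.
  public static long rowPosition(ParquetMetadata footer, int currentRowGroup,
      long indexInRowGroup) {
    long precedingRows = 0;
    for (BlockMetaData block : footer.getBlocks().subList(0, currentRowGroup)) {
      precedingRows += block.getRowCount();
    }
    return precedingRows + indexInRowGroup;
  }
}
{code}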
[jira] [Updated] (PARQUET-2116) Cell Level Encryption
[ https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2116: - External issue URL: https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit# > Cell Level Encryption > -- > > Key: PARQUET-2116 > URL: https://issues.apache.org/jira/browse/PARQUET-2116 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > Cell level encryption can do finer-grained encryption than modular > encryption (PARQUET-1178) or file encryption. The idea is that only some fields > inside a column are encrypted, based on a filter expression. For example, in a > table with columns a, b, c.x, c.y, d, we can encrypt columns a and c.x where d == > 5 and c.y > 0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (PARQUET-2116) Cell Level Encryption
Xinli Shang created PARQUET-2116: Summary: Cell Level Encryption Key: PARQUET-2116 URL: https://issues.apache.org/jira/browse/PARQUET-2116 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang Cell level encryption can do finer-grained encryption than modular encryption (PARQUET-1178) or file encryption. The idea is that only some fields inside a column are encrypted, based on a filter expression. For example, in a table with columns a, b, c.x, c.y, d, we can encrypt columns a and c.x where d == 5 and c.y > 0. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (PARQUET-2091) Fix release build error introduced by PARQUET-2043
[ https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-2091. -- Resolution: Won't Fix > Fix release build error introduced by PARQUET-2043 > -- > > Key: PARQUET-2091 > URL: https://issues.apache.org/jira/browse/PARQUET-2091 > Project: Parquet > Issue Type: Task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > After PARQUET-2043, when building for a release like 1.12.1, there is a build > error complaining about a 'used undeclared dependency'. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher
[ https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483225#comment-17483225 ] Xinli Shang commented on PARQUET-2098: -- [~gershinsky] Do you have time to work on it, as we discussed, so we can release the new version? > Add more methods into interface of BlockCipher > -- > > Key: PARQUET-2098 > URL: https://issues.apache.org/jira/browse/PARQUET-2098 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > Currently the BlockCipher interface has methods that don't let the caller specify a > length/offset. In some use cases like Presto, it is necessary to pass in a byte > array where the data to be encrypted only occupies part of the array. So > we need to add a new method, something like the one below, for decrypt. Similar > methods might be needed for encrypt. > byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, > byte[] aad); -- This message was sent by Atlassian Jira (v8.20.1#820001)
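[Editor's note] A sketch of the proposed shape: the offset/length decrypt signature is quoted from this ticket, the two-argument methods mirror the existing BlockCipher interface, and the encrypt variant is an analogous assumption; none of this is a merged API.

{code:java}
// Sketch of org.apache.parquet.format.BlockCipher with the proposed
// offset/length variants added.
public interface BlockCipher {

  interface Encryptor {
    // Existing: encrypt a whole plaintext buffer with the given AAD.
    byte[] encrypt(byte[] plaintext, byte[] aad);
    // Possible analogous addition for callers whose plaintext occupies
    // only part of a larger array.
    byte[] encrypt(byte[] plaintext, int plainTextOffset, int plainTextLength,
        byte[] aad);
  }

  interface Decryptor {
    // Existing: decrypt a whole ciphertext buffer with the given AAD.
    byte[] decrypt(byte[] lengthAndCiphertext, byte[] aad);
    // Proposed in this ticket: decrypt a slice of a larger array without
    // copying it into a dedicated buffer first.
    byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength,
        byte[] aad);
  }
}
{code}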
[jira] [Resolved] (PARQUET-2112) Fix typo in MessageColumnIO
[ https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-2112. -- Resolution: Fixed > Fix typo in MessageColumnIO > --- > > Key: PARQUET-2112 > URL: https://issues.apache.org/jira/browse/PARQUET-2112 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.2 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.13.0 > > > Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes' -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (PARQUET-2112) Fix typo in MessageColumnIO
Xinli Shang created PARQUET-2112: Summary: Fix typo in MessageColumnIO Key: PARQUET-2112 URL: https://issues.apache.org/jira/browse/PARQUET-2112 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.2 Reporter: Xinli Shang Assignee: Xinli Shang Fix For: 1.13.0 Typo of the variable 'BitSet vistedIndexes'. Change it to 'visitedIndexes' -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-2111) Support limit push down and stop early for RecordReader
[ https://issues.apache.org/jira/browse/PARQUET-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480128#comment-17480128 ] Xinli Shang commented on PARQUET-2111: -- Look forward to the PR > Support limit push down and stop early for RecordReader > --- > > Key: PARQUET-2111 > URL: https://issues.apache.org/jira/browse/PARQUET-2111 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Jackey Lee >Priority: Major > > With limit push down, it can stop scanning parquet early, and reduce network > and disk IO. -- This message was sent by Atlassian Jira (v8.20.1#820001)
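[Editor's note] To make the benefit concrete: today a caller can only stop consuming records, not stop the scan itself. A hedged sketch of that client-side pattern; the reader-side pushdown this ticket proposes would cut the page fetches and decoding as well:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class LimitedScan {
  public static void main(String[] args) throws Exception {
    long limit = 1000;
    long count = 0;
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(args[0])).build()) {
      Group record;
      // Stop pulling records once the limit is reached; pages already
      // fetched and decoded are still paid for, which is what a reader-side
      // limit pushdown would avoid.
      while (count < limit && (record = reader.read()) != null) {
        count++;
      }
    }
  }
}
{code}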
[jira] [Resolved] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-2071. -- Resolution: Fixed > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data to the encrypted state at the page level, > without decoding records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (PARQUET-1872) Add TransCompression Feature
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-1872. -- Resolution: Fixed > Add TransCompression Feature > - > > Key: PARQUET-1872 > URL: https://issues.apache.org/jira/browse/PARQUET-1872 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > As ZSTD becomes more popular, there is a need to translate existing data to > ZSTD compression, which can achieve a higher compression ratio. It would be > useful to have a tool that converts a Parquet file directly by just > decompressing/compressing each page, without decoding/encoding or assembling > the records, because it is much faster. The initial result shows it is ~5 times > faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (PARQUET-2105) Refactor the test code of creating the test file
[ https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-2105. -- Resolution: Fixed > Refactor the test code of creating the test file > - > > Key: PARQUET-2105 > URL: https://issues.apache.org/jira/browse/PARQUET-2105 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > In the tests, there are many places that need to create a test parquet file > with different settings. Currently, each test just writes its own creation code. > It would be better to have a test file builder for that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
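[Editor's note] A sketch of what such a builder could look like; TestFileBuilder, EncryptionTestFile, and every method here are illustrative names for the proposal, not an existing API:

{code:java}
// Hypothetical test-utility API consolidating per-test file creation;
// all names are illustrative.
EncryptionTestFile testFile = new TestFileBuilder(conf, schema)
    .withCodec("GZIP")
    .withPageSize(1024)
    .withRowGroupSize(8 * 1024 * 1024)
    .withNumRecord(100)
    .build();
String path = testFile.getFileName();
{code}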
[jira] [Commented] (PARQUET-1889) Register a MIME type for the Parquet format.
[ https://issues.apache.org/jira/browse/PARQUET-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17473147#comment-17473147 ] Xinli Shang commented on PARQUET-1889: -- +1 on [~westonpace]'s point > Register a MIME type for the Parquet format. > > > Key: PARQUET-1889 > URL: https://issues.apache.org/jira/browse/PARQUET-1889 > Project: Parquet > Issue Type: Wish > Components: parquet-format >Affects Versions: format-2.7.0 >Reporter: Mark Wood >Priority: Major > > There is currently no MIME type registered for Parquet. Perhaps this is > intentional. > If it is not intentional, I suggest steps be taken to register a MIME type > with IANA. > > [https://www.iana.org/assignments/media-types/media-types.xhtml] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (PARQUET-1911) Add way to disables statistics on a per column basis
[ https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468759#comment-17468759 ] Xinli Shang commented on PARQUET-1911: -- [~panthony] Thanks for working on this! Just FYI, there was an effort to truncate the min/max: https://issues.apache.org/jira/browse/PARQUET-1685. It can be enabled with a flag. With that said, your changes are still welcome. Feel free to create a PR if you haven't and I will review it. > Add way to disables statistics on a per column basis > > > Key: PARQUET-1911 > URL: https://issues.apache.org/jira/browse/PARQUET-1911 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Anthony Pessy >Priority: Major > Attachments: NoOpStatistics.java, > add_config_to_opt-out_of_a_column's_statistics.patch > > > When you write datasets with BINARY columns that can be fairly large (several > MBs), you can often end up with an OutOfMemory error where you either have to: > > - Throw more RAM at it > - Increase the number of output files > - Play with the block size > > Using a fork with an increased check frequency for the row group size helps, but it > is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470]) > > > The OutOfMemory error is caused by the accumulation of min/max values > for those columns for each BlockMetaData. > > The "parquet.statistics.truncate.length" configuration is of no help because > it is applied during footer serialization, whereas the OOM occurs before > that. > > I think it would be nice to have, as for dictionary or bloom filter, a way > to disable the statistics on a per-column basis. > > It could be very useful to lower memory consumption when stats of a huge binary > column are unnecessary. > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
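[Editor's note] For reference, the PARQUET-1685 knob mentioned in the comment; the property name is the one quoted in the ticket. Note it only truncates at footer-serialization time, so it does not relieve the in-memory accumulation described above:

{code:java}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Cap serialized min/max statistics at 64 bytes per value.
conf.setInt("parquet.statistics.truncate.length", 64);
{code}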
[jira] [Resolved] (PARQUET-1874) Add to parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-1874. -- Resolution: Fixed > Add to parquet-cli > -- > > Key: PARQUET-1874 > URL: https://issues.apache.org/jira/browse/PARQUET-1874 > Project: Parquet > Issue Type: Sub-task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (PARQUET-1873) Add to Parquet-tools
[ https://issues.apache.org/jira/browse/PARQUET-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-1873. -- Resolution: Fixed > Add to Parquet-tools > - > > Key: PARQUET-1873 > URL: https://issues.apache.org/jira/browse/PARQUET-1873 > Project: Parquet > Issue Type: Sub-task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (PARQUET-1396) EncryptionPropertiesFactory and DecryptionPropertiesFactory
[ https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-1396: - Summary: EncryptionPropertiesFactory and DecryptionPropertiesFactory (was: Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory) > EncryptionPropertiesFactory and DecryptionPropertiesFactory > --- > > Key: PARQUET-1396 > URL: https://issues.apache.org/jira/browse/PARQUET-1396 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.10.0, 1.10.1 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > This JIRA is an extension to the Parquet Modular Encryption JIRA (PARQUET-1178) > that provides the basic building blocks and APIs for encryption support. > This JIRA provides a crypto data interface for schema activation of Parquet > encryption and serves as a high-level layer on top of PARQUET-1178 to make > its adoption easier, with a pluggable key access module and > without a need to use the low-level encryption APIs. Also, this feature will > enable seamless integration with existing clients. > There is no change to specifications (parquet-format), no new Parquet APIs, and no > changes in existing Parquet APIs. All current applications, tests, etc. will > work. > From a developer's perspective, they can just implement the interface as a > plugin which can be attached to any Parquet application like Hive/Spark etc. > This decouples the complexity of dealing with KMS and schema from Parquet > applications. A large organization may have hundreds or even thousands > of Parquet applications and pipelines. The decoupling makes Parquet > encryption easier to adopt. > From an end user's (for example, a data owner's) perspective, if they think a column is > sensitive, they can just mark that column's schema as sensitive and the > Parquet application encrypts that column automatically. This makes it easy for end > users to manage the encryption of their columns. -- This message was sent by Atlassian Jira (v8.20.1#820001)
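[Editor's note] A minimal sketch of the plugin shape; the interface and method signature are as in parquet-mr 1.12, while the body (which would consult the schema and a KMS) is an assumption left as comments:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.EncryptionPropertiesFactory;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext;

// Registered via the "parquet.crypto.factory.class" Hadoop property; the
// writer calls the factory for every file it creates.
public class SchemaActivatedCryptoFactory implements EncryptionPropertiesFactory {
  @Override
  public FileEncryptionProperties getFileEncryptionProperties(
      Configuration fileHadoopConfig, Path tempFilePath, WriteContext fileWriteContext) {
    // Inspect fileWriteContext.getSchema() for columns marked sensitive,
    // fetch their keys from the KMS, and build FileEncryptionProperties
    // for them. Returning null writes the file unencrypted.
    return null;
  }
}
{code}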
[jira] [Updated] (PARQUET-1872) Add TransCompression Feature
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-1872: - Summary: Add TransCompression Feature (was: Add TransCompression command ) > Add TransCompression Feature > - > > Key: PARQUET-1872 > URL: https://issues.apache.org/jira/browse/PARQUET-1872 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > As ZSTD becomes more popular, there is a need to translate existing data to > ZSTD compression, which can achieve a higher compression ratio. It would be > useful to have a tool that converts a Parquet file directly by just > decompressing/compressing each page, without decoding/encoding or assembling > the records, because it is much faster. The initial result shows it is ~5 times > faster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (PARQUET-2105) Refactor the test code of creating the test file
Xinli Shang created PARQUET-2105: Summary: Refactor the test code of creating the test file Key: PARQUET-2105 URL: https://issues.apache.org/jira/browse/PARQUET-2105 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang In the tests, there are many places that need to create a test parquet file with different settings. Currently, each test just writes its own creation code. It would be better to have a test file builder for that. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (PARQUET-2098) Add more methods into interface of BlockCipher
Xinli Shang created PARQUET-2098: Summary: Add more methods into interface of BlockCipher Key: PARQUET-2098 URL: https://issues.apache.org/jira/browse/PARQUET-2098 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang Currently the BlockCipher interface has methods that don't let the caller specify a length/offset. In some use cases like Presto, it is necessary to pass in a byte array where the data to be encrypted only occupies part of the array. So we need to add a new method, something like the one below, for decrypt. Similar methods might be needed for encrypt. byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] aad); -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (PARQUET-2027) Merging parquet files created in 1.11.1 not possible using 1.12.0
[ https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang closed PARQUET-2027. > Merging parquet files created in 1.11.1 not possible using 1.12.0 > -- > > Key: PARQUET-2027 > URL: https://issues.apache.org/jira/browse/PARQUET-2027 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Matthew M >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.12.1 > > > I have parquet files created using 1.11.1. In the process I join two files > (with the same schema) into one output file. I create a Hadoop writer: > {code:scala} > val hadoopWriter = new ParquetFileWriter( > HadoopOutputFile.fromPath( > new Path(outputPath.toString), > new Configuration() > ), outputSchema, Mode.OVERWRITE, > 8 * 1024 * 1024, > 2097152, > DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, > DEFAULT_STATISTICS_TRUNCATE_LENGTH, > DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED > ) > hadoopWriter.start() > {code} > and try to append one file to another: > {code:scala} > hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new > Configuration())) > {code} > Everything works on 1.11.1. But when I switched to 1.12.0, it fails with > this error: > {code:scala} > STDERR: Exception in thread "main" java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4 > at org.apache.parquet.format.Util.read(Util.java:365) > at org.apache.parquet.format.Util.readPageHeader(Util.java:132) > at org.apache.parquet.format.Util.readPageHeader(Util.java:127) > at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75) > at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58) > at > org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998) > at > org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918) > at > org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888) > at > org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895) > at [...] > Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: > Required field 'uncompressed_page_size' was not found in serialized data! > Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4 > at > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108) > at > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019) > at org.apache.parquet.format.PageHeader.read(PageHeader.java:896) > at org.apache.parquet.format.Util.read(Util.java:362) > ... 14 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version
[ https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang closed PARQUET-2078. > Failed to read parquet file after writing with the same parquet version > --- > > Key: PARQUET-2078 > URL: https://issues.apache.org/jira/browse/PARQUET-2078 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Nemon Lou >Assignee: Nemon Lou >Priority: Critical > Fix For: 1.13.0, 1.12.1 > > Attachments: > PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, > tpcds_customer_footer.json > > > Writing parquet file with version 1.12.0 in Apache Hive, then read that > file, returns the following error: > {noformat} > Caused by: java.lang.IllegalStateException: All of the offsets in the split > should be found in the file. expected: [4, 133961161] found: > [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED > [c_customer_sk] optional int64 c_customer_sk [PLAIN, RLE, BIT_PACKED], 4}, > ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id > (STRING) [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED > [c_current_cdemo_sk] optional int64 c_current_cdemo_sk [PLAIN, RLE, > BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] > optional int64 c_current_hdemo_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 > c_current_addr_sk [PLAIN, RLE, BIT_PACKED], 57421932}, > ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 > c_first_shipto_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, > ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 > c_first_sales_date_sk [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, > ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, > ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, > ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name > (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, > ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary > c_preferred_cust_flag (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], > 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 > c_birth_day [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, > ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month > [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED > [c_birth_year] optional int32 c_birth_year [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] > optional binary c_birth_country (STRING) [RLE, PLAIN_DICTIONARY, > BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary > c_login (STRING) [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, > ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address > (STRING) [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED > [c_last_review_date_sk] optional int64 c_last_review_date_sk [RLE, > PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}] > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0] > at > 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) ~[?:1.8.0_292] > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ~[?:1.8.0_292] > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > ~[?:1.8.0_292] > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > ~[?:1.8.0_292] > at > org.apache.hado
[jira] [Created] (PARQUET-2093) Add rewriter version to Parquet footer
Xinli Shang created PARQUET-2093: Summary: Add rewriter version to Parquet footer Key: PARQUET-2093 URL: https://issues.apache.org/jira/browse/PARQUET-2093 Project: Parquet Issue Type: Improvement Affects Versions: 1.13.0 Reporter: Xinli Shang Assignee: Xinli Shang The Parquet footer records the writer's version in the 'created_by' field. As we introduce several rewriters, a new file may be written partially by a rewriter. In this case, we need to record the rewriter's version as well. Some questions (about a common rewriter) we need to answer before stepping forward: What would be the place for the rewriter versions? (A new specific field or key-value metadata? Which key shall we use?) Shall we somehow also save what the rewriter has done? How? At what level shall we copy the original created_by field, and at what level shall we write the version of the rewriter to that field instead? (What different levels are possible?) Once this rewriter field is introduced, any writer-version-dependent fix needs to check this field as well, not only the created_by one. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2075) Unified Rewriter Tool
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2075: - External issue URL: https://docs.google.com/document/d/1Ryt5uXnp-YwOrsnIDrGdTMoFfbOaTM6X39pBLHOa_50 > Unified Rewriter Tool > --- > > Key: PARQUET-2075 > URL: https://issues.apache.org/jira/browse/PARQUET-2075 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > During the discussion of PARQUET-2071, we came up with the idea of a > universal tool that translates an existing file to a different state while > skipping some steps like encoding/decoding, to gain speed. For example, > only decompress pages and then compress them directly. For PARQUET-2071, we only > decrypt and then encrypt directly. This will be useful for onboarding existing data > to Parquet features like column encryption, zstd etc. > We already have tools like trans-compression, column pruning etc. We will > consolidate all these tools into this universal tool. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2075) Unified Rewriter Tool
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2075: - Summary: Unified Rewriter Tool (was: Unified translation tool) > Unified Rewriter Tool > --- > > Key: PARQUET-2075 > URL: https://issues.apache.org/jira/browse/PARQUET-2075 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > During the discussion of PARQUET-2071, we came up with the idea of a > universal tool that translates an existing file to a different state while > skipping some steps like encoding/decoding, to gain speed. For example, > only decompress pages and then compress them directly. For PARQUET-2071, we only > decrypt and then encrypt directly. This will be useful for onboarding existing data > to Parquet features like column encryption, zstd etc. > We already have tools like trans-compression, column pruning etc. We will > consolidate all these tools into this universal tool. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-2087) Release parquet-mr 1.12.1
[ https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang resolved PARQUET-2087. -- Resolution: Fixed > Release parquet-mr 1.12.1 > - > > Key: PARQUET-2087 > URL: https://issues.apache.org/jira/browse/PARQUET-2087 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043
[ https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416806#comment-17416806 ] Xinli Shang commented on PARQUET-2091: -- There are no issues during a normal build, but the error shows up when running the release command. > Fix release build error introduced by PARQUET-2043 > -- > > Key: PARQUET-2091 > URL: https://issues.apache.org/jira/browse/PARQUET-2091 > Project: Parquet > Issue Type: Task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > After PARQUET-2043, when building for a release like 1.12.1, there is a build > error complaining about a 'used undeclared dependency'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2091) Fix release build error introduced by PARQUET-2043
Xinli Shang created PARQUET-2091: Summary: Fix release build error introduced by PARQUET-2043 Key: PARQUET-2091 URL: https://issues.apache.org/jira/browse/PARQUET-2091 Project: Parquet Issue Type: Task Reporter: Xinli Shang Assignee: Xinli Shang After PARQUET-2043, when building for a release like 1.12.1, there is a build error complaining about a 'used undeclared dependency'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2087) Release parquet-mr 1.12.0
Xinli Shang created PARQUET-2087: Summary: Release parquet-mr 1.12.0 Key: PARQUET-2087 URL: https://issues.apache.org/jira/browse/PARQUET-2087 Project: Parquet Issue Type: Task Components: parquet-mr Reporter: Xinli Shang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-2087) Release parquet-mr 1.12.1
[ https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang reassigned PARQUET-2087: Assignee: Xinli Shang Due Date: 18/Sep/21 > Release parquet-mr 1.12.1 > - > > Key: PARQUET-2087 > URL: https://issues.apache.org/jira/browse/PARQUET-2087 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2087) Release parquet-mr 1.12.1
[ https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2087: - Summary: Release parquet-mr 1.12.1 (was: Release parquet-mr 1.12.0) > Release parquet-mr 1.12.1 > - > > Key: PARQUET-2087 > URL: https://issues.apache.org/jira/browse/PARQUET-2087 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2082) Encryption translation tool - Parquet-cli
Xinli Shang created PARQUET-2082: Summary: Encryption translation tool - Parquet-cli Key: PARQUET-2082 URL: https://issues.apache.org/jira/browse/PARQUET-2082 Project: Parquet Issue Type: Task Reporter: Xinli Shang This is to implement the parquet-cli part of the encryption translation tool. It integrates with key tools to build the encryption properties, handles the parameters, and calls the parquet-hadoop API to encrypt. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2081) Encryption translation tool - Parquet-hadoop
Xinli Shang created PARQUET-2081: Summary: Encryption translation tool - Parquet-hadoop Key: PARQUET-2081 URL: https://issues.apache.org/jira/browse/PARQUET-2081 Project: Parquet Issue Type: Task Components: parquet-mr Reporter: Xinli Shang Fix For: 1.13.0 This implements the core part of the encryption translation tool in parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to integrate with key tools for the encryption properties. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402670#comment-17402670 ] Xinli Shang edited comment on PARQUET-2071 at 8/21/21, 5:40 PM: I just drafted the tool and asked [~gershinsky] to have an early look (thanks Gidon!). It is working now, and I compared it with a regular tool (I simply wrote a tool that reads each record and writes it back immediately; I have the code example in the [doc|https://docs.google.com/document/d/1-XdE8-QyDHnBsYrClwNsR8X3ks0JmKJ1-rXq7_th0hc/edit]). The result is promising: it is about 20X faster than the regular tool. [~gszadovszky] Are you open to having the tool merged in first, and then we refactor all the existing similar tools into the universal tool? If yes, I am going to make a PR shortly. was (Author: sha...@uber.com): I just drafted the tool and asked [~gershinsky] to have an early look (thanks Gidon!). It is working now, and I compared it with a regular tool (I simply wrote a tool that reads each record and writes it back immediately). The result is promising: it is about 20X faster than the regular tool. [~gszadovszky] Are you open to having the tool merged in first, and then we refactor all the existing similar tools into the universal tool? If yes, I am going to make a PR shortly. > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data at the page level without > reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
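For context on the baseline in that comparison: a "regular" rewrite tool decodes every record and re-encodes it on write, along the lines of the sketch below. This is a minimal sketch under assumptions (Avro GenericRecords, input/output paths from argv), not the actual tool from the linked doc.
{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class RecordCopyBaseline {
  public static void main(String[] args) throws Exception {
    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(in).build()) {
      GenericRecord first = reader.read();
      if (first == null) {
        return; // empty input, nothing to copy
      }
      try (ParquetWriter<GenericRecord> writer =
          AvroParquetWriter.<GenericRecord>builder(out)
              .withSchema(first.getSchema())
              .build()) {
        // Every record is fully decoded and re-encoded -- this is exactly
        // the per-record cost the page-level translation tool avoids.
        for (GenericRecord record = first; record != null; record = reader.read()) {
          writer.write(record);
        }
      }
    }
  }
}
{code}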
[jira] [Commented] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402670#comment-17402670 ] Xinli Shang commented on PARQUET-2071: -- I just drafted the tool and asked [~gershinsky] to have an early look (thanks Gidon!). It is working now, and I compared it with a regular tool (I simply wrote a tool that reads each record and writes it back immediately). The result is promising: it is about 20X faster than the regular tool. [~gszadovszky] Are you open to having the tool merged in first, and then we refactor all the existing similar tools into the universal tool? If yes, I am going to make a PR shortly. > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data at the page level without > reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2075) Unified translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2075: - External issue ID: https://issues.apache.org/jira/browse/PARQUET-2071 > Unified translation tool > -- > > Key: PARQUET-2075 > URL: https://issues.apache.org/jira/browse/PARQUET-2075 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > During the discussion of PARQUET-2071, we came up with the idea of a > universal tool that translates an existing file to a different state while > skipping some steps, such as encoding/decoding, to gain speed. For example, > only decompress pages and then compress them directly; for PARQUET-2071, only > decrypt and then encrypt directly. This will be useful for onboarding existing > data to Parquet features like column encryption, ZSTD, etc. > We already have tools like trans-compression and column pruning; we will > consolidate all these tools into this universal tool. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2071: - External issue ID: https://issues.apache.org/jira/browse/PARQUET-2075 > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data at the page level without > reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394098#comment-17394098 ] Xinli Shang commented on PARQUET-2071: -- Thanks, Gabor and Gidon! I think the 'universal tool' idea, loaded differently for different use cases, is a good one. I opened https://issues.apache.org/jira/browse/PARQUET-2075 for it. > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data at the page level without > reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2075) Unified translation tool
Xinli Shang created PARQUET-2075: Summary: Unified translation tool Key: PARQUET-2075 URL: https://issues.apache.org/jira/browse/PARQUET-2075 Project: Parquet Issue Type: New Feature Reporter: Xinli Shang Assignee: Xinli Shang During the discussion of PARQUET-2071, we came up with the idea of a universal tool that translates an existing file to a different state while skipping some steps, such as encoding/decoding, to gain speed. For example, only decompress pages and then compress them directly; for PARQUET-2071, only decrypt and then encrypt directly. This will be useful for onboarding existing data to Parquet features like column encryption, ZSTD, etc. We already have tools like trans-compression and column pruning; we will consolidate all these tools into this universal tool. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2071) Encryption translation tool
[ https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-2071: - External issue URL: https://docs.google.com/document/d/1-XdE8-QyDHnBsYrClwNsR8X3ks0JmKJ1-rXq7_th0hc/edit# > Encryption translation tool > > > Key: PARQUET-2071 > URL: https://issues.apache.org/jira/browse/PARQUET-2071 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When translating existing data to an encrypted state, we could develop a tool > like TransCompression that translates the data at the page level without > reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2071) Encryption translation tool
Xinli Shang created PARQUET-2071: Summary: Encryption translation tool Key: PARQUET-2071 URL: https://issues.apache.org/jira/browse/PARQUET-2071 Project: Parquet Issue Type: New Feature Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang When translating existing data to an encrypted state, we could develop a tool like TransCompression that translates the data at the page level without reading records and rewriting them. This will speed up the process a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-2064) Make Range public accessible in RowRanges
[ https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379217#comment-17379217 ] Xinli Shang commented on PARQUET-2064: -- [~gszadovszky], do you have some suggestions on how to proceed? The reality is that Spark/Hive use lower-level APIs that were not designed for this, and it is now a blocker for rolling out the column index. > Make Range public accessible in RowRanges > - > > Key: PARQUET-2064 > URL: https://issues.apache.org/jira/browse/PARQUET-2064 > Project: Parquet > Issue Type: New Feature >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When rolling out to Presto, I found we need to know the boundaries of each > Range in RowRanges. It is still doable with the iterator, but Presto has a batch > reader, so we cannot use an iterator for each row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2064) Make Range public accessible in RowRanges
Xinli Shang created PARQUET-2064: Summary: Make Range public accessible in RowRanges Key: PARQUET-2064 URL: https://issues.apache.org/jira/browse/PARQUET-2064 Project: Parquet Issue Type: New Feature Reporter: Xinli Shang Assignee: Xinli Shang When rolling out to Presto, I found we need to know the boundaries of each Range in RowRanges. It is still doable with the iterator, but Presto has a batch reader, so we cannot use an iterator for each row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
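Until Range is publicly accessible, the boundaries can be recovered from the public row iterator, roughly as below; a batch reader can then read each contiguous span at once. This is a workaround sketch, assuming RowRanges.iterator() yields row indexes in increasing order (as the column-index filtering code produces them), not the API this JIRA proposes.
{code:java}
import java.util.PrimitiveIterator;

import org.apache.parquet.internal.filter2.columnindex.RowRanges;

public class RowRangeBoundaries {
  // Rebuilds contiguous [start, end] spans from the row-by-row iterator.
  public static void forEachRange(RowRanges rowRanges) {
    PrimitiveIterator.OfLong rows = rowRanges.iterator();
    if (!rows.hasNext()) {
      return;
    }
    long start = rows.nextLong();
    long prev = start;
    while (rows.hasNext()) {
      long row = rows.nextLong();
      if (row != prev + 1) { // gap found: close the current span
        System.out.println("range [" + start + ", " + prev + "]");
        start = row;
      }
      prev = row;
    }
    System.out.println("range [" + start + ", " + prev + "]");
  }
}
{code}
Making Range itself public would let batch readers get these boundaries directly instead of re-deriving them row by row.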
[jira] [Commented] (PARQUET-2062) Data masking(null) for column encryption
[ https://issues.apache.org/jira/browse/PARQUET-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374892#comment-17374892 ] Xinli Shang commented on PARQUET-2062: -- Great idea! > Data masking(null) for column encryption > - > > Key: PARQUET-2062 > URL: https://issues.apache.org/jira/browse/PARQUET-2062 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > When a user doesn't have permission on a column that is encrypted by the column > encryption feature (PARQUET-1178), returning a masked value could avoid an > exception and let the call succeed. > We would like to introduce data masking with null values. The idea is that when > the user gets key-access-denied and can accept null (via a reading option flag), > we would return null for the encrypted columns. This solution doesn't need to > save extra columns for masked values and doesn't need to translate existing data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-1792: - Fix Version/s: 1.12.0 > Add 'mask' command to parquet-tools/parquet-cli > --- > > Key: PARQUET-1792 > URL: https://issues.apache.org/jira/browse/PARQUET-1792 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > Some personal data columns need to be masked instead of being > pruned (PARQUET-1791). We need a tool to replace the raw data columns with > masked values. The masked value could be a hash, null, redacted, etc. The > unchanged columns should be copied as a whole, like with the 'merge' and 'prune' > commands in parquet-tools. > > Implementing this feature in the file format is 10X faster than doing it by > rewriting the table data in the query engine. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372862#comment-17372862 ] Xinli Shang commented on PARQUET-1681: -- We chose to revert the behavior back to 1.8.1. It has been running fine for a year or so. We will port the changes soon. > Avro's isElementType() change breaks the reading of some parquet(1.8.1) files > - > > Key: PARQUET-1681 > URL: https://issues.apache.org/jira/browse/PARQUET-1681 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.10.0, 1.9.1, 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Critical > > When using the Avro schema below to write a parquet(1.8.1) file and then reading it > back with parquet 1.10.1 without passing any schema, the read throws > the exception "XXX is not a group". Reading through parquet 1.8.1 is fine. > { > "name": "phones", > "type": [ > "null", > { > "type": "array", > "items": { > "type": "record", > "name": "phones_items", > "fields": [ > > { "name": "phone_number", > "type": [ "null", > "string" ], "default": null > } > ] > } > } > ], > "default": null > } > The code to read is as below: > val reader = > AvroParquetReader._builder_[SomeRecordType](parquetPath).withConf(*new* > Configuration).build() > reader.read() > PARQUET-651 changed the method isElementType() by relying on Avro's > checkReaderWriterCompatibility() to check the compatibility. However, > checkReaderWriterCompatibility() considers the Parquet schema and the > Avro schema (converted from the file schema) as not compatible (the name in the Avro > schema is 'phones_items', but the name is 'array' in the Parquet schema, hence > not compatible). Hence it returns false, which caused the "phone_number" field in > the above schema to be considered a group type, which is not true. The > exception is then thrown at .asGroupType(). > I didn't verify whether writing via parquet 1.10.1 would reproduce the same problem. > But it could, because the translation of the Avro schema to the Parquet schema has > not changed (not verified yet). > I hesitate to revert PARQUET-651 because it solved several problems. I would > like to hear the community's thoughts on it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2062) Data masking(null) for column encryption
Xinli Shang created PARQUET-2062: Summary: Data masking(null) for column encryption Key: PARQUET-2062 URL: https://issues.apache.org/jira/browse/PARQUET-2062 Project: Parquet Issue Type: New Feature Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang When a user doesn't have permission on a column that is encrypted by the column encryption feature (PARQUET-1178), returning a masked value could avoid an exception and let the call succeed. We would like to introduce data masking with null values. The idea is that when the user gets key-access-denied and can accept null (via a reading option flag), we would return null for the encrypted columns. This solution doesn't need to save extra columns for masked values and doesn't need to translate existing data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
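If the opt-in flag ends up as a plain read property, enabling it could be as small as the sketch below. Both the property name and the mechanism are hypothetical here; the JIRA only proposes that some reading-option flag exist.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class MaskedReadExample {
  public static Configuration maskedReadConf() {
    Configuration conf = new Configuration();
    // Hypothetical property name, for illustration only: with such a flag on,
    // a key-access-denied column would decode as nulls instead of throwing.
    conf.setBoolean("parquet.crypto.mask.denied.columns.as.null", true);
    return conf;
  }
}
{code}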
[jira] [Created] (PARQUET-2054) TCP connection leaking when calling appendFile()
Xinli Shang created PARQUET-2054: Summary: TCP connection leaking when calling appendFile() Key: PARQUET-2054 URL: https://issues.apache.org/jira/browse/PARQUET-2054 Project: Parquet Issue Type: New Feature Components: parquet-mr Reporter: Xinli Shang When appendFile() is called, the source file is opened for reading but never closed. This caused many TCP connections to leak. -- This message was sent by Atlassian Jira (v8.3.4#803005)
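The fix pattern is the usual one: close the reader once its row groups have been copied. Below is a minimal sketch of the caller-side shape, assuming the public ParquetFileReader/ParquetFileWriter append path; the actual patch lives inside parquet-mr.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class AppendWithoutLeak {
  public static void append(ParquetFileWriter writer, Path source, Configuration conf)
      throws IOException {
    // try-with-resources guarantees the input stream (and its TCP connection)
    // is released even if the copy fails midway.
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(source, conf))) {
      reader.appendTo(writer); // copies the row groups into the open writer
    }
  }
}
{code}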
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351199#comment-17351199 ] Xinli Shang commented on PARQUET-1968: -- Go ahead and work on it. Thanks, Huaxin! > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Priority: Major > > FilterApi should support a native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
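Until a native In predicate lands, callers have to expand IN into a chain of OR'ed EQ predicates against the existing FilterApi, along the lines below. The column name 'x' and the value set are examples only.
{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.or;

import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

public class InPredicateWorkaround {
  // Emulates "x IN (1, 2, 3)" by OR-ing equality predicates, which is what
  // callers must do while FilterApi has no native In predicate.
  public static FilterPredicate xIn123() {
    IntColumn x = intColumn("x");
    return or(or(eq(x, 1), eq(x, 2)), eq(x, 3));
  }
}
{code}
A native In predicate would avoid building (and evaluating) these nested OR trees for large value sets.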
[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313524#comment-17313524 ] Xinli Shang commented on PARQUET-1827: -- It seems the storage size is reduced by ~8% for the UUID column. > UUID type currently not supported by parquet-mr > --- > > Key: PARQUET-1827 > URL: https://issues.apache.org/jira/browse/PARQUET-1827 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Brad Smith >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > The parquet-format project introduced a new UUID logical type in version 2.4: > [https://github.com/apache/parquet-format/blob/master/CHANGES.md] > This would be a useful type to have available in some circumstances, but it > currently isn't supported in the parquet-mr library. Hopefully this feature > can be implemented at some point. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2006) Column resolution by ID
Xinli Shang created PARQUET-2006: Summary: Column resolution by ID Key: PARQUET-2006 URL: https://issues.apache.org/jira/browse/PARQUET-2006 Project: Parquet Issue Type: New Feature Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang Parquet resolves columns by name. In a lot of usages, e.g. schema resolution, this is a problem. Iceberg uses IDs and stores ID/name mappings. This Jira is to add column ID resolution support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1992) Cannot build from tarball because of git submodules
[ https://issues.apache.org/jira/browse/PARQUET-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296124#comment-17296124 ] Xinli Shang commented on PARQUET-1992: -- I think we shouldn't let it fail when developers run 'mvn package/install' or 'mvn verify', at least not when they haven't made any changes. So I like the idea of downloading directly. I will review the code once it passes the build. > Cannot build from tarball because of git submodules > --- > > Key: PARQUET-1992 > URL: https://issues.apache.org/jira/browse/PARQUET-1992 > Project: Parquet > Issue Type: Bug >Reporter: Gabor Szadovszky >Priority: Blocker > > Because we use git submodules (to get test parquet files), a simple "mvn clean > install" fails from the unpacked tarball due to "not a git repository". > I think we would have 2 options to solve this situation: > * Include all the required files (even those only for testing) in the tarball and > somehow avoid the git submodule update in case of execution in a non-git > environment > * Make the downloading of the parquet files and the related tests optional so > it won't fail the build from the tarball -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1948) TransCompressionCommand Inoperable
[ https://issues.apache.org/jira/browse/PARQUET-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286825#comment-17286825 ] Xinli Shang commented on PARQUET-1948: -- [~vanhooser], glad to see your interest in this tool. We have been using it to translate GZIP to ZSTD for existing parquet files. Let me know if you hit any issues. > TransCompressionCommand Inoperable > -- > > Key: PARQUET-1948 > URL: https://issues.apache.org/jira/browse/PARQUET-1948 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.1 > Environment: I am using parquet-tools 1.11.1 on a Mac machine running > Catalina, and my parquet-tools jar was downloaded from Maven Central. >Reporter: Shelby Vanhooser >Priority: Blocker > Labels: parquet-tools > > {{TransCompressionCommand}} in parquet-tools is intended to allow translation > of compression types in parquet files. We are intending to use this > functionality to debug a corrupted file, but this command currently fails to run > at all. > Running the following command (on the uncorrupted file): > {code:java} > java -jar ./parquet-tools-1.11.1.jar trans-compression > ~/Downloads/part-00048-69f65188-94b5-4772-8906-5c78989240b5_00048.c000.snappy.parquet{code} > This results in > > {code:java} > Unknown command: trans-compression{code} > > I believe this is due to the Registry class [silently catching any errors during > initialization|https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/Registry.java#L65], > which subsequently is [misinterpreted as an unknown > command|https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/Main.java#L200]. > We need to: > # Write a test for the TransCompressionCommand to figure out why it's > showing up as an unknown command > # Probably expand these tests to cover all the other commands > > This will then unblock our debugging work on the suspect file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276664#comment-17276664 ] Xinli Shang commented on PARQUET-1968: -- Sure, will connect with you shortly. > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Priority: Major > > FilterApi should support native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276533#comment-17276533 ] Xinli Shang commented on PARQUET-1968: -- Hi [~rdblue]. We didn't discuss it in last week's Parquet sync meeting since you were not there. The next Parquet sync is Feb 23rd at 9:00am. I just added you explicitly with your Netflix email account. Hopefully, you can join. > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Priority: Major > > FilterApi should support a native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1949) Mark Parquet-1872 with not support bloom filter yet
[ https://issues.apache.org/jira/browse/PARQUET-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-1949: - Summary: Mark Parquet-1872 with not support bloom filter yet (was: Mark Parquet-1872 with note support bloom filter yet) > Mark Parquet-1872 with not support bloom filter yet > > > Key: PARQUET-1949 > URL: https://issues.apache.org/jira/browse/PARQUET-1949 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > To unblock the release of 1.12.0, we need to add comments in the > trans-compression command to indicate 'not support bloom filter yet'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244334#comment-17244334 ] Xinli Shang commented on PARQUET-1872: -- Thanks [~gszadovszky] for working on this! I just created the PR to add the comments in the command. Once you review it and we merge it, I will resolve this Jira. > Add TransCompression command > - > > Key: PARQUET-1872 > URL: https://issues.apache.org/jira/browse/PARQUET-1872 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > > As ZSTD becomes more popular, there is a need to translate existing data to > ZSTD compression, which can achieve a higher compression ratio. It would be > useful to have a tool that converts a Parquet file directly by just > decompressing/compressing each page, without decoding/encoding or assembling > the records, because that is much faster. The initial result shows it is ~5 times > faster. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1949) Mark Parquet-1872 with note support bloom filter yet
Xinli Shang created PARQUET-1949: Summary: Mark Parquet-1872 with note support bloom filter yet Key: PARQUET-1949 URL: https://issues.apache.org/jira/browse/PARQUET-1949 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.12.0 Reporter: Xinli Shang Assignee: Xinli Shang Fix For: 1.12.0 To unblock the release of 1.12.0, we need to add comments in the trans-compression command to indicate 'not support bloom filter yet'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex
[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242634#comment-17242634 ] Xinli Shang commented on PARQUET-1901: -- For now, I think we can move it to the next release. > Add filter null check for ColumnIndex > --- > > Key: PARQUET-1901 > URL: https://issues.apache.org/jira/browse/PARQUET-1901 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > This Jira is opened for discussion: should we add null checking for the > filter when ColumnIndex is enabled? > In the ColumnIndexFilter#calculateRowRanges() method, the input parameter > 'filter' is assumed to be non-null without checking. It throws an NPE when > ColumnIndex is enabled (by default) but no filter is set in the > ParquetReadOptions. The call stack is as below. > java.lang.NullPointerException > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) > If we don't add it, the user might need to choose between calling readNextRowGroup() > and readNextFilteredRowGroup() accordingly, based on filter existence. > Thoughts? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
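Pending that decision, a caller-side guard avoids the NPE by taking the ColumnIndex path only when a real filter is present. A minimal sketch (the no-op filter check mirrors what an internal null check would accomplish):
{code:java}
import java.io.IOException;

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetFileReader;

public class NextRowGroupGuard {
  // Only use the ColumnIndex-filtered read when a usable filter exists;
  // otherwise fall back to the plain row-group read and avoid the NPE.
  public static PageReadStore nextRowGroup(ParquetFileReader reader, FilterCompat.Filter filter)
      throws IOException {
    if (filter == null || filter instanceof FilterCompat.NoOpFilter) {
      return reader.readNextRowGroup();
    }
    return reader.readNextFilteredRowGroup();
  }
}
{code}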
[jira] [Comment Edited] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242631#comment-17242631 ] Xinli Shang edited comment on PARQUET-1927 at 12/2/20, 7:05 PM: It was still not decided in the last Iceberg meeting. But if adding the 'skipped number of records' is a minimal change for us, I think we can go ahead and add it. Otherwise, we can release without it. Adding [~rdblue] FYI was (Author: sha...@uber.com): It was still not decided in the last Iceberg meeting. But if adding the 'skipped number of records' is a minimal change for us, I think we can go ahead and add it. Otherwise, we can release without it. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242631#comment-17242631 ] Xinli Shang commented on PARQUET-1927: -- It was still not decided in the last Iceberg meeting. But if adding the 'skipped number of records' is a minimal change for us, I think we can go ahead and add it. Otherwise, we can release without it. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1666) Remove Unused Modules
[ https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242625#comment-17242625 ] Xinli Shang commented on PARQUET-1666: -- I think adding "-deprecated" is a good idea. [~zhenxiao], can you help us find out whether dropping the parquet-scrooge module from the parquet-mr repo is OK for Twitter's usage? > Remove Unused Modules > -- > > Key: PARQUET-1666 > URL: https://issues.apache.org/jira/browse/PARQUET-1666 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > In the last two meetings, Ryan Blue proposed to remove some unused Parquet > modules. This is to open a task to track it. > Here are the related meeting notes for the discussion on this. > Remove old Parquet modules > Hive modules - sounds good > Scrooge - Julien will reach out to Twitter > Tools - undecided - Cloudera may still use parquet-tools according to > Gabor. > Cascading - undecided > We can mark the modules as deprecated in their descriptions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226089#comment-17226089 ] Xinli Shang commented on PARQUET-1927: -- [~gszadovszky], I just realized the RowGroupFilter only applies the stats from ColumnChunkMetaData, not the page-level stats. There is a chance that the ColumnChunkMetaData stats say yes, but the page-level stats say no. In that case, readNextFilteredRowGroup() can still skip a whole block. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221481#comment-17221481 ] Xinli Shang commented on PARQUET-1927: -- Thanks [~gszadovszky] for the explanation. I see it now. The confusing part is that Iceberg creates the ParquetFileReader object without passing in the filter. Instead, it reimplements row-group and dictionary filtering. Hi [~rdblue], do you know why Iceberg reimplements the row-group and dictionary filtering? From what Gabor mentioned above, if we pass the filter to the ParquetFileReader constructor, all the row groups that we need to deal with later are already filtered. When we upgrade to 1.12.0, the bloom filter will be automatically applied to those row groups. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221036#comment-17221036 ] Xinli Shang commented on PARQUET-1927: -- ParquetFileReader.getFilteredRecordCount() cannot be used because Iceberg also applies the row-group stats filter and the dictionary filter. I think what we can do is make getRowRanges() public. Iceberg would call getRowRanges() to calculate the filteredRecordCount for each row group that is determined (by the row-group stats and dictionary filters) to be read. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
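If getRowRanges(int) were made public, the per-row-group filtered count would fall out of the existing RowRanges API, since RowRanges.rowCount() already exists; only the accessor is internal today. This is the usage shape the comment above proposes, not a current parquet-mr API:
{code:java}
// Hypothetical: getRowRanges(int) is private in parquet-mr today. With it
// exposed, the filtered record count of row group `blockIndex` is simply:
long filteredRecordCount = reader.getRowRanges(blockIndex).rowCount();
{code}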
[jira] [Assigned] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang reassigned PARQUET-1927: Assignee: Xinli Shang > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219293#comment-17219293 ] Xinli Shang commented on PARQUET-1927: -- [~gszadovszky], the problem is that when rowCount is 0 (line 966, https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L966), readNextFilteredRowGroup() will just call advanceToNextBlock() and then recurse to the next row group. In that case, the count returned by [PageReadStore.getRowCount()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java#L44] will be the filtered count of the next row group. Iceberg has no way to know which row group these row counts come from; it has to assume they are from the previous group. The result is a wrong count, and the Iceberg iterator will keep returning true from hasNext() even after all the records are read. The fix could be as simple as adding a skipped-record count that includes the rows skipped as whole row groups. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1396) Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
[ https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218717#comment-17218717 ] Xinli Shang commented on PARQUET-1396: -- Most of the functionality of this Jira has been addressed by "PARQUET-1817: Crypto Properties Factory". Hence I changed the name of this Jira to 'Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory'. > Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory > > > Key: PARQUET-1396 > URL: https://issues.apache.org/jira/browse/PARQUET-1396 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.10.0, 1.10.1 >Reporter: Xinli Shang >Priority: Major > Labels: pull-request-available > > This JIRA is an extension to the Parquet Modular Encryption Jira (PARQUET-1178) > that provides the basic building blocks and APIs for the encryption support. > This JIRA provides a crypto data interface for schema activation of Parquet > encryption and serves as a high-level layer on top of PARQUET-1178 to make > its adoption easier, with a pluggable key access module, > without a need to use the low-level encryption APIs. Also, this feature will > enable seamless integration with existing clients. > No change to specifications (parquet-format), no new Parquet APIs, and no > changes in existing Parquet APIs. All current applications, tests, etc. will > work. > From the developer's perspective, they can just implement the interface in a > plugin, which can be attached to any Parquet application like Hive/Spark etc. > This decouples the complexity of dealing with KMS and schema from Parquet > applications. A large organization may have hundreds or even thousands > of Parquet applications and pipelines; the decoupling makes Parquet > encryption easier to adopt. > From the end user's (for example, the data owner's) perspective, if they think a column is > sensitive, they can just mark that column's schema as sensitive, and the > Parquet application encrypts that column automatically. This makes it easy for > end users to manage the encryption of their columns. -- This message was sent by Atlassian Jira (v8.3.4#803005)
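For the example itself, a factory plugin is just a class on the classpath named by the crypto-factory property. Below is a toy sketch assuming the PARQUET-1817 interfaces (signatures as in parquet-mr 1.12-era code; possibly version-dependent). The hard-coded keys and the 'ssn' column are illustration only; a real plugin would resolve keys through a KMS and derive the sensitive-column list from the schema or config.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.ColumnEncryptionProperties;
import org.apache.parquet.crypto.EncryptionPropertiesFactory;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.metadata.ColumnPath;

public class ToyEncryptionPropertiesFactory implements EncryptionPropertiesFactory {
  // Toy 128-bit keys; a real factory would fetch keys from a KMS.
  private static final byte[] FOOTER_KEY =
      "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
  private static final byte[] COLUMN_KEY =
      "fedcba9876543210".getBytes(StandardCharsets.UTF_8);

  @Override
  public FileEncryptionProperties getFileEncryptionProperties(
      Configuration conf, Path tempFilePath, WriteSupport.WriteContext writeContext) {
    ColumnPath sensitive = ColumnPath.get("ssn"); // hypothetical sensitive column
    Map<ColumnPath, ColumnEncryptionProperties> encrypted = new HashMap<>();
    encrypted.put(sensitive,
        ColumnEncryptionProperties.builder("ssn").withKey(COLUMN_KEY).build());
    return FileEncryptionProperties.builder(FOOTER_KEY)
        .withEncryptedColumns(encrypted)
        .build();
  }
}
{code}
Activation is then a single config line, e.g. conf.set(EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME, ToyEncryptionPropertiesFactory.class.getName()), so existing writers pick up encryption without code changes.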
[jira] [Updated] (PARQUET-1396) Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
[ https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang updated PARQUET-1396: - Summary: Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory (was: Cryptodata Interface for Schema Activation of Parquet Encryption) > Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory > > > Key: PARQUET-1396 > URL: https://issues.apache.org/jira/browse/PARQUET-1396 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.10.0, 1.10.1 >Reporter: Xinli Shang >Priority: Major > Labels: pull-request-available > > This JIRA is an extension to the Parquet Modular Encryption Jira (PARQUET-1178) > that provides the basic building blocks and APIs for the encryption support. > This JIRA provides a crypto data interface for schema activation of Parquet > encryption and serves as a high-level layer on top of PARQUET-1178 to make > its adoption easier, with a pluggable key access module, > without a need to use the low-level encryption APIs. Also, this feature will > enable seamless integration with existing clients. > No change to specifications (parquet-format), no new Parquet APIs, and no > changes in existing Parquet APIs. All current applications, tests, etc. will > work. > From the developer's perspective, they can just implement the interface in a > plugin, which can be attached to any Parquet application like Hive/Spark etc. > This decouples the complexity of dealing with KMS and schema from Parquet > applications. A large organization may have hundreds or even thousands > of Parquet applications and pipelines; the decoupling makes Parquet > encryption easier to adopt. > From the end user's (for example, the data owner's) perspective, if they think a column is > sensitive, they can just mark that column's schema as sensitive, and the > Parquet application encrypts that column automatically. This makes it easy for > end users to manage the encryption of their columns. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218325#comment-17218325 ] Xinli Shang commented on PARQUET-1927: -- The workaround I can think of is to apply the ColumnIndex to row groups, something like (columnIndex, rowGroup) => recordCount, before calling readNextFilteredRowGroup() in Iceberg. If recordCount is 0, we skip calling readNextFilteredRowGroup() for that row group. Done this way, readNextFilteredRowGroup() is guaranteed never to advance to the next row group without Iceberg's knowledge. But this workaround has several issues. 1) It is not a trivial implementation, because we need to implement all types of filters against the columnIndex, which pretty much duplicates the implementation in Parquet. 2) The two implementations (in Parquet and in Iceberg) have to be consistent; if one has issues, it will leave Iceberg in an unknown state. 3) It requires other adopters like Hive and Spark to reimplement their own versions too. This is not a regression, because ColumnIndex is a new feature in 1.11.x. But I think releasing 1.11.2 would be better, because it helps the adoption of 1.11.x, as the ColumnIndex feature is one of the major features in 1.11.x. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217827#comment-17217827 ] Xinli Shang commented on PARQUET-1927: -- Adding [~rdblue], [~shardulm] FYI. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator. The hasNext() has the > following check: > valuesRead + skippedValues < totalValues > See > ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).] > > So without knowing the skipped values, it is hard to determine the result of > hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done, and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217774#comment-17217774 ] Xinli Shang commented on PARQUET-1927: -- That is correct, [~gszadovszky]! We need a finer-grained filtered-record count at the row group level for the iterator to use; one possible API shape is sketched below. Do you think it makes sense to add an API for that? If yes, do you think we can release a 1.11.2 version? I usually see no further releases after 1.xx.1. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr > Affects Versions: 1.11.0 > Reporter: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator whose hasNext() uses the > following check: > valuesRead + skippedValues < totalValues > See > [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115] > > So without knowing the skipped values, it is hard to determine hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
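One possible shape for such an API (names and placement here are illustrative only, not a final signature):
{code:java}
// Illustrative only. The idea is that the reader exposes how many records
// survive column-index filtering, so an iterator can derive
//   skippedValues = totalValues - filteredRecordCount
// without re-evaluating the filter itself.
public interface FilteredRecordCounts {
  /** Total records across all row groups after column-index filtering. */
  long getFilteredRecordCount();

  /** Records left in one row group after filtering; 0 means fully skipped. */
  long getFilteredRecordCount(int rowGroupIndex);
}
{code}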
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216849#comment-17216849 ] Xinli Shang commented on PARQUET-1927: -- [~gszadovszky], the Iceberg Parquet reader iterator relies on the check 'valuesRead < totalValues'. When integrating ColumnIndex, we replace readNextRowGroup() with readNextFilteredRowGroup(). Because readNextFilteredRowGroup() skips some records, we change the check to 'valuesRead + skippedValues < totalValues', where skippedValues is calculated as 'blockRowCount - counts_Returned_from_readNextFilteredRowGroup'. This works great. But when a whole row group is skipped, readNextFilteredRowGroup() advances to the next row group internally without Iceberg's knowledge, so Iceberg doesn't know how to calculate skippedValues; see the sketch below. If readNextFilteredRowGroup() could return how many records it skipped, or tell the index of the row group the returned pages come from, Iceberg could calculate skippedValues. > ColumnIndex should provide number of records skipped > - > > Key: PARQUET-1927 > URL: https://issues.apache.org/jira/browse/PARQUET-1927 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr > Affects Versions: 1.11.0 > Reporter: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > When integrating Parquet ColumnIndex, I found we need to know from Parquet > how many records were skipped due to ColumnIndex filtering. When rowCount is 0, > readNextFilteredRowGroup() just advances to the next row group without telling > the caller. See the code here > [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] > > Iceberg reads Parquet records with an iterator whose hasNext() uses the > following check: > valuesRead + skippedValues < totalValues > See > [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115] > > So without knowing the skipped values, it is hard to determine hasNext(). > > Currently, we can work around this with a flag: when readNextFilteredRowGroup() > returns null, we consider the whole file done and hasNext() just returns false. -- This message was sent by Atlassian Jira (v8.3.4#803005)
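A simplified sketch of that accounting, with assumed field names (not the actual Iceberg iterator), showing exactly where it breaks:
{code:java}
// Assumed fields: valuesRead, skippedValues, totalValues, blockRowCount.
boolean hasNext() {
  return valuesRead + skippedValues < totalValues;
}

void advanceRowGroup() throws IOException {
  PageReadStore pages = reader.readNextFilteredRowGroup();
  if (pages == null) {
    return; // end of file
  }
  // blockRowCount is the row count of the row group we *think* the pages came
  // from. If the reader silently skipped fully-filtered row groups first, this
  // is the wrong row group and skippedValues drifts -- the gap this issue
  // asks Parquet to close.
  skippedValues += blockRowCount - pages.getRowCount();
}
{code}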
[jira] [Created] (PARQUET-1927) ColumnIndex should provide number of records skipped
Xinli Shang created PARQUET-1927: Summary: ColumnIndex should provide number of records skipped Key: PARQUET-1927 URL: https://issues.apache.org/jira/browse/PARQUET-1927 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.11.0 Reporter: Xinli Shang Fix For: 1.12.0 When integrating Parquet ColumnIndex, I found we need to know from Parquet how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969] Iceberg reads Parquet records with an iterator whose hasNext() uses the following check: valuesRead + skippedValues < totalValues See [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115] So without knowing the skipped values, it is hard to determine hasNext(). Currently, we can work around this with a flag: when readNextFilteredRowGroup() returns null, we consider the whole file done and hasNext() just returns false; a sketch of this flag-based workaround follows below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
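The flag-based workaround might look like this (assumed names; it is coarse because it can only detect end-of-file, never how many records a skipped row group held):
{code:java}
// Sketch with assumed names. A null return from readNextFilteredRowGroup()
// may mean the reader silently skipped the remaining row groups, so we stop
// iterating instead of trying to count the skipped records.
private boolean fileExhausted = false;

boolean hasNext() {
  return !fileExhausted && valuesRead < totalValues;
}

void advanceRowGroup() throws IOException {
  PageReadStore pages = reader.readNextFilteredRowGroup();
  if (pages == null) {
    fileExhausted = true; // treat the whole file as done
  }
}
{code}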
[jira] [Created] (PARQUET-1916) Add hash functionality
Xinli Shang created PARQUET-1916: Summary: Add hash functionality Key: PARQUET-1916 URL: https://issues.apache.org/jira/browse/PARQUET-1916 Project: Parquet Issue Type: Sub-task Reporter: Xinli Shang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1915) Add null command
[ https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinli Shang reassigned PARQUET-1915: Assignee: Xinli Shang > Add null command > - > > Key: PARQUET-1915 > URL: https://issues.apache.org/jira/browse/PARQUET-1915 > Project: Parquet > Issue Type: Sub-task >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1915) Add null command
Xinli Shang created PARQUET-1915: Summary: Add null command Key: PARQUET-1915 URL: https://issues.apache.org/jira/browse/PARQUET-1915 Project: Parquet Issue Type: Sub-task Reporter: Xinli Shang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (PARQUET-1901) Add filter null check for ColumnIndex
[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186211#comment-17186211 ] Xinli Shang edited comment on PARQUET-1901 at 8/28/20, 2:23 AM: I have the initial version of the Iceberg integration working in my private repo https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661. It can now skip pages based on the column index. But it is a very early version that I haven't finalized yet, and no tests are added. I also haven't had time to address your feedback on the idtoAlias comments yet. But I hope it gives you an idea of what the integration looks like. was (Author: sha...@uber.com): I have the initial version of the Iceberg integration working in my private repo https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661. It can now skip pages based on the column index. But it is a very early version that I haven't finalized yet, and no tests are added. I also haven't had time to address your feedback on the idtoAlias comments yet. But I hope it gives you an idea of what the integration looks like. > Add filter null check for ColumnIndex > --- > > Key: PARQUET-1901 > URL: https://issues.apache.org/jira/browse/PARQUET-1901 > Project: Parquet > Issue Type: Bug > Components: parquet-mr > Affects Versions: 1.11.0 > Reporter: Xinli Shang > Assignee: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > This Jira is opened to discuss whether we should add a null check for the filter > when ColumnIndex is enabled. > In the ColumnIndexFilter#calculateRowRanges() method, the input parameter > 'filter' is assumed to be non-null without checking. It throws an NPE when > ColumnIndex is enabled (the default) but no filter is set in the > ParquetReadOptions. The call stack is below. > java.lang.NullPointerException > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) > If we don't add the check, the user needs to choose between calling > readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter > exists. > Thoughts? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex
[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186211#comment-17186211 ] Xinli Shang commented on PARQUET-1901: -- I have the initial version of the Iceberg integration working in my private repo https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661. It can now skip pages based on the column index. But it is a very early version that I haven't finalized yet, and no tests are added. I also haven't had time to address your feedback on the idtoAlias comments yet. But I hope it gives you an idea of what the integration looks like. > Add filter null check for ColumnIndex > --- > > Key: PARQUET-1901 > URL: https://issues.apache.org/jira/browse/PARQUET-1901 > Project: Parquet > Issue Type: Bug > Components: parquet-mr > Affects Versions: 1.11.0 > Reporter: Xinli Shang > Assignee: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > This Jira is opened to discuss whether we should add a null check for the filter > when ColumnIndex is enabled. > In the ColumnIndexFilter#calculateRowRanges() method, the input parameter > 'filter' is assumed to be non-null without checking. It throws an NPE when > ColumnIndex is enabled (the default) but no filter is set in the > ParquetReadOptions. The call stack is below. > java.lang.NullPointerException > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) > If we don't add the check, the user needs to choose between calling > readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter > exists. > Thoughts? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex
[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183352#comment-17183352 ] Xinli Shang commented on PARQUET-1901: -- Hi [~rdblue], please comment if you have a different opinion. This was found during the ColumnIndex integration into Iceberg. We would need to handle the null check in Iceberg anyway before Parquet 1.12.0. > Add filter null check for ColumnIndex > --- > > Key: PARQUET-1901 > URL: https://issues.apache.org/jira/browse/PARQUET-1901 > Project: Parquet > Issue Type: Bug > Components: parquet-mr > Affects Versions: 1.11.0 > Reporter: Xinli Shang > Assignee: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > This Jira is opened to discuss whether we should add a null check for the filter > when ColumnIndex is enabled. > In the ColumnIndexFilter#calculateRowRanges() method, the input parameter > 'filter' is assumed to be non-null without checking. It throws an NPE when > ColumnIndex is enabled (the default) but no filter is set in the > ParquetReadOptions. The call stack is below. > java.lang.NullPointerException > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) > If we don't add the check, the user needs to choose between calling > readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter > exists. > Thoughts? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1901) Add filter null check for ColumnIndex
Xinli Shang created PARQUET-1901: Summary: Add filter null check for ColumnIndex Key: PARQUET-1901 URL: https://issues.apache.org/jira/browse/PARQUET-1901 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.11.0 Reporter: Xinli Shang Assignee: Xinli Shang Fix For: 1.12.0 This Jira is opened to discuss whether we should add a null check for the filter when ColumnIndex is enabled. In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 'filter' is assumed to be non-null without checking. It throws an NPE when ColumnIndex is enabled (the default) but no filter is set in the ParquetReadOptions. The call stack is below. java.lang.NullPointerException at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) If we don't add the check, the user needs to choose between calling readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter exists; a caller-side guard is sketched below. Thoughts? -- This message was sent by Atlassian Jira (v8.3.4#803005)
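Until the reader checks this itself, a caller-side guard along these lines avoids the NPE (a sketch, assuming the caller still has the ParquetReadOptions used to open the reader):
{code:java}
import java.io.IOException;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetFileReader;

// Only take the column-index filtering path when a real record filter is
// configured, so ColumnIndexFilter#calculateRowRanges() never sees null.
class GuardedRead {
  static PageReadStore readRowGroup(ParquetFileReader reader, ParquetReadOptions options)
      throws IOException {
    FilterCompat.Filter filter = options.getRecordFilter();
    if (filter == null || filter == FilterCompat.NOOP) {
      return reader.readNextRowGroup();        // unfiltered path
    }
    return reader.readNextFilteredRowGroup();  // a filter exists: safe to filter
  }
}
{code}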
[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli
[ https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176961#comment-17176961 ] Xinli Shang commented on PARQUET-1801: -- I will try to do it in 1.12.0. The feature works great! We removed columns from many large tables and saved significant storage space. I will give a talk at ApacheCon 2020 on this topic. > Add column index support for 'prune' command in Parquet-tools/cli > - > > Key: PARQUET-1801 > URL: https://issues.apache.org/jira/browse/PARQUET-1801 > Project: Parquet > Issue Type: Improvement > Components: parquet-cli, parquet-mr > Affects Versions: 1.12.0 > Reporter: Xinli Shang > Assignee: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli
[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176959#comment-17176959 ] Xinli Shang commented on PARQUET-1792: -- We might want to push it to the next release. > Add 'mask' command to parquet-tools/parquet-cli > --- > > Key: PARQUET-1792 > URL: https://issues.apache.org/jira/browse/PARQUET-1792 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr > Affects Versions: 1.12.0 > Reporter: Xinli Shang > Assignee: Xinli Shang > Priority: Major > Fix For: 1.12.0 > > > Some personal-data columns need to be masked instead of being pruned > (PARQUET-1791). We need a tool to replace the raw data columns with masked > values. The masked value could be a hash, null, a redacted value, etc. The > unchanged columns should be copied over as a whole, as the 'merge' and 'prune' > commands in parquet-tools do. > > Implementing this feature in the file format is 10X faster than rewriting the > table data in the query engine. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1893) H2SeekableInputStream readFully() doesn't respect start and len
Xinli Shang created PARQUET-1893: Summary: H2SeekableInputStream readFully() doesn't respect start and len Key: PARQUET-1893 URL: https://issues.apache.org/jira/browse/PARQUET-1893 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Xinli Shang Assignee: Xinli Shang The readFully() method throws away the parameters 'start' and 'len', as shown below. public void readFully(byte[] bytes, int start, int len) throws IOException { stream.readFully(bytes); } It should be corrected as below. public void readFully(byte[] bytes, int start, int len) throws IOException { stream.readFully(bytes, start, len); } H1SeekableInputStream has already been fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
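Spelling out the contract behind that one-line fix (the code is the same as in the description, with comments added):
{code:java}
@Override
public void readFully(byte[] bytes, int start, int len) throws IOException {
  // Contract: fill exactly 'len' bytes into 'bytes' beginning at offset 'start'.
  // Forwarding only 'bytes' fills the entire array from offset 0, clobbering
  // data outside the [start, start + len) window the caller asked for.
  stream.readFully(bytes, start, len);
}
{code}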