[jira] [Commented] (PARQUET-1975) Test failure on ARM64 CPU architecture
[ https://issues.apache.org/jira/browse/PARQUET-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286593#comment-17286593 ] Ryan Blue commented on PARQUET-1975: I would not want anyone to block on brotli-codec. That's a project I released for others to use, but not something that I actively maintain. I support the idea of adding an alternative to it to Parquet. > Test failure on ARM64 CPU architecture > -- > > Key: PARQUET-1975 > URL: https://issues.apache.org/jira/browse/PARQUET-1975 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > > Trying to build Apache Parquet MR on ARM64 fails with: > > {code:java} > $ mvn clean verify > ... > Tests in error: > > testReadWriteWithCountDeprecated(org.apache.parquet.hadoop.DeprecatedInputFormatTest): > org.apache.hadoop.io.compress.CompressionCodec: Provider > org.apache.hadoop.io.compress.BrotliCodec could not be instantiated > {code} > > The reason is that com.github.rdblue:brotli-codec has no binary for aarch64 -- This message was sent by Atlassian Jira (v8.3.4#803005)
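For context, a hedged illustration of how a build could sidestep this failure until an alternative Brotli implementation lands: guard the Brotli-dependent test with a JUnit assumption on the CPU architecture. The test name and the os.arch check below are illustrative assumptions, not the project's actual fix.

{code:java}
// A hedged sketch, not the project's actual fix: skip the Brotli-dependent test on
// architectures where com.github.rdblue:brotli-codec ships no native binary.
// The test name and the os.arch check are illustrative assumptions.
import org.junit.Assume;
import org.junit.Test;

public class BrotliCodecArchTest {
  @Test
  public void testBrotliRoundTrip() {
    Assume.assumeFalse("Brotli native binary is not available on aarch64",
        "aarch64".equals(System.getProperty("os.arch")));
    // ... exercise the Brotli compression codec here ...
  }
}
{code}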
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276548#comment-17276548 ] Ryan Blue commented on PARQUET-1968: Thank you! I'm not sure why it was no longer on my calendar. I have the invite now and I plan to attend the sync on the 23rd. If you'd like, we can also set up a time to talk about this integration specifically, since it may take a while. > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Priority: Major > > FilterApi should support native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276526#comment-17276526 ] Ryan Blue commented on PARQUET-1968: I would really like to see a new Parquet API that can support some of the additional features we needed for Iceberg. I proposed adopting Iceberg's filter expressions a year or two ago, so I'm glad to see that the idea has some support from other PMC members. This is one reason why the API is in a separate module. I think we were planning to talk about this at the next Parquet sync, although I'm not sure when that will be. FYI [~sha...@uber.com]. > FilterApi support In predicate > -- > > Key: PARQUET-1968 > URL: https://issues.apache.org/jira/browse/PARQUET-1968 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Yuming Wang >Priority: Major > > FilterApi should support native In predicate. > Spark: > https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 > Impala: > https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
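Until FilterApi gains a native In predicate, callers generally expand the value set into a chain of eq predicates joined with or. A minimal sketch against the existing FilterApi follows; the column path and values are illustrative, and it assumes a non-empty value set.

{code:java}
// A workaround sketch against the existing FilterApi (no native In yet): expand the
// value set into an OR of equality predicates. Column path and values are examples,
// and the method assumes a non-empty value array.
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

public class InPredicateWorkaround {
  public static FilterPredicate inValues(String columnPath, int... values) {
    IntColumn column = FilterApi.intColumn(columnPath);
    FilterPredicate result = FilterApi.eq(column, values[0]);
    for (int i = 1; i < values.length; i++) {
      result = FilterApi.or(result, FilterApi.eq(column, values[i]));
    }
    return result;
  }
}
{code}

A native In operator would let readers evaluate the whole value set against dictionaries and column indexes in one pass instead of walking an OR tree, which is the motivation behind this issue.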
[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex
[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183481#comment-17183481 ] Ryan Blue commented on PARQUET-1901: It isn't clear to me how a filter implementation would handle the filter itself being null. It could return a default value to accept/read, but that runs into issues when filters like {{not(null)}} are passed in. So I agree with Gabor that it makes sense for a null filter to be an exceptional case in the filter implementations themselves. But I would expect a method like {{calculateRowRanges}} to correctly return the default {{RowRanges.createSingle(rowCount)}} if that method were passed a null value, since it is not actually processing the filter. For Iceberg, I'm wondering if it wouldn't be easier to implement our own filter implementation that produced row ranges and passed them in. That's how we filter row groups and I think it has been much easier not needing to convert to Parquet filters, which are difficult to work with. > Add filter null check for ColumnIndex > --- > > Key: PARQUET-1901 > URL: https://issues.apache.org/jira/browse/PARQUET-1901 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > This Jira is opened for discussion that should we add null checking for the > filter when ColumnIndex is enabled. > In the ColumnIndexFilter#calculateRowRanges() method, the input parameter > 'filter' is assumed to be non-null without checking. It throws NPE when > ColumnIndex is enabled(by default) but there is no filter set in the > ParquetReadOptions. The call stack is as below. > java.lang.NullPointerException > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891) > If we don't add, the user might need to choose to call readNextRowGroup() or > readFilteredNextRowGroup() accordingly based on filter existence. > Thoughts? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
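The workaround mentioned in the description can be expressed directly: choose the read path based on whether a filter was actually configured, so a null filter never reaches calculateRowRanges(). A sketch follows, with the helper and parameter names being assumptions; the alternative described in the comment is for calculateRowRanges() itself to return RowRanges.createSingle(rowCount) when handed a null filter.

{code:java}
// A caller-side sketch of the workaround mentioned in the description: pick the plain
// read path when no filter was configured, so a null filter never reaches
// ColumnIndexFilter.calculateRowRanges(). Helper and parameter names are assumptions.
import java.io.IOException;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;

public class RowGroupReadHelper {
  static PageReadStore readNext(ParquetFileReader reader, boolean hasFilter) throws IOException {
    return hasFilter
        ? reader.readNextFilteredRowGroup()   // column-index filtered path
        : reader.readNextRowGroup();          // plain path, no filter involved
  }
}
{code}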
[jira] [Commented] (PARQUET-1809) Add new APIs for nested predicate pushdown
[ https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051585#comment-17051585 ] Ryan Blue commented on PARQUET-1809: I think it should be fine to allow this. While there may be other problems when using `.` in names, the Spark PR that uses this shows that it works just fine to pass a string array instead of parsing a name. > Add new APIs for nested predicate pushdown > --- > > Key: PARQUET-1809 > URL: https://issues.apache.org/jira/browse/PARQUET-1809 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: DB Tsai >Priority: Major > > Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* is > using *dot* to split the column name into multi-parts of nested fields. The > drawback is that this causes issues when the field name contains *dot*. > The new APIs that will be added will take array of string directly for > multi-parts of nested fields, so no confusion as using *dot* as a separator. > See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
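A sketch of the difference this issue is about. The dot-delimited call is the existing FilterApi; the array-based overload is hypothetical and only illustrates the proposed shape.

{code:java}
// Current behavior vs. the proposed addition. The dot-delimited form is the existing
// FilterApi; the array-based overload is *hypothetical*, sketching the API this issue
// asks for, not something that exists yet. Column names are illustrative.
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.Operators.BinaryColumn;

public class NestedColumnExample {
  public static void main(String[] args) {
    // Existing API: "address.city" is split on '.', so a field literally named
    // "address.city" cannot be addressed unambiguously.
    BinaryColumn nested = FilterApi.binaryColumn("address.city");

    // Proposed (hypothetical) shape: pass the path parts explicitly, no parsing needed.
    // BinaryColumn nested2 = FilterApi.binaryColumn(new String[] {"address", "city"});
  }
}
{code}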
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969493#comment-16969493 ] Ryan Blue commented on PARQUET-1681: Looks like it might be https://issues.apache.org/jira/browse/AVRO-2400. > Avro's isElementType() change breaks the reading of some parquet(1.8.1) files > - > > Key: PARQUET-1681 > URL: https://issues.apache.org/jira/browse/PARQUET-1681 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.10.0, 1.9.1, 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Critical > > When using the Avro schema below to write a parquet(1.8.1) file and then read > back by using parquet 1.10.1 without passing any schema, the reading throws > an exception "XXX is not a group" . Reading through parquet 1.8.1 is fine. > { > "name": "phones", > "type": [ > "null", > { > "type": "array", > "items": { > "type": "record", > "name": "phones_items", > "fields": [ > > { "name": "phone_number", > "type": [ "null", > "string" ], "default": null > } > ] > } > } > ], > "default": null > } > The code to read is as below > val reader = > AvroParquetReader._builder_[SomeRecordType](parquetPath).withConf(*new* > Configuration).build() > reader.read() > PARQUET-651 changed the method isElementType() by relying on Avro's > checkReaderWriterCompatibility() to check the compatibility. However, > checkReaderWriterCompatibility() consider the ParquetSchema and the > AvroSchema(converted from File schema) as not compatible(the name in avro > schema is ‘phones_items’, but the name is ‘array’ in Parquet schema, hence > not compatible) . Hence return false and caused the “phone_number” field in > the above schema to be considered as group type which is not true. Then the > exception throws as .asGroupType(). > I didn’t try writing via parquet 1.10.1 would reproduce the same problem or > not. But it could because the translation of Avro schema to Parquet schema is > not changed(didn’t verify yet). > I hesitate to revert PARQUET-651 because it solved several problems. I would > like to hear the community's thoughts on it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969491#comment-16969491 ] Ryan Blue commented on PARQUET-1681: I think we should be able to work around this instead of reverting PARQUET-651. If the compatibility check requires that the name matches, then we should be able to ensure that the name matches when converting the Parquet schema to Avro. > Avro's isElementType() change breaks the reading of some parquet(1.8.1) files > - > > Key: PARQUET-1681 > URL: https://issues.apache.org/jira/browse/PARQUET-1681 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.10.0, 1.9.1, 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Critical > > When using the Avro schema below to write a parquet(1.8.1) file and then read > back by using parquet 1.10.1 without passing any schema, the reading throws > an exception "XXX is not a group" . Reading through parquet 1.8.1 is fine. > { > "name": "phones", > "type": [ > "null", > { > "type": "array", > "items": { > "type": "record", > "name": "phones_items", > "fields": [ > > { "name": "phone_number", > "type": [ "null", > "string" ], "default": null > } > ] > } > } > ], > "default": null > } > The code to read is as below > val reader = > AvroParquetReader._builder_[SomeRecordType](parquetPath).withConf(*new* > Configuration).build() > reader.read() > PARQUET-651 changed the method isElementType() by relying on Avro's > checkReaderWriterCompatibility() to check the compatibility. However, > checkReaderWriterCompatibility() consider the ParquetSchema and the > AvroSchema(converted from File schema) as not compatible(the name in avro > schema is ‘phones_items’, but the name is ‘array’ in Parquet schema, hence > not compatible) . Hence return false and caused the “phone_number” field in > the above schema to be considered as group type which is not true. Then the > exception throws as .asGroupType(). > I didn’t try writing via parquet 1.10.1 would reproduce the same problem or > not. But it could because the translation of Avro schema to Parquet schema is > not changed(didn’t verify yet). > I hesitate to revert PARQUET-651 because it solved several problems. I would > like to hear the community's thoughts on it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969489#comment-16969489 ] Ryan Blue commented on PARQUET-1681: The Avro check should ignore record names if the record is the root. Has this check changed in Avro recently? > Avro's isElementType() change breaks the reading of some parquet(1.8.1) files > - > > Key: PARQUET-1681 > URL: https://issues.apache.org/jira/browse/PARQUET-1681 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.10.0, 1.9.1, 1.11.0 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Critical > > When using the Avro schema below to write a parquet(1.8.1) file and then read > back by using parquet 1.10.1 without passing any schema, the reading throws > an exception "XXX is not a group" . Reading through parquet 1.8.1 is fine. > { > "name": "phones", > "type": [ > "null", > { > "type": "array", > "items": { > "type": "record", > "name": "phones_items", > "fields": [ > > { "name": "phone_number", > "type": [ "null", > "string" ], "default": null > } > ] > } > } > ], > "default": null > } > The code to read is as below > val reader = > AvroParquetReader._builder_[SomeRecordType](parquetPath).withConf(*new* > Configuration).build() > reader.read() > PARQUET-651 changed the method isElementType() by relying on Avro's > checkReaderWriterCompatibility() to check the compatibility. However, > checkReaderWriterCompatibility() consider the ParquetSchema and the > AvroSchema(converted from File schema) as not compatible(the name in avro > schema is ‘phones_items’, but the name is ‘array’ in Parquet schema, hence > not compatible) . Hence return false and caused the “phone_number” field in > the above schema to be considered as group type which is not true. Then the > exception throws as .asGroupType(). > I didn’t try writing via parquet 1.10.1 would reproduce the same problem or > not. But it could because the translation of Avro schema to Parquet schema is > not changed(didn’t verify yet). > I hesitate to revert PARQUET-651 because it solved several problems. I would > like to hear the community's thoughts on it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
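A standalone illustration (Avro only, no Parquet involved) of the name sensitivity discussed above: two record schemas that differ only in their record name. The field layout is made up, and the printed result depends on the Avro version, which is exactly what the comments are debating.

{code:java}
// A standalone illustration (Avro only) of the name sensitivity described above: two
// record schemas that differ only in their record name. Field layout is made up; the
// printed result depends on the Avro version, which is what the comments debate.
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class RecordNameCompatibility {
  public static void main(String[] args) {
    Schema writer = SchemaBuilder.record("phones_items").fields()
        .optionalString("phone_number").endRecord();
    Schema reader = SchemaBuilder.record("array").fields()
        .optionalString("phone_number").endRecord();
    System.out.println(
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType());
    // reported INCOMPATIBLE when record names must match; the comment above argues the
    // name should be ignored at the root (or made to match during schema conversion)
  }
}
{code}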
[jira] [Commented] (PARQUET-1685) Truncate the stored min and max for String statistics to reduce the footer size
[ https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961187#comment-16961187 ] Ryan Blue commented on PARQUET-1685: Looks like Gabor is right. The stats fields used for each column chunk (and page) are called min_value and max_value, so we should not truncate them. We will have to use the new indexes to add truncation. That's good because we want more people to look at the implementation and validate that work anyway. Maybe we could add a flag for truncating the min and max values, as long as it is disabled by default and stored in the file's key-value metadata. > Truncate the stored min and max for String statistics to reduce the footer > size > > > Key: PARQUET-1685 > URL: https://issues.apache.org/jira/browse/PARQUET-1685 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.10.1 >Reporter: Xinli Shang >Assignee: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > Iceberg has a cool feature that truncates the stored min, max statistics to > minimize the metadata size. We can borrow to truncate them in Parquet also to > reduce the size of the footer, or even the page header. Here is the code in > IceBerg > [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java]. > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
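A sketch of the Iceberg-style truncation referenced above, not Parquet's eventual implementation: a lower bound can simply be cut to a prefix, while an upper bound must keep sorting at or above the real maximum, so its last retained code point is incremented.

{code:java}
// A sketch of the Iceberg-style truncation referenced above, not Parquet's
// implementation. A lower bound may simply be cut to a prefix; an upper bound must
// still sort at or above the real maximum, so its last kept code point is incremented.
public class TruncateBounds {
  public static String truncateMin(String min, int length) {
    if (min.codePointCount(0, min.length()) <= length) {
      return min;
    }
    return min.substring(0, min.offsetByCodePoints(0, length));
  }

  public static String truncateMax(String max, int length) {
    if (max.codePointCount(0, max.length()) <= length) {
      return max;
    }
    int cut = max.offsetByCodePoints(0, length);
    int lastStart = max.offsetByCodePoints(cut, -1);
    int bumped = max.codePointAt(lastStart) + 1;
    // a production version must handle overflow past the last valid code point,
    // as Iceberg's UnicodeUtil does
    return max.substring(0, lastStart) + new String(Character.toChars(bumped));
  }
}
{code}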
[jira] [Commented] (PARQUET-722) Building with JDK 8 fails over a maven bug
[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911797#comment-16911797 ] Ryan Blue commented on PARQUET-722: --- Looks like this was fixed when cascading3 support updated the maven-remote-resources-plugin: [https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397] I've confirmed that copying that block into older versions also fixes the problem so I'm going to mark this resolved. > Building with JDK 8 fails over a maven bug > -- > > Key: PARQUET-722 > URL: https://issues.apache.org/jira/browse/PARQUET-722 > Project: Parquet > Issue Type: Bug >Reporter: Niels Basjes >Priority: Major > > When I build parquet on my system I get this error during the build: > {quote} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) > on project parquet-generator: Error rendering velocity resource. > NullPointerException -> [Help 1] > {quote} > About a year ago [~julienledem] responded that this is caused due to a bug in > Maven in combination with Java 8: > At this page > http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512 > > Now this bug has been solved at the Maven end in maven-filtering 1.2 > https://issues.apache.org/jira/browse/MSHARED-319 > The problem is that this fix has not yet been integrated into the latest > available maven versions yet. > I'll put up a pull request with a proposed fix for this. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (PARQUET-722) Building with JDK 8 fails over a maven bug
[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911797#comment-16911797 ] Ryan Blue edited comment on PARQUET-722 at 8/20/19 10:59 PM: - Looks like this was fixed when cascading3 support updated the maven-remote-resources-plugin: [https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397] I've confirmed that copying that block into older versions also fixes the problem. was (Author: rdblue): Looks like this was fixed when cascading3 support updated the maven-remote-resources-plugin: [https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397] I've confirmed that copying that block into older versions also fixes the problem so I'm going to mark this resolved. > Building with JDK 8 fails over a maven bug > -- > > Key: PARQUET-722 > URL: https://issues.apache.org/jira/browse/PARQUET-722 > Project: Parquet > Issue Type: Bug >Reporter: Niels Basjes >Priority: Major > > When I build parquet on my system I get this error during the build: > {quote} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) > on project parquet-generator: Error rendering velocity resource. > NullPointerException -> [Help 1] > {quote} > About a year ago [~julienledem] responded that this is caused due to a bug in > Maven in combination with Java 8: > At this page > http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512 > > Now this bug has been solved at the Maven end in maven-filtering 1.2 > https://issues.apache.org/jira/browse/MSHARED-319 > The problem is that this fix has not yet been integrated into the latest > available maven versions yet. > I'll put up a pull request with a proposed fix for this. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (PARQUET-1434) Release parquet-mr 1.11.0
[ https://issues.apache.org/jira/browse/PARQUET-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891462#comment-16891462 ] Ryan Blue commented on PARQUET-1434: My concern is that it has not been reviewed well enough to be confident that the write path implements the spec correctly. So there aren't specific issues to address. I made suggestions on an integration test Zoltan wrote, but that wasn't committed to the Parquet repository. Getting that cleaned up and committed is the only thing I can think of for now. > Release parquet-mr 1.11.0 > - > > Key: PARQUET-1434 > URL: https://issues.apache.org/jira/browse/PARQUET-1434 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Nandor Kollar >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1488) UserDefinedPredicate throw NullPointerException
[ https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884204#comment-16884204 ] Ryan Blue commented on PARQUET-1488: We discussed this on SPARK-28371. Previously, Parquet did not fail if a UserDefinedPredicate did not handle null values, so I think that it is a regression that Parquet will cause previously working code to fail. I think that it is correct for Parquet to call a UDP the way that it is, but that Parquet should catch exceptions thrown by the predicate and should process the row group where there error was thrown. That way, Parquet can keep the optimization for columns that are all null, but it doesn't break existing code. [~yumwang], would you like to submit a PR for this? > UserDefinedPredicate throw NullPointerException > --- > > Key: PARQUET-1488 > URL: https://issues.apache.org/jira/browse/PARQUET-1488 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > It throws {{NullPointerException}} after upgrade parquet to 1.11.0 when using > {{UserDefinedPredicate}}. > The > [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578] > is: > {code:java} > new UserDefinedPredicate[Binary] with Serializable { > > private val strToBinary = Binary.fromReusedByteArray(v.getBytes) > > private val size = strToBinary.length > > > > override def canDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > < 0 || > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) > 0 > } > > > > override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > == 0 && > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) == 0 > } > > > > override def keep(value: Binary): Boolean = { > > UTF8String.fromBytes(value.getBytes).startsWith( > > UTF8String.fromBytes(strToBinary.getBytes)) > > } > > } > > {code} > The stack trace is: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) > at > 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
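Independently of how parquet-mr resolves this, a UserDefinedPredicate written against 1.11.0 can defend itself by handling null in keep(). A Java sketch follows (the Spark predicate in the description is Scala); the starts-with semantics are mirrored loosely, and the conservative canDrop/inverseCanDrop bodies are simplifications.

{code:java}
// A defensive sketch, loosely mirroring the Scala UDP in the description: with column
// indexes, keep() can now be called with null, so the predicate handles it explicitly
// instead of throwing NPE. Semantics are an illustrative "starts with" check.
import org.apache.parquet.filter2.predicate.Statistics;
import org.apache.parquet.filter2.predicate.UserDefinedPredicate;
import org.apache.parquet.io.api.Binary;

public class StartsWithPredicate extends UserDefinedPredicate<Binary> {
  private final Binary prefix;

  public StartsWithPredicate(Binary prefix) {
    this.prefix = prefix;
  }

  @Override
  public boolean keep(Binary value) {
    if (value == null) {
      return false;  // a null value cannot start with the prefix; never throw
    }
    byte[] v = value.getBytes();
    byte[] p = prefix.getBytes();
    if (v.length < p.length) {
      return false;
    }
    for (int i = 0; i < p.length; i++) {
      if (v[i] != p[i]) {
        return false;
      }
    }
    return true;
  }

  @Override
  public boolean canDrop(Statistics<Binary> statistics) {
    return false;  // conservative simplification: never drop based on min/max here
  }

  @Override
  public boolean inverseCanDrop(Statistics<Binary> statistics) {
    return false;  // conservative simplification
  }
}
{code}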
[jira] [Assigned] (PARQUET-1488) UserDefinedPredicate throw NullPointerException
[ https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1488: -- Assignee: Yuming Wang (was: Gabor Szadovszky) > UserDefinedPredicate throw NullPointerException > --- > > Key: PARQUET-1488 > URL: https://issues.apache.org/jira/browse/PARQUET-1488 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > It throws {{NullPointerException}} after upgrade parquet to 1.11.0 when using > {{UserDefinedPredicate}}. > The > [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578] > is: > {code:java} > new UserDefinedPredicate[Binary] with Serializable { > > private val strToBinary = Binary.fromReusedByteArray(v.getBytes) > > private val size = strToBinary.length > > > > override def canDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > < 0 || > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) > 0 > } > > > > override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > == 0 && > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) == 0 > } > > > > override def keep(value: Binary): Boolean = { > > UTF8String.fromBytes(value.getBytes).startsWith( > > UTF8String.fromBytes(strToBinary.getBytes)) > > } > > } > > {code} > The stack trace is: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (PARQUET-1488) UserDefinedPredicate throw NullPointerException
[ https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reopened PARQUET-1488: > UserDefinedPredicate throw NullPointerException > --- > > Key: PARQUET-1488 > URL: https://issues.apache.org/jira/browse/PARQUET-1488 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Yuming Wang >Assignee: Gabor Szadovszky >Priority: Major > > It throws {{NullPointerException}} after upgrade parquet to 1.11.0 when using > {{UserDefinedPredicate}}. > The > [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578] > is: > {code:java} > new UserDefinedPredicate[Binary] with Serializable { > > private val strToBinary = Binary.fromReusedByteArray(v.getBytes) > > private val size = strToBinary.length > > > > override def canDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > < 0 || > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) > 0 > } > > > > override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = { > > val comparator = > PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR > val max = statistics.getMax > > val min = statistics.getMin > > comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) > == 0 && > comparator.compare(min.slice(0, math.min(size, min.length)), > strToBinary) == 0 > } > > > > override def keep(value: Binary): Boolean = { > > UTF8String.fromBytes(value.getBytes).startsWith( > > UTF8String.fromBytes(strToBinary.getBytes)) > > } > > } > > {code} > The stack trace is: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (PARQUET-1624) ParquetFileReader.open ignores Hadoop configuration options
Ryan Blue created PARQUET-1624: -- Summary: ParquetFileReader.open ignores Hadoop configuration options Key: PARQUET-1624 URL: https://issues.apache.org/jira/browse/PARQUET-1624 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.10.0, 1.11.0 Reporter: Ryan Blue Assignee: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries
[ https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775448#comment-16775448 ] Ryan Blue commented on PARQUET-1142: The next steps for this are to get compression working without relying on Hadoop. After that, it is a matter of some fairly simple refactoring of the file writer. But that refactoring doesn't help much unless the compression implementations also don't depend on Hadoop. > Avoid leaking Hadoop API to downstream libraries > > > Key: PARQUET-1142 > URL: https://issues.apache.org/jira/browse/PARQUET-1142 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} > and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} > and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts > of the read path, but this needs to be extended to the write path and to > avoid passing options through {{Configuration}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
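On the read side the Hadoop-free entry points mentioned above already exist; a sketch of opening a reader through InputFile and ParquetReadOptions rather than Path and Configuration. HadoopInputFile is used here only as a convenient InputFile implementation; the point of the issue is that any implementation can be supplied.

{code:java}
// A sketch of the read-side abstraction this issue refers to: ParquetFileReader can be
// opened from any InputFile + ParquetReadOptions instead of a Hadoop Path/Configuration.
// HadoopInputFile is used only as a convenient InputFile implementation here.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.HadoopReadOptions;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

public class ReadViaInputFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    InputFile file = HadoopInputFile.fromPath(new Path(args[0]), conf);
    ParquetReadOptions options = HadoopReadOptions.builder(conf).build();
    try (ParquetFileReader reader = ParquetFileReader.open(file, options)) {
      System.out.println("rows: " + reader.getRecordCount());
    }
  }
}
{code}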
[jira] [Resolved] (PARQUET-1281) Jackson dependency
[ https://issues.apache.org/jira/browse/PARQUET-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1281. Resolution: Not A Problem > Jackson dependency > -- > > Key: PARQUET-1281 > URL: https://issues.apache.org/jira/browse/PARQUET-1281 > Project: Parquet > Issue Type: Improvement >Reporter: Qinghui Xu >Priority: Major > > Currently we shaded jackson in parquet-jackson module (org.codehaus.jackon > --> shaded.parquet.org.codehaus.jackson), but in fact we do not use the > shaded jackson in parquet-hadoop code. Is that a mistake? (see > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L26) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1512) Release Parquet Java 1.10.1
[ https://issues.apache.org/jira/browse/PARQUET-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1512. Resolution: Fixed > Release Parquet Java 1.10.1 > --- > > Key: PARQUET-1512 > URL: https://issues.apache.org/jira/browse/PARQUET-1512 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.1 > > > This is an umbrella issue to track the 1.10.1 release. Please link issues to > include in the release as blockers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas
[ https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-138: - Assignee: Nicolas Trinquier (was: Ryan Blue) > Parquet should allow a merge between required and optional schemas > -- > > Key: PARQUET-138 > URL: https://issues.apache.org/jira/browse/PARQUET-138 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Robert Justice >Assignee: Nicolas Trinquier >Priority: Major > Labels: pull-request-available > > In discussion with Ryan, he felt we should be able to merge from required > binary to optional binary and the resulting schema would be optional > https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java > {code:java} > try { > t3.union(t4); > fail("moving from optional to required"); > } catch (IncompatibleSchemaModificationException e) { > assertEquals("repetition constraint is more restrictive: can not merge > type required binary a into optional binary a", e.getMessage()); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas
[ https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-138: - Assignee: Nicolas Trinquier (was: Nicolas Trinquier) > Parquet should allow a merge between required and optional schemas > -- > > Key: PARQUET-138 > URL: https://issues.apache.org/jira/browse/PARQUET-138 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Robert Justice >Assignee: Nicolas Trinquier >Priority: Major > Labels: pull-request-available > > In discussion with Ryan, he felt we should be able to merge from required > binary to optional binary and the resulting schema would be optional > https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java > {code:java} > try { > t3.union(t4); > fail("moving from optional to required"); > } catch (IncompatibleSchemaModificationException e) { > assertEquals("repetition constraint is more restrictive: can not merge > type required binary a into optional binary a", e.getMessage()); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas
[ https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-138: - Assignee: Ryan Blue > Parquet should allow a merge between required and optional schemas > -- > > Key: PARQUET-138 > URL: https://issues.apache.org/jira/browse/PARQUET-138 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.6.0 >Reporter: Robert Justice >Assignee: Ryan Blue >Priority: Major > Labels: pull-request-available > > In discussion with Ryan, he felt we should be able to merge from required > binary to optional binary and the resulting schema would be optional > https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java > {code:java} > try { > t3.union(t4); > fail("moving from optional to required"); > } catch (IncompatibleSchemaModificationException e) { > assertEquals("repetition constraint is more restrictive: can not merge > type required binary a into optional binary a", e.getMessage()); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
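The request in code form, as a sketch: merging a required field with an optional one. Today the union() call throws; the issue asks that it succeed and yield the less restrictive optional repetition.

{code:java}
// The behavior this issue asks for, in code form. Today the union() call below throws
// IncompatibleSchemaModificationException; the request is that it succeed and produce
// "optional binary a" (the less restrictive repetition).
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;

public class UnionRepetition {
  public static void main(String[] args) {
    MessageType required = Types.buildMessage().required(BINARY).named("a").named("schema");
    MessageType optional = Types.buildMessage().optional(BINARY).named("a").named("schema");
    MessageType merged = optional.union(required);  // desired result: optional binary a
    System.out.println(merged);
  }
}
{code}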
[jira] [Commented] (PARQUET-1520) Update README to use correct build and version info
[ https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757679#comment-16757679 ] Ryan Blue commented on PARQUET-1520: Thanks for contributing! > Update README to use correct build and version info > --- > > Key: PARQUET-1520 > URL: https://issues.apache.org/jira/browse/PARQUET-1520 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-1520) Update README to use correct build and version info
[ https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1520: -- Assignee: Dongjoon Hyun > Update README to use correct build and version info > --- > > Key: PARQUET-1520 > URL: https://issues.apache.org/jira/browse/PARQUET-1520 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1520) Update README to use correct build and version info
[ https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1520. Resolution: Fixed Fix Version/s: 1.10.2 > Update README to use correct build and version info > --- > > Key: PARQUET-1520 > URL: https://issues.apache.org/jira/browse/PARQUET-1520 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 1.10.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1510. Resolution: Fixed > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 1.11.0, 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
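The essence of the bug above, sketched (not the actual DictionaryFilter patch): for a not-equals predicate, a row group whose dictionary contains only the tested value can still hold nulls, and those nulls satisfy the predicate, so the null count must be consulted before dropping. A production check must also cope with missing statistics.

{code:java}
// The essence of the bug, not the actual DictionaryFilter patch: for notEq, a row group
// whose dictionary contains only the tested value can still hold nulls, and those nulls
// satisfy the predicate, so the null count has to be consulted before dropping.
import java.util.Set;
import org.apache.parquet.column.statistics.Statistics;

public class NotEqDropCheck {
  public static <T> boolean canDropForNotEq(T value, Set<T> dictionaryValues,
                                            Statistics<?> stats) {
    boolean hasNulls = stats != null && stats.getNumNulls() > 0;
    boolean onlyValue = dictionaryValues.size() == 1 && dictionaryValues.contains(value);
    return onlyValue && !hasNulls;  // dropping is only safe when no nulls are present
  }
}
{code}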
[jira] [Resolved] (PARQUET-1509) Update Docs for Hive Deprecation
[ https://issues.apache.org/jira/browse/PARQUET-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1509. Resolution: Fixed > Update Docs for Hive Deprecation > > > Key: PARQUET-1509 > URL: https://issues.apache.org/jira/browse/PARQUET-1509 > Project: Parquet > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Labels: pull-request-available > > Update docs to state that Hive integration is now deprecated. [PARQUET-1447] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-1509) Update Docs for Hive Deprecation
[ https://issues.apache.org/jira/browse/PARQUET-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1509: -- Assignee: BELUGA BEHR > Update Docs for Hive Deprecation > > > Key: PARQUET-1509 > URL: https://issues.apache.org/jira/browse/PARQUET-1509 > Project: Parquet > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Labels: pull-request-available > > Update docs to state that Hive integration is now deprecated. [PARQUET-1447] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1513) HiddenFileFilter Streamline
[ https://issues.apache.org/jira/browse/PARQUET-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1513. Resolution: Fixed Fix Version/s: 1.12.0 > HiddenFileFilter Streamline > --- > > Key: PARQUET-1513 > URL: https://issues.apache.org/jira/browse/PARQUET-1513 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Labels: pull-request-available > Fix For: 1.12.0 > > > {code:java} > public boolean accept(Path p) { > return !p.getName().startsWith("_") && !p.getName().startsWith("."); > } > {code} > This can be streamlined a bit further. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-1513) HiddenFileFilter Streamline
[ https://issues.apache.org/jira/browse/PARQUET-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1513: -- Assignee: BELUGA BEHR > HiddenFileFilter Streamline > --- > > Key: PARQUET-1513 > URL: https://issues.apache.org/jira/browse/PARQUET-1513 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Labels: pull-request-available > > {code:java} > public boolean accept(Path p) { > return !p.getName().startsWith("_") && !p.getName().startsWith("."); > } > {code} > This can be streamlined a bit further. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
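The streamlining asked for above is small: fetch the name once and test its first character instead of making two startsWith() calls. A sketch, assuming path names are never empty.

{code:java}
// A sketch of the suggested streamlining (assumes names are non-empty): call getName()
// once and test the first character instead of two startsWith() calls.
public boolean accept(Path p) {
  char first = p.getName().charAt(0);
  return first != '_' && first != '.';
}
{code}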
[jira] [Assigned] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1510: -- Assignee: Ryan Blue > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 1.11.0, 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Issue Type: Bug (was: Improvement) > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Major > Labels: correctness, pull-request-available > Fix For: 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Affects Version/s: 1.9.1 1.9.0 1.10.0 > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Major > Labels: correctness, pull-request-available > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752641#comment-16752641 ] Ryan Blue commented on PARQUET-1510: Fixed metadata. > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 1.11.0, 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Labels: correctness pull-request-available (was: pull-request-available) > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Ryan Blue >Priority: Major > Labels: correctness, pull-request-available > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1512) Release Parquet Java 1.10.1
Ryan Blue created PARQUET-1512: -- Summary: Release Parquet Java 1.10.1 Key: PARQUET-1512 URL: https://issues.apache.org/jira/browse/PARQUET-1512 Project: Parquet Issue Type: Task Components: parquet-mr Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.10.1 This is an umbrella issue to track the 1.10.1 release. Please link issues to include in the release as blockers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Priority: Blocker (was: Major) > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Fix Version/s: 1.10.1 > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Major > Labels: correctness, pull-request-available > Fix For: 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Fix Version/s: 1.11.0 > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.10.0, 1.9.1 >Reporter: Ryan Blue >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 1.11.0, 1.10.1 > > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1510: --- Component/s: parquet-mr > Dictionary filter skips null values when evaluating not-equals. > --- > > Key: PARQUET-1510 > URL: https://issues.apache.org/jira/browse/PARQUET-1510 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Ryan Blue >Priority: Major > Labels: pull-request-available > > This was discovered in Spark, see SPARK-26677. From the Spark PR: > {code} > // Repeat the values to get dictionary encoding. > Seq(Some("A"), Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") > spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > +-+ > {code} > {code} > // Use plain encoding. > Seq(Some("A"), > None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") > spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.
Ryan Blue created PARQUET-1510: -- Summary: Dictionary filter skips null values when evaluating not-equals. Key: PARQUET-1510 URL: https://issues.apache.org/jira/browse/PARQUET-1510 Project: Parquet Issue Type: Improvement Reporter: Ryan Blue This was discovered in Spark, see SPARK-26677. From the Spark PR: {code} // Repeat the values to get dictionary encoding. Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo") spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show() +-+ |value| +-+ +-+ {code} {code} // Use plain encoding. Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar") spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show() +-+ |value| +-+ | null| +-+ {code} This is a correctness issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1447) MapredParquetOutputFormat - Save Some Array Allocations
[ https://issues.apache.org/jira/browse/PARQUET-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737333#comment-16737333 ] Ryan Blue commented on PARQUET-1447: I'd be happy to merge a PR! > MapredParquetOutputFormat - Save Some Array Allocations > --- > > Key: PARQUET-1447 > URL: https://issues.apache.org/jira/browse/PARQUET-1447 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: BELUGA BEHR >Assignee: Ryan Blue >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (PARQUET-1447) MapredParquetOutputFormat - Save Some Array Allocations
[ https://issues.apache.org/jira/browse/PARQUET-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue reassigned PARQUET-1447: -- Resolution: Won't Fix Assignee: Ryan Blue I'm closing this because these classes are now maintained in Hive, not Parquet. > MapredParquetOutputFormat - Save Some Array Allocations > --- > > Key: PARQUET-1447 > URL: https://issues.apache.org/jira/browse/PARQUET-1447 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: BELUGA BEHR >Assignee: Ryan Blue >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1465) CLONE - Add a way to append encoded blocks in ParquetFileWriter
[ https://issues.apache.org/jira/browse/PARQUET-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1465. Resolution: Fixed See PARQUET-382. > CLONE - Add a way to append encoded blocks in ParquetFileWriter > --- > > Key: PARQUET-1465 > URL: https://issues.apache.org/jira/browse/PARQUET-1465 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.8.0 >Reporter: Steven Paster >Assignee: Ryan Blue >Priority: Major > Fix For: 1.8.2, 1.9.0 > > > Concatenating two files together currently requires reading the source files > and rewriting the content from scratch. This ends up taking a lot of memory, > even if the data is already encoded correctly and blocks just need to be > appended and have their metadata updated. Merging two files should be fast > and not take much memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
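For context, a rough sketch of what the append fast path from PARQUET-382 looks like from user code, assuming the appendFile call that was added there and illustrative file names; it copies already-encoded row groups instead of decoding and rewriting them.
{code:java}
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

public class AppendBlocksExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path merged = new Path("merged.parquet");        // illustrative output path
    Path[] inputs = { new Path("part-0.parquet"),    // illustrative inputs that share
                      new Path("part-1.parquet") };  // the same schema

    // take the schema from the first input's footer
    MessageType schema = ParquetFileReader.readFooter(conf, inputs[0])
        .getFileMetaData().getSchema();

    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, merged);
    writer.start();
    for (Path input : inputs) {
      // copies the encoded row groups, only rewriting offsets in the footer metadata
      writer.appendFile(conf, input);
    }
    writer.end(Collections.<String, String>emptyMap());
  }
}
{code}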
[jira] [Resolved] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader
[ https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1407. Resolution: Fixed Assignee: Nandor Kollar > Data loss on duplicate values with AvroParquetWriter/Reader > --- > > Key: PARQUET-1407 > URL: https://issues.apache.org/jira/browse/PARQUET-1407 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.9.0, 1.10.0, 1.8.3 >Reporter: Scott Carey >Assignee: Nandor Kollar >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.0 > > > {code:java} > public class Blah { > private static Path parquetFile = new Path("oops"); > private static Schema schema = SchemaBuilder.record("spark_schema") > .fields().optionalBytes("value").endRecord(); > private static GenericData.Record recordFor(String value) { > return new GenericRecordBuilder(schema) > .set("value", value.getBytes()).build(); > } > public static void main(String ... args) throws IOException { > try (ParquetWriter writer = AvroParquetWriter > .builder(parquetFile) > .withSchema(schema) > .build()) { > writer.write(recordFor("one")); > writer.write(recordFor("two")); > writer.write(recordFor("three")); > writer.write(recordFor("three")); > writer.write(recordFor("two")); > writer.write(recordFor("one")); > writer.write(recordFor("zero")); > } > try (ParquetReader reader = AvroParquetReader > .builder(parquetFile) > .withConf(new Configuration()).build()) { > GenericRecord rec; > int i = 0; > while ((rec = reader.read()) != null) { > ByteBuffer buf = (ByteBuffer) rec.get("value"); > byte[] bytes = new byte[buf.remaining()]; > buf.get(bytes); > System.out.println("rec " + i++ + ": " + new String(bytes)); > } > } > } > } > {code} > Expected output: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: three > rec 4: two > rec 5: one > rec 6: zero{noformat} > Actual: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: > rec 4: > rec 5: > rec 6: zero{noformat} > > This was found when we started getting empty byte[] values back in spark > unexpectedly. (Spark 2.3.1 and Parquet 1.8.3). I have not tried to > reproduce with parquet 1.9.0, but its a bad enough bug that I would like a > 1.8.4 release that I can drop-in replace 1.8.3 without any binary > compatibility issues. > Duplicate byte[] values are lost. > > A few clues: > If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go > to zero. I suspect a ByteBuffer is being recycled, but the call to > ByteBuffer.get mutates it. I wonder if an appropriately placed > ByteBuffer.duplicate() would fix it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader
[ https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688796#comment-16688796 ] Ryan Blue commented on PARQUET-1407: [~scottcarey], [~jackytan], sorry for the delay. I didn't see this issue until now. I've posted a PR that should fix it. I haven't written a test for it. If you want to pick that commit and submit a PR with a test, that would be a great way to contribute! If not, I'll get it done sometime soon and this can be fixed in 1.11.0. Thanks! > Data loss on duplicate values with AvroParquetWriter/Reader > --- > > Key: PARQUET-1407 > URL: https://issues.apache.org/jira/browse/PARQUET-1407 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.9.0, 1.10.0, 1.8.3 >Reporter: Scott Carey >Priority: Critical > Labels: pull-request-available > > {code:java} > public class Blah { > private static Path parquetFile = new Path("oops"); > private static Schema schema = SchemaBuilder.record("spark_schema") > .fields().optionalBytes("value").endRecord(); > private static GenericData.Record recordFor(String value) { > return new GenericRecordBuilder(schema) > .set("value", value.getBytes()).build(); > } > public static void main(String ... args) throws IOException { > try (ParquetWriter writer = AvroParquetWriter > .builder(parquetFile) > .withSchema(schema) > .build()) { > writer.write(recordFor("one")); > writer.write(recordFor("two")); > writer.write(recordFor("three")); > writer.write(recordFor("three")); > writer.write(recordFor("two")); > writer.write(recordFor("one")); > writer.write(recordFor("zero")); > } > try (ParquetReader reader = AvroParquetReader > .builder(parquetFile) > .withConf(new Configuration()).build()) { > GenericRecord rec; > int i = 0; > while ((rec = reader.read()) != null) { > ByteBuffer buf = (ByteBuffer) rec.get("value"); > byte[] bytes = new byte[buf.remaining()]; > buf.get(bytes); > System.out.println("rec " + i++ + ": " + new String(bytes)); > } > } > } > } > {code} > Expected output: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: three > rec 4: two > rec 5: one > rec 6: zero{noformat} > Actual: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: > rec 4: > rec 5: > rec 6: zero{noformat} > > This was found when we started getting empty byte[] values back in spark > unexpectedly. (Spark 2.3.1 and Parquet 1.8.3). I have not tried to > reproduce with parquet 1.9.0, but its a bad enough bug that I would like a > 1.8.4 release that I can drop-in replace 1.8.3 without any binary > compatibility issues. > Duplicate byte[] values are lost. > > A few clues: > If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go > to zero. I suspect a ByteBuffer is being recycled, but the call to > ByteBuffer.get mutates it. I wonder if an appropriately placed > ByteBuffer.duplicate() would fix it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader
[ https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1407: --- Affects Version/s: 1.10.0 > Data loss on duplicate values with AvroParquetWriter/Reader > --- > > Key: PARQUET-1407 > URL: https://issues.apache.org/jira/browse/PARQUET-1407 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.9.0, 1.10.0, 1.8.3 >Reporter: Scott Carey >Priority: Critical > > {code:java} > public class Blah { > private static Path parquetFile = new Path("oops"); > private static Schema schema = SchemaBuilder.record("spark_schema") > .fields().optionalBytes("value").endRecord(); > private static GenericData.Record recordFor(String value) { > return new GenericRecordBuilder(schema) > .set("value", value.getBytes()).build(); > } > public static void main(String ... args) throws IOException { > try (ParquetWriter writer = AvroParquetWriter > .builder(parquetFile) > .withSchema(schema) > .build()) { > writer.write(recordFor("one")); > writer.write(recordFor("two")); > writer.write(recordFor("three")); > writer.write(recordFor("three")); > writer.write(recordFor("two")); > writer.write(recordFor("one")); > writer.write(recordFor("zero")); > } > try (ParquetReader reader = AvroParquetReader > .builder(parquetFile) > .withConf(new Configuration()).build()) { > GenericRecord rec; > int i = 0; > while ((rec = reader.read()) != null) { > ByteBuffer buf = (ByteBuffer) rec.get("value"); > byte[] bytes = new byte[buf.remaining()]; > buf.get(bytes); > System.out.println("rec " + i++ + ": " + new String(bytes)); > } > } > } > } > {code} > Expected output: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: three > rec 4: two > rec 5: one > rec 6: zero{noformat} > Actual: > {noformat} > rec 0: one > rec 1: two > rec 2: three > rec 3: > rec 4: > rec 5: > rec 6: zero{noformat} > > This was found when we started getting empty byte[] values back in spark > unexpectedly. (Spark 2.3.1 and Parquet 1.8.3). I have not tried to > reproduce with parquet 1.9.0, but its a bad enough bug that I would like a > 1.8.4 release that I can drop-in replace 1.8.3 without any binary > compatibility issues. > Duplicate byte[] values are lost. > > A few clues: > If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go > to zero. I suspect a ByteBuffer is being recycled, but the call to > ByteBuffer.get mutates it. I wonder if an appropriately placed > ByteBuffer.duplicate() would fix it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
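Until a fixed release is available, one defensive pattern on the read side, purely illustrating the reporter's ByteBuffer.duplicate() idea rather than the actual fix in the linked PR, is to avoid mutating the buffer the reader hands back:
{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.avro.generic.GenericRecord;

public class ByteBufferCopy {
  // Reads the "value" field without consuming the ByteBuffer that the reader may reuse.
  static String readValue(GenericRecord rec) {
    ByteBuffer shared = (ByteBuffer) rec.get("value");
    ByteBuffer view = shared.duplicate();  // independent position/limit, same contents
    byte[] bytes = new byte[view.remaining()];
    view.get(bytes);                       // consumes the view, not the shared buffer
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
{code}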
[jira] [Commented] (PARQUET-1457) Data set integrity tool
[ https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684275#comment-16684275 ] Ryan Blue commented on PARQUET-1457: [~gershinsky], this sounds like a reasonable extension to a table format and not really something that I think Parquet should be doing. What do you think about coming up with a proposal for snapshot integrity for [Iceberg|https://github.com/Netflix/iceberg]? > Data set integrity tool > --- > > Key: PARQUET-1457 > URL: https://issues.apache.org/jira/browse/PARQUET-1457 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp, parquet-mr >Reporter: Gidon Gershinsky >Assignee: Gidon Gershinsky >Priority: Major > > Parquet encryption protects integrity of individual files. However, data sets > (such as tables) are often written as a collection of files, say > "/path/to/dataset"/part0.parquet.encrypted > .. > "/path/to/dataset"/partN.parquet.encrypted > > In an untrusted storage, removal of one or more files will go unnoticed. > Replacement of one file contents with another will go unnoticed, unless a > user has provided unique AAD prefixes for each file. > > The data set integrity tool solves these problems. While it doesn't > necessarily belong in Parquet functionality (that is focused on individual > files (?)) - it will assist higher level frameworks that use Parquet, to > cryptographically protect integrity of data sets comprised of multiple files. > The use of this tool is not obligatory, as frameworks can use other means to > verify table (file collection) integrity. > > The tool works by creating a small file, that can be stored as say > "/path/to/dataset"/.dataset.signature > > that contains the dataset unique name (URI) and the number of files (N). The > file contents is either encrypted with AES-GCM (authenticated, encrypted) - > or hashed and signed (authenticated, plaintext). A private key issued for > each dataset. > > On the writer side, the tools creates AAD prefixes for every data file, and > creates the signature file itself. The input is the dataset URI, N and the > encryption/signature key. > > On the reader side, the tool parses and verifies the signature file, and > provides the framework with the verified dataset name, number of files that > must be accounted for, and the AAD prefix for each file. The input is the > expected dataset URI and the encryption/signature key. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
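To make the proposal concrete, one possible shape of the writer-side pieces described above, sketched with plain JDK crypto; every name, the MAC algorithm, and the file layout are assumptions for illustration, not an agreed Parquet API:
{code:java}
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class DatasetSignatureSketch {
  // Derives a per-file AAD prefix from the dataset URI and the file's index.
  static byte[] aadPrefix(byte[] key, String datasetUri, int fileIndex) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(key, "HmacSHA256"));
    return mac.doFinal((datasetUri + "#" + fileIndex).getBytes(StandardCharsets.UTF_8));
  }

  // Authenticates the (URI, file count) pair; this would be stored as .dataset.signature.
  static byte[] sign(byte[] key, String datasetUri, int fileCount) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(key, "HmacSHA256"));
    return mac.doFinal((datasetUri + "|" + fileCount).getBytes(StandardCharsets.UTF_8));
  }
}
{code}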
[jira] [Commented] (PARQUET-1414) Limit page size based on maximum row count
[ https://issues.apache.org/jira/browse/PARQUET-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653995#comment-16653995 ] Ryan Blue commented on PARQUET-1414: [~gszadovszky], can you add a link to your benchmarks to this issue? I think the conclusion we came to while discussing was between 10k and 20k, with 20k being the better choice for overall file size. Is 20k the planned default now? > Limit page size based on maximum row count > -- > > Key: PARQUET-1414 > URL: https://issues.apache.org/jira/browse/PARQUET-1414 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.11.0 > > > For column index based filtering it is important to have enough pages for a > column. In case of a perfectly matching encoding for the suitable data it can > happen that all of the values can be encoded in one page (e.g. a column of an > ascending counter). > With this improvement we would be able to limit the pages by the maximum > number of rows to be written in it so we would have enough pages for every > column. A good default value should be benchmarked. For initial, we can use > 10k. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
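For anyone who wants to experiment once the limit ships, a sketch of overriding it through the Hadoop configuration; the property key below is an assumption about where the work landed, so check the constants in your release:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class PageRowCountLimitExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key for the row-count cap discussed above; the benchmarked default
    // mentioned in the comments ended up in the 20k range.
    conf.setInt("parquet.page.row.count.limit", 20000);
    // pass conf to ParquetOutputFormat / the writer builder as usual
  }
}
{code}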
[jira] [Commented] (PARQUET-1432) ACID support
[ https://issues.apache.org/jira/browse/PARQUET-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634298#comment-16634298 ] Ryan Blue commented on PARQUET-1432: [~yumwang], ACID guarantees are a feature of the table layout, not the file format. I don't think Parquet needs to do anything differently to support this. What are you proposing to change in Parquet? > ACID support > > > Key: PARQUET-1432 > URL: https://issues.apache.org/jira/browse/PARQUET-1432 > Project: Parquet > Issue Type: New Feature > Components: parquet-format, parquet-mr >Affects Versions: 1.10.1 >Reporter: Yuming Wang >Priority: Major > > https://orc.apache.org/docs/acid.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1201) Column indexes
[ https://issues.apache.org/jira/browse/PARQUET-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631130#comment-16631130 ] Ryan Blue commented on PARQUET-1201: [~gszadovszky], where is the branch for page skipping? Is it this one? https://github.com/apache/parquet-mr/tree/column-indexes I just went to review it, but I don't see a PR. Could you open one against master? > Column indexes > -- > > Key: PARQUET-1201 > URL: https://issues.apache.org/jira/browse/PARQUET-1201 > Project: Parquet > Issue Type: New Feature >Affects Versions: 1.10.0 >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: format-2.5.0 > > > Write the column indexes described in PARQUET-922. > This is the first phase of implementing the whole feature. The > implementation is done in the following steps: > * Utility to read/write indexes in parquet-format > * Writing indexes in the parquet file > * Extend parquet-tools and parquet-cli to show the indexes > * Limit index size based on parquet properties > * Trim min/max values where possible based on parquet properties > * Filtering based on column indexes > The work is done on the feature branch {{column-indexes}}. This JIRA will be > resolved after the branch has been merged to {{master}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR
[ https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581275#comment-16581275 ] Ryan Blue commented on PARQUET-632: --- [~pkgajulapalli], can you go ahead and post the stack trace? I thought you said you were using 2.2.0. These classes definitely changed. > Parquet file in invalid state while writing to S3 from EMR > -- > > Key: PARQUET-632 > URL: https://issues.apache.org/jira/browse/PARQUET-632 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Peter Halliday >Priority: Blocker > > I'm writing parquet to S3 from Spark 1.6.1 on EMR. And when it got to the > last few files to write to S3, I received this stacktrace in the log with no > other errors before or after it. It's very consistent. This particular > batch keeps erroring the same way. > {noformat} > 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager > [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, > ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed > while writing rows. > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: The file being written is in an invalid > state. Probably caused by an error thrown previously. Current state: COLUMN > at > org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146) > at > org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138) > at > org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at > org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405) > ... 8 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR
[ https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579966#comment-16579966 ] Ryan Blue commented on PARQUET-632: --- [~pkgajulapalli], there isn't enough information here to know what's happening. Can you post the schema of the dataframe you're writing and a stack trace? > Parquet file in invalid state while writing to S3 from EMR > -- > > Key: PARQUET-632 > URL: https://issues.apache.org/jira/browse/PARQUET-632 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: Peter Halliday >Priority: Blocker > > I'm writing parquet to S3 from Spark 1.6.1 on EMR. And when it got to the > last few files to write to S3, I received this stacktrace in the log with no > other errors before or after it. It's very consistent. This particular > batch keeps erroring the same way. > {noformat} > 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager > [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, > ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed > while writing rows. > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: The file being written is in an invalid > state. Probably caused by an error thrown previously. Current state: COLUMN > at > org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146) > at > org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138) > at > org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at > org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405) > ... 8 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1341) Null count is suppressed when columns have no min or max and use unsigned sort order
Ryan Blue created PARQUET-1341: -- Summary: Null count is suppressed when columns have no min or max and use unsigned sort order Key: PARQUET-1341 URL: https://issues.apache.org/jira/browse/PARQUET-1341 Project: Parquet Issue Type: Bug Components: parquet-mr Affects Versions: 1.10.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.10.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-381) It should be possible to merge summary files, and control which files are generated
[ https://issues.apache.org/jira/browse/PARQUET-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-381: -- Fix Version/s: (was: 2.0.0) 1.9.0 > It should be possible to merge summary files, and control which files are > generated > --- > > Key: PARQUET-381 > URL: https://issues.apache.org/jira/browse/PARQUET-381 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Alex Levenson >Assignee: Alex Levenson >Priority: Major > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-381) It should be possible to merge summary files, and control which files are generated
[ https://issues.apache.org/jira/browse/PARQUET-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491350#comment-16491350 ] Ryan Blue commented on PARQUET-381: --- Fixed. Thanks for pointing this out. > It should be possible to merge summary files, and control which files are > generated > --- > > Key: PARQUET-381 > URL: https://issues.apache.org/jira/browse/PARQUET-381 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Alex Levenson >Assignee: Alex Levenson >Priority: Major > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties
[ https://issues.apache.org/jira/browse/PARQUET-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1309: --- Description: In SPARK-24251, we found that the changes to use HadoopReadOptions accidentally switched the [properties that enable stats and dictionary filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83]. Both are enabled by default so it is unlikely that anyone will need to turn them off and there is an easy work-around, but we should fix the properties for 1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 1.8.x). (was: In SPARK-24251, we found that the changes to use HadoopReadOptions accidentally switched the [properties that enable stats and dictionary filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83]. Both are enabled by default so it is unlikely that anyone will need to turn them off and there is an easy work-around, but we should fix the properties for 1.10.0. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 1.8.x).) > Parquet Java uses incorrect stats and dictionary filter properties > -- > > Key: PARQUET-1309 > URL: https://issues.apache.org/jira/browse/PARQUET-1309 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Ryan Blue >Priority: Major > Fix For: 1.10.1 > > > In SPARK-24251, we found that the changes to use HadoopReadOptions > accidentally switched the [properties that enable stats and dictionary > filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83]. > Both are enabled by default so it is unlikely that anyone will need to turn > them off and there is an easy work-around, but we should fix the properties > for 1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is > on 1.8.x). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties
Ryan Blue created PARQUET-1309: -- Summary: Parquet Java uses incorrect stats and dictionary filter properties Key: PARQUET-1309 URL: https://issues.apache.org/jira/browse/PARQUET-1309 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Ryan Blue Fix For: 1.10.1 In SPARK-24251, we found that the changes to use HadoopReadOptions accidentally switched the [properties that enable stats and dictionary filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83]. Both are enabled by default so it is unlikely that anyone will need to turn them off and there is an easy work-around, but we should fix the properties for 1.10.0. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 1.8.x). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
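A simple work-around on an affected release is to set both flags explicitly to the same value so the swapped lookup is harmless; the key strings below are assumptions, so prefer the constants exposed by ParquetInputFormat in your version:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class FilterPropertyWorkaround {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Both filters default to true; setting both keys to the same value sidesteps
    // the swapped property lookup described in this issue.
    conf.setBoolean("parquet.filter.statistics.enabled", true);
    conf.setBoolean("parquet.filter.dictionary.enabled", true);
  }
}
{code}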
[jira] [Commented] (PARQUET-1295) Parquet libraries do not follow proper semantic versioning
[ https://issues.apache.org/jira/browse/PARQUET-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483153#comment-16483153 ] Ryan Blue commented on PARQUET-1295: Since there is not a well-defined public API, I understand how it is annoying to find out that some classes are internal. But the APIs referenced here are definitely something that we've always considered internal. We use 1.7.0 for semver checks because that's the oldest release that we want public API compatibility with (even though "public API" is not well defined). We only add exclusions for private classes when they change, so growing this list is essentially marking APIs private as we make changes. I think that's worth keeping rather than needing to add to the list when we make changes to a private class each release. > Parquet libraries do not follow proper semantic versioning > -- > > Key: PARQUET-1295 > URL: https://issues.apache.org/jira/browse/PARQUET-1295 > Project: Parquet > Issue Type: Bug >Reporter: Vlad Rozov >Priority: Major > > There are changes between 1.8.0 and 1.10.0 that break API compatibility. A > minor version change is supposed to be backward compatible with 1.9.0 and > 1.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1189) Release Parquet Java 1.10
[ https://issues.apache.org/jira/browse/PARQUET-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1189. Resolution: Fixed > Release Parquet Java 1.10 > - > > Key: PARQUET-1189 > URL: https://issues.apache.org/jira/browse/PARQUET-1189 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > Please link needed issues as blockers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1264) Update Javadoc for Java 1.8
[ https://issues.apache.org/jira/browse/PARQUET-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1264. Resolution: Fixed > Update Javadoc for Java 1.8 > --- > > Key: PARQUET-1264 > URL: https://issues.apache.org/jira/browse/PARQUET-1264 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > After moving the build to Java 1.8, the release procedure no longer works > because Javadoc generation fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1253) Support for new logical type representation
[ https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425994#comment-16425994 ] Ryan Blue commented on PARQUET-1253: Not including the UUID logical type in that union is probably an accident. MAP_KEY_VALUE is no longer used. It is noted in [backward compatibility rules|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1], but is not required for any types. The [comment "only valid for primitives"|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.5.0/src/main/thrift/parquet.thrift#L384] is incorrect. I think we can remove it. I'm not sure why the comment was there. > Support for new logical type representation > --- > > Key: PARQUET-1253 > URL: https://issues.apache.org/jira/browse/PARQUET-1253 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > > Latest parquet-format > [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252] > a new representation for logical types. As of now this is not yet supported > in parquet-mr, thus there's no way to use parametrized UTC normalized > timestamp data types. When reading and writing Parquet files, besides > 'converted_type' parquet-mr should use the new 'logicalType' field in > SchemaElement to tell the current logical type annotation. To maintain > backward compatibility, the semantic of converted_type shouldn't change. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (PARQUET-1264) Update Javadoc for Java 1.8
Ryan Blue created PARQUET-1264: -- Summary: Update Javadoc for Java 1.8 Key: PARQUET-1264 URL: https://issues.apache.org/jira/browse/PARQUET-1264 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.9.0 Reporter: Ryan Blue Assignee: Ryan Blue Fix For: 1.10.0 After moving the build to Java 1.8, the release procedure no longer works because Javadoc generation fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1263. Resolution: Fixed Assignee: Ryan Blue Merged #464. > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder
[ https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1183. Resolution: Fixed Assignee: Ryan Blue Merged #460. Thanks [~zi] for reviewing! > AvroParquetWriter needs OutputFile based Builder > > > Key: PARQUET-1183 > URL: https://issues.apache.org/jira/browse/PARQUET-1183 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > The ParquetWriter got a new Builder(OutputFile). > But it cannot be used by the AvroParquetWriter as there is no matching > Builder/Constructor. > Changes are quite simple: > public static Builder builder(OutputFile file) { > return new Builder(file) > } > and in the static Builder class below > private Builder(OutputFile file) { > super(file); > } > Note: I am not good enough with builds, maven and git to create a pull > request yet. Sorry. Will try to get better here. > See: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
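With the merged change, usage looks roughly like this; the schema and output path are illustrative, and HadoopOutputFile is used here only as one OutputFile implementation that ships with parquet-hadoop:
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.io.OutputFile;

public class OutputFileBuilderExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("example").fields()
        .requiredString("value").endRecord();

    Configuration conf = new Configuration();
    OutputFile out = HadoopOutputFile.fromPath(new Path("example.parquet"), conf);

    // builder(OutputFile) is the overload added by this issue
    try (ParquetWriter<GenericData.Record> writer =
             AvroParquetWriter.<GenericData.Record>builder(out)
                 .withSchema(schema)
                 .build()) {
      writer.write(new GenericRecordBuilder(schema).set("value", "one").build());
    }
  }
}
{code}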
[jira] [Created] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
Ryan Blue created PARQUET-1263: -- Summary: ParquetReader's builder should use Configuration from the InputFile Key: PARQUET-1263 URL: https://issues.apache.org/jira/browse/PARQUET-1263 Project: Parquet Issue Type: Improvement Reporter: Ryan Blue ParquetReader can be built using an InputFile, which may be a HadoopInputFile and have a Configuration. If it is, ParquetHadoopOptions should be based on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1263: --- Fix Version/s: 1.10.0 > ParquetReader's builder should use Configuration from the InputFile > --- > > Key: PARQUET-1263 > URL: https://issues.apache.org/jira/browse/PARQUET-1263 > Project: Parquet > Issue Type: Improvement >Reporter: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > ParquetReader can be built using an InputFile, which may be a HadoopInputFile > and have a Configuration. If it is, ParquetHadoopOptions should be based > on that configuration instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class
[ https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1184. Resolution: Won't Fix Fix Version/s: (was: 1.10.0) > Make DelegatingPositionOutputStream a concrete class > > > Key: PARQUET-1184 > URL: https://issues.apache.org/jira/browse/PARQUET-1184 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Priority: Major > > I fail to understand why this is an abstract class. In my example I want to > write the Parquet file to a java.io.FileOutputStream, hence have to extend > the DelegatingPositionOutputStream and store the pos information, increase it > in all write(..) methods and return its value in getPos(). > Doable of course, but useful? Previously yes but now with the OutputFile > changes to decouple it from Hadoop more, I believe no. > related to: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class
[ https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420982#comment-16420982 ] Ryan Blue commented on PARQUET-1184: The reason why this is an abstract class is so that you can use it to wrap implementations that provide a position, like Hadoop's FsOutputStream. It would not be correct to assume that the position is at the current number of bytes written to the underlying stream. An implementation could wrap RandomAccessFile and expose its seek method, which would invalidate the delegating stream's position. The delegating class is present for convenience only. You don't have to use it and can implement your own logic as long as you implement PositionOutputStream. > Make DelegatingPositionOutputStream a concrete class > > > Key: PARQUET-1184 > URL: https://issues.apache.org/jira/browse/PARQUET-1184 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.9.1 >Reporter: Werner Daehn >Priority: Major > Fix For: 1.10.0 > > > I fail to understand why this is an abstract class. In my example I want to > write the Parquet file to a java.io.FileOutputStream, hence have to extend > the DelegatingPositionOutputStream and store the pos information, increase it > in all write(..) methods and return its value in getPos(). > Doable of course, but useful? Previously yes but now with the OutputFile > changes to decouple it from Hadoop more, I believe no. > related to: https://issues.apache.org/jira/browse/PARQUET-1142 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
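For the local-file case from the description, a minimal sketch of implementing PositionOutputStream directly when the underlying stream has no position of its own; the class and field names are made up for illustration, and the counter is only valid because writes here are strictly sequential:
{code:java}
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.parquet.io.PositionOutputStream;

// Wraps a plain FileOutputStream and tracks the position by counting written bytes.
public class LocalPositionOutputStream extends PositionOutputStream {
  private final FileOutputStream out;
  private long pos = 0;

  public LocalPositionOutputStream(String path) throws IOException {
    this.out = new FileOutputStream(path);
  }

  @Override
  public long getPos() {
    return pos;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    pos += len;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}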
[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1028: --- Fix Version/s: 1.10.0 > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-1028. Resolution: Fixed Assignee: Zoltan Ivanfi > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Assignee: Zoltan Ivanfi >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420962#comment-16420962 ] Ryan Blue commented on PARQUET-1028: This was fixed by PARQUET-1065. The expected sort order for INT96 is now UNKNOWN, so stats are discarded. > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > Fix For: 1.10.0 > > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1055) Improve the creation of ExecutorService when reading footers
[ https://issues.apache.org/jira/browse/PARQUET-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1055: --- Fix Version/s: (was: 1.9.1) > Improve the creation of ExecutorService when reading footers > > > Key: PARQUET-1055 > URL: https://issues.apache.org/jira/browse/PARQUET-1055 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Benoit Lacelle >Priority: Minor > > Doing some benchmarks loading a large set of parquet files (3000+) from the local FS, we observed some inefficiencies in the number of created threads when reading footers. > When reading footers, it reads the configured parallelism from the Hadoop configuration (defaulting to 5) and allocates 2 ExecutorServices with 5 threads each. This is especially inefficient when there are fewer Callables to handle than the configured parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
[ https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1028: --- Fix Version/s: (was: 1.9.1) > [JAVA] When reading old Spark-generated files with INT96, stats are reported > as valid when they aren't > --- > > Key: PARQUET-1028 > URL: https://issues.apache.org/jira/browse/PARQUET-1028 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jacques Nadeau >Priority: Major > > Found that the condition > [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55] > is missing a check for INT96. Since INT96 stats are also corrupt with old > versions of Parquet, the code here shouldn't short-circuit return. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1174) Concurrent read micro benchmarks
[ https://issues.apache.org/jira/browse/PARQUET-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1174: --- Fix Version/s: (was: 1.9.1) > Concurrent read micro benchmarks > > > Key: PARQUET-1174 > URL: https://issues.apache.org/jira/browse/PARQUET-1174 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Takeshi Yoshimura >Priority: Minor > > parquet-benchmarks only contain read and write benchmarks with a single > thread. > I add concurrent Parquet file scans like typical data-parallel computing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled
[ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-796: -- Fix Version/s: (was: 1.9.1) > Delta Encoding is not used when dictionary enabled > -- > > Key: PARQUET-796 > URL: https://issues.apache.org/jira/browse/PARQUET-796 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jakub Liska >Priority: Critical > > Current code doesn't enable using both Delta Encoding and Dictionary > Encoding. If I instantiate ParquetWriter like this : > {code} > val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, > blockSize, pageSize, dictPageSize, enableDictionary = true, true, > ParquetProperties.WriterVersion.PARQUET_2_0, configuration) > {code} > Then this piece of code : > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86 > Causes that DictionaryValuesWriter is used instead of the inferred > DeltaLongEncodingWriter. > The original issue is here : > https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
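Until the value-writer selection is smarter, one way to actually get the delta encodings is to disable the dictionary on the writer. A sketch using the example Group writer with an assumed toy schema; whether disabling the dictionary is acceptable depends on the data:
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DeltaWithoutDictionary {
  static ParquetWriter<Group> open(Path outFile) throws Exception {
    // toy schema for illustration only
    MessageType schema = MessageTypeParser.parseMessageType("message m { required int64 n; }");
    return ExampleParquetWriter.builder(outFile)
        .withType(schema)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .withDictionaryEncoding(false)  // otherwise the dictionary writer wins, as described above
        .build();
  }
}
{code}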
[jira] [Updated] (PARQUET-1153) Parquet-thrift doesn't compile with Thrift 0.10.0
[ https://issues.apache.org/jira/browse/PARQUET-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1153: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Parquet-thrift doesn't compile with Thrift 0.10.0 > - > > Key: PARQUET-1153 > URL: https://issues.apache.org/jira/browse/PARQUET-1153 > Project: Parquet > Issue Type: Bug >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > Parquet-thrift doesn't compile with Thrift 0.10.0 due to THRIFT-2263. The > default generator parameter used for {{--gen}} argument by Thrift Maven > plugin is no longer supported, this can be fixed with an additional > {{java}} parameter to Thrift Maven plugin. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-777) Add new Parquet CLI tools
[ https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-777. --- Resolution: Fixed > Add new Parquet CLI tools > - > > Key: PARQUET-777 > URL: https://issues.apache.org/jira/browse/PARQUET-777 > Project: Parquet > Issue Type: Improvement > Components: parquet-cli >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.9.1 > > > This issue tracks adding parquet-cli from > [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1152) Parquet-thrift doesn't compile with Thrift 0.9.3
[ https://issues.apache.org/jira/browse/PARQUET-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1152: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Parquet-thrift doesn't compile with Thrift 0.9.3 > > > Key: PARQUET-1152 > URL: https://issues.apache.org/jira/browse/PARQUET-1152 > Project: Parquet > Issue Type: Bug >Reporter: Nandor Kollar >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > Parquet-thrift doesn't compile with Thrift 0.9.3, because > TBinaryProtocol#setReadLength method was removed. > PARQUET-180 already addressed the problem, but only in runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-777) Add new Parquet CLI tools
[ https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-777: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Add new Parquet CLI tools > - > > Key: PARQUET-777 > URL: https://issues.apache.org/jira/browse/PARQUET-777 > Project: Parquet > Issue Type: Improvement > Components: parquet-cli >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > This issue tracks adding parquet-cli from > [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1135: --- Fix Version/s: (was: 1.9.1) 1.10.0 > upgrade thrift and protobuf dependencies > > > Key: PARQUET-1135 > URL: https://issues.apache.org/jira/browse/PARQUET-1135 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Fix For: 1.10.0 > > > thrift 0.7.0 -> 0.9.3 > protobuf 3.2 -> 3.5.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge
[ https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1115: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Warn users when misusing parquet-tools merge > > > Key: PARQUET-1115 > URL: https://issues.apache.org/jira/browse/PARQUET-1115 > Project: Parquet > Issue Type: Improvement >Reporter: Zoltan Ivanfi >Assignee: Nandor Kollar >Priority: Major > Fix For: 1.10.0 > > > To prevent users from using {{parquet-tools merge}} in scenarios where its > use is not practical, we should describe its limitations in the help text of > this command. Additionally, we should add a warning to the output of the > merge command if the size of the original row groups are below a threshold. > Reasoning: > Many users are tempted to use the new {{parquet-tools merge}} functionality, > because they want to achieve good performance and historically that has been > associated with large Parquet files. However, in practice Hive performance > won't change significantly after using {{parquet-tools merge}}, but Impala > performance will be much worse. The reason for that is that good performance > is not a result of large files but large rowgroups instead (up to the HDFS > block size). > However, {{parquet-tools merge}} does not merge rowgroups, it just places > them one after the other. It was intended to be used for Parquet files that > are already arranged in row groups of the desired size. When used to merge > many small files, the resulting file will still contain small row groups and > one loses most of the advantages of larger files (the only one that remains > is that it takes a single HDFS operation to read them). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1149) Upgrade Avro dependancy to 1.8.2
[ https://issues.apache.org/jira/browse/PARQUET-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1149: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Upgrade Avro dependancy to 1.8.2 > > > Key: PARQUET-1149 > URL: https://issues.apache.org/jira/browse/PARQUET-1149 > Project: Parquet > Issue Type: Improvement >Reporter: Fokko Driesprong >Priority: Major > Fix For: 1.10.0 > > > I would like to update the Avro dependancy to 1.8.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1141) IDs are dropped in metadata conversion
[ https://issues.apache.org/jira/browse/PARQUET-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1141: --- Fix Version/s: (was: 1.9.1) 1.10.0 > IDs are dropped in metadata conversion > -- > > Key: PARQUET-1141 > URL: https://issues.apache.org/jira/browse/PARQUET-1141 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0, 1.8.2 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1025: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Support new min-max statistics in parquet-mr > > > Key: PARQUET-1025 > URL: https://issues.apache.org/jira/browse/PARQUET-1025 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Affects Versions: 1.9.1 >Reporter: Zoltan Ivanfi >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.10.0 > > > Impala started using new min-max statistics that got specified as part of > PARQUET-686. Support for these should be added to parquet-mr as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1077) [MR] Switch to long key ids in KEYs file
[ https://issues.apache.org/jira/browse/PARQUET-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1077: --- Fix Version/s: (was: 1.9.1) > [MR] Switch to long key ids in KEYs file > > > Key: PARQUET-1077 > URL: https://issues.apache.org/jira/browse/PARQUET-1077 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Lars Volker >Assignee: Lars Volker >Priority: Major > Fix For: 2.0.0, 1.10.0 > > > PGP keys should be longer than 32bit, as outlined on https://evil32.com/. We > should fix the KEYS file in parquet-mr. I will push a PR shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-791) Predicate pushing down on missing columns should work on UserDefinedPredicate too
[ https://issues.apache.org/jira/browse/PARQUET-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-791: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Predicate pushing down on missing columns should work on UserDefinedPredicate > too > - > > Key: PARQUET-791 > URL: https://issues.apache.org/jira/browse/PARQUET-791 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 1.10.0 > > > This is related to PARQUET-389. PARQUET-389 fixes the predicate pushing down > on missing columns. But it doesn't fix it for UserDefinedPredicate. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title
[ https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1024: --- Fix Version/s: (was: 1.9.1) 1.10.0 > allow for case insensitive parquet-xxx prefix in PR title > - > > Key: PARQUET-1024 > URL: https://issues.apache.org/jira/browse/PARQUET-1024 > Project: Parquet > Issue Type: Improvement >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-1005) Fix DumpCommand parsing to allow column projection
[ https://issues.apache.org/jira/browse/PARQUET-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1005: --- Fix Version/s: (was: 1.9.1) 1.10.0 > Fix DumpCommand parsing to allow column projection > -- > > Key: PARQUET-1005 > URL: https://issues.apache.org/jira/browse/PARQUET-1005 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.8.0, 1.8.1, 1.9.0, 2.0.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov >Priority: Major > Fix For: 1.10.0 > > > The DumpCommand option -c is specified with hasArgs(), which accepts an unlimited > number of arguments after -c. The option's own description shows the real intent > was hasArg(), so that multiple columns can be specified as '-c c1 -c c2 ...'. > With hasArgs(), the input path is parsed as an argument to -c instead of as an > argument to the command itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
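For illustration only, here is a minimal Apache Commons CLI sketch of the difference described in the report; it is not the actual DumpCommand code, and the class name is made up. With hasArg(), each -c occurrence takes exactly one value, so '-c c1 -c c2 file.parquet' leaves the file path as an argument to the command itself; with hasArgs(), the option would also swallow the path.

{code:java}
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

public class ColumnOptionSketch {
  public static void main(String[] args) throws Exception {
    Options options = new Options();
    options.addOption(Option.builder("c")
        .hasArg() // exactly one value per -c occurrence; hasArgs() would consume every following token
        .desc("column to project; repeat the flag for multiple columns")
        .build());

    CommandLine cmd = new DefaultParser().parse(options,
        new String[] {"-c", "a.b", "-c", "c.d", "/path/to/file.parquet"});

    // Prints [a.b, c.d] and [/path/to/file.parquet]
    System.out.println(java.util.Arrays.toString(cmd.getOptionValues("c")));
    System.out.println(java.util.Arrays.toString(cmd.getArgs()));
  }
}
{code}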
[jira] [Updated] (PARQUET-1026) allow unsigned binary stats when min == max
[ https://issues.apache.org/jira/browse/PARQUET-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-1026: --- Fix Version/s: (was: 1.9.1) 1.10.0 > allow unsigned binary stats when min == max > --- > > Key: PARQUET-1026 > URL: https://issues.apache.org/jira/browse/PARQUET-1026 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-801) Allow UserDefinedPredicates in DictionaryFilter
[ https://issues.apache.org/jira/browse/PARQUET-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-801: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Allow UserDefinedPredicates in DictionaryFilter > --- > > Key: PARQUET-801 > URL: https://issues.apache.org/jira/browse/PARQUET-801 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Patrick Woody >Assignee: Patrick Woody >Priority: Major > Fix For: 1.10.0 > > > UserDefinedPredicate is not implemented for dictionary filtering. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB
[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated PARQUET-321: -- Fix Version/s: (was: 1.9.1) 1.10.0 > Set the HDFS padding default to 8MB > --- > > Key: PARQUET-321 > URL: https://issues.apache.org/jira/browse/PARQUET-321 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > PARQUET-306 added the ability to pad row groups so that they align with HDFS > blocks to avoid remote reads. The ParquetFileWriter will now either pad the > remaining space in the block or target a row group for the remaining size. > The padding maximum controls the threshold of the amount of padding that will > be used. If the space left is under this threshold, it is padded. If it is > greater than this threshold, then the next row group is fit into the > remaining space. The current padding maximum is 0. > I think we should change the padding maximum to 8MB. My reasoning is this: we > want this number to be small enough that it won't prevent the library from > writing reasonable row groups, but larger than the minimum size row group we > would want to write. 8MB is 1/16th of the row group default, so I think it is > reasonable: we don't want a row group to be smaller than 8 MB. > We also want this to be large enough that a few row groups in a block don't > cause a tiny row group to be written in the excess space. 8MB accounts for 4 > row groups that are 2MB under-size. In addition, it is reasonable to not > allow row groups under 8MB. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
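As a rough illustration of the heuristic described above (hypothetical names, not the actual ParquetFileWriter code): if the space left in the HDFS block is at or below the padding maximum, pad it; otherwise target the next row group at the remaining space.

{code:java}
// Illustrative sketch of the padding heuristic; not parquet-mr's actual logic.
public class PaddingHeuristicSketch {
  static final long MAX_PADDING = 8L * 1024 * 1024; // the proposed 8MB default

  static long nextRowGroupTarget(long blockSize, long bytesInCurrentBlock, long defaultRowGroupSize) {
    long remaining = blockSize - bytesInCurrentBlock;
    if (remaining <= MAX_PADDING) {
      // Small leftover: pad to the block boundary and start a full-size row group.
      return defaultRowGroupSize;
    }
    // Large leftover: fit the next row group into the remaining space instead of padding.
    return Math.min(defaultRowGroupSize, remaining);
  }

  public static void main(String[] args) {
    long block = 256L * 1024 * 1024;     // 256MB HDFS block
    long rowGroup = 128L * 1024 * 1024;  // 128MB default row group
    System.out.println(nextRowGroupTarget(block, 250L * 1024 * 1024, rowGroup)); // 6MB left: pad
    System.out.println(nextRowGroupTarget(block, 200L * 1024 * 1024, rowGroup)); // 56MB left: target 56MB
  }
}
{code}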
[jira] [Commented] (PARQUET-1222) Definition of float and double sort order is ambiguous
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412154#comment-16412154 ] Ryan Blue commented on PARQUET-1222: I think Jim is right. IEEE-754 numbers are ordered correctly if you flip the sign bit and use unsigned, byte-wise comparison. I wrote a spec for encoding HBase keys that used this a while ago. The reason why rule 3 works is that for normal floating point numbers, the significand must start with a 1. Conceptually, this means that 0.0001 and 0.1 cannot be represented with the same exponent. Because the exponent basically encodes where the first set bit is in the number, it can be used for sorting. There is also support for very small numbers where the significand doesn't start with 1, but those must use the smallest-possible exponent so sorting still works. > Definition of float and double sort order is ambiguous > -- > > Key: PARQUET-1222 > URL: https://issues.apache.org/jira/browse/PARQUET-1222 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Zoltan Ivanfi >Priority: Critical > > Currently parquet-format specifies the sort order for floating point numbers > as follows: > {code:java} >* FLOAT - signed comparison of the represented value >* DOUBLE - signed comparison of the represented value > {code} > The problem is that the comparison of floating point numbers is only a > partial ordering with strange behaviour in specific corner cases. For > example, according to IEEE 754, -0 is neither less nor more than +0 and > comparing NaN to anything always returns false. This ordering is not suitable > for statistics. Additionally, the Java implementation already uses a > different (total) ordering that handles these cases correctly but differently > than the C++ implementations, which leads to interoperability problems. > TypeDefinedOrder for doubles and floats should be deprecated and a new > TotalFloatingPointOrder should be introduced. The default for writing doubles > and floats would be the new TotalFloatingPointOrder. This ordering should be > effective and easy to implement in all programming languages. > For reading existing stats created using TypeDefinedOrder, the following > compatibility rules should be applied: > * When looking for NaN values, min and max should be ignored. > * If the min is a NaN, it should be ignored. > * If the max is a NaN, it should be ignored. > * If the min is +0, the row group may contain -0 values as well. > * If the max is -0, the row group may contain +0 values as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
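For illustration only (not taken from parquet-mr or the HBase key spec mentioned above), a common way to realize this ordering in Java flips the sign bit of non-negative values and all bits of negative values; unsigned comparison of the resulting keys then gives a total order:

{code:java}
// Illustrative sketch: map float bits to keys whose unsigned comparison is a total order.
public class FloatOrderSketch {
  static int orderKey(float f) {
    int bits = Float.floatToRawIntBits(f);
    // Negative floats: flip all bits so more-negative values sort first.
    // Non-negative floats: flip only the sign bit.
    return bits < 0 ? ~bits : bits ^ Integer.MIN_VALUE;
  }

  static int compareTotal(float a, float b) {
    return Integer.compareUnsigned(orderKey(a), orderKey(b));
  }

  public static void main(String[] args) {
    System.out.println(compareTotal(-0.0f, 0.0f) < 0);             // true: -0 sorts before +0
    System.out.println(compareTotal(Float.NaN, 1e30f) > 0);        // true: NaN sorts after finite values
    System.out.println(compareTotal(-Float.MIN_VALUE, -0.0f) < 0); // true: subnormals order correctly
  }
}
{code}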
[jira] [Commented] (PARQUET-1241) Use LZ4 frame format
[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395711#comment-16395711 ] Ryan Blue commented on PARQUET-1241: Does anyone know what the Hadoop compression codec produces? That's what we're using in the Java implementation, so that's what the current LZ4 codec name indicates. I didn't realize there were multiple formats. > Use LZ4 frame format > > > Key: PARQUET-1241 > URL: https://issues.apache.org/jira/browse/PARQUET-1241 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp, parquet-format >Reporter: Lawrence Chan >Priority: Major > > The parquet-format spec doesn't currently specify whether lz4-compressed data > should be framed or not. We should choose one and make it explicit in the > spec, as they are not inter-operable. After some discussions with others [1], > we think it would be beneficial to use the framed format, which adds a small > header in exchange for more self-contained decompression as well as a richer > feature set (checksums, parallel decompression, etc). > The current arrow implementation compresses using the lz4 block format, and > this would need to be updated when we add the spec clarification. > If backwards compatibility is a concern, I would suggest adding an additional > LZ4_FRAMED compression type, but that may be more noise than anything. > [1] https://github.com/dask/fastparquet/issues/314 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
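For reference, a hedged sketch of what the framed format looks like in code, assuming the lz4-java library's LZ4FrameOutputStream/LZ4FrameInputStream (and Java 9+ for readAllBytes). This only illustrates the frame format with its self-describing header; it does not show what the Hadoop codec or parquet-mr currently produce.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4FrameInputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;

// Round-trips a payload through the LZ4 frame format (framed header, optional checksums).
public class Lz4FrameSketch {
  public static void main(String[] args) throws Exception {
    byte[] payload = "framed lz4 example".getBytes(StandardCharsets.UTF_8);

    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (LZ4FrameOutputStream out = new LZ4FrameOutputStream(compressed)) {
      out.write(payload);
    }

    try (LZ4FrameInputStream in =
             new LZ4FrameInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}
{code}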
[jira] [Commented] (PARQUET-1238) Invalid links found in parquet site document page
[ https://issues.apache.org/jira/browse/PARQUET-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379687#comment-16379687 ] Ryan Blue commented on PARQUET-1238: I didn't realize the patch was for the SVN site. Thanks, I'll take a look and should be able to commit it as is. > Invalid links found in parquet site document page > - > > Key: PARQUET-1238 > URL: https://issues.apache.org/jira/browse/PARQUET-1238 > Project: Parquet > Issue Type: Bug >Reporter: xuchuanyin >Priority: Trivial > Attachments: PARQUET-1238_fixed_invalid_links_in_latest_html_md.patch > > > Links to pictures in document page are invalid, such as Section ‘File Format’ > and ‘Metadata’ > > Links to external documents in document page are invalid, such as Section > 'Motivation', 'Logical Types' and 'Data Pages' > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1238) Invalid links found in parquet site document page
[ https://issues.apache.org/jira/browse/PARQUET-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378969#comment-16378969 ] Ryan Blue commented on PARQUET-1238: [~xuchuanyin], thanks for fixing this. Could you post your patch as a pull request on github? > Invalid links found in parquet site document page > - > > Key: PARQUET-1238 > URL: https://issues.apache.org/jira/browse/PARQUET-1238 > Project: Parquet > Issue Type: Bug >Reporter: xuchuanyin >Priority: Trivial > Attachments: PARQUET-1238_fixed_invalid_links_in_latest_html_md.patch > > > Links to pictures in document page are invalid, such as Section ‘File Format’ > and ‘Metadata’ > > Links to external documents in document page are invalid, such as Section > 'Motivation', 'Logical Types' and 'Data Pages' > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (PARQUET-796) Delta Encoding is not used when dictionary enabled
[ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377382#comment-16377382 ] Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM: I don't recommend using the delta long encoding because I think we need to update to better encodings (specifically, the zig-zag-encoding ones in [this branch|https://github.com/rdblue/parquet-mr/commits/encoders]). We could definitely use a better fallback, but I don't think the solution is to turn off dictionary encoding. If you can use dictionary encoding to get a smaller size, you should. The problem is when dictionary encoding needs to test whether another encoding would be better. It currently tests against plain and uses plain. We should have it test against a delta encoding and use one. This kind of improvement is why we added PARQUET-601. We want to be able to test out different ways of choosing an encoding at write time. But we do not want to make it so that users must specify their own encodings because we want Parquet to select them automatically and get the choice right. PARQUET-601 is about testing out strategies that we release as the defaults. was (Author: rdblue): I don't recommend using the delta long encoding because I think we need to update to better encodings (specifically, the zig-zag-encoding ones in this branch). We could definitely use a better fallback, but I don't think the solution is to turn off dictionary encoding. If you can use dictionary encoding to get a smaller size, you should. The problem is when dictionary encoding needs to test whether another encoding would be better. It currently tests against plain and uses plain. We should have it test against a delta encoding and use one. This kind of improvement is why we added PARQUET-601. We want to be able to test out different ways of choosing an encoding at write time. But we do not want to make it so that users must specify their own encodings because we want Parquet to select them automatically and get the choice right. PARQUET-601 is about testing out strategies that we release as the defaults. > Delta Encoding is not used when dictionary enabled > -- > > Key: PARQUET-796 > URL: https://issues.apache.org/jira/browse/PARQUET-796 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jakub Liska >Priority: Critical > Fix For: 1.9.1 > > > Current code doesn't enable using both Delta Encoding and Dictionary > Encoding. If I instantiate ParquetWriter like this : > {code} > val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, > blockSize, pageSize, dictPageSize, enableDictionary = true, true, > ParquetProperties.WriterVersion.PARQUET_2_0, configuration) > {code} > Then this piece of code : > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86 > Causes that DictionaryValuesWriter is used instead of the inferred > DeltaLongEncodingWriter. > The original issue is here : > https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-796) Delta Encoding is not used when dictionary enabled
[ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377382#comment-16377382 ] Ryan Blue commented on PARQUET-796: --- I don't recommend using the delta long encoding because I think we need to update to better encodings (specifically, the zig-zag-encoding ones in this branch). We could definitely use a better fallback, but I don't think the solution is to turn off dictionary encoding. If you can use dictionary encoding to get a smaller size, you should. The problem is when dictionary encoding needs to test whether another encoding would be better. It currently tests against plain and uses plain. We should have it test against a delta encoding and use one. This kind of improvement is why we added PARQUET-601. We want to be able to test out different ways of choosing an encoding at write time. But we do not want to make it so that users must specify their own encodings because we want Parquet to select them automatically and get the choice right. PARQUET-601 is about testing out strategies that we release as the defaults. > Delta Encoding is not used when dictionary enabled > -- > > Key: PARQUET-796 > URL: https://issues.apache.org/jira/browse/PARQUET-796 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Jakub Liska >Priority: Critical > Fix For: 1.9.1 > > > Current code doesn't enable using both Delta Encoding and Dictionary > Encoding. If I instantiate ParquetWriter like this : > {code} > val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, > blockSize, pageSize, dictPageSize, enableDictionary = true, true, > ParquetProperties.WriterVersion.PARQUET_2_0, configuration) > {code} > Then this piece of code : > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86 > Causes that DictionaryValuesWriter is used instead of the inferred > DeltaLongEncodingWriter. > The original issue is here : > https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
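A rough sketch of the fallback decision described in the comment, with entirely hypothetical names (parquet-mr's real mechanism lives in its ValuesWriter factory and fallback writers): the dictionary writer is kept while it stays smaller than some fallback encoder, and the point above is that the fallback it is compared against could be a delta encoding rather than plain.

{code:java}
// Hypothetical sketch only; these are not parquet-mr's real classes or method names.
interface EncoderSketch {
  long estimatedSize();
}

class FallbackChoiceSketch {
  /** Prefer dictionary encoding when it is smaller; otherwise use the fallback encoder.
   *  The comment argues the fallback should be a delta encoding rather than plain. */
  static EncoderSketch choose(EncoderSketch dictionary, EncoderSketch fallback,
                              boolean dictionaryOverflowed) {
    if (dictionaryOverflowed) {
      return fallback; // dictionary page grew past its size limit
    }
    return dictionary.estimatedSize() <= fallback.estimatedSize() ? dictionary : fallback;
  }
}
{code}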
[jira] [Commented] (PARQUET-1234) Release Parquet format 2.5.0
[ https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371749#comment-16371749 ] Ryan Blue commented on PARQUET-1234: Are we going to release a 2.4.1 with the changes for column index structures? I'd rather not wait on a resolution to PARQUET-1222 to get that out. > Release Parquet format 2.5.0 > > > Key: PARQUET-1234 > URL: https://issues.apache.org/jira/browse/PARQUET-1234 > Project: Parquet > Issue Type: Task > Components: parquet-format >Affects Versions: format-2.5.0 >Reporter: Gabor Szadovszky >Priority: Major > Fix For: format-2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (PARQUET-787) Add a size limit for heap allocations when reading
[ https://issues.apache.org/jira/browse/PARQUET-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved PARQUET-787. --- Resolution: Fixed Fix Version/s: 1.10.0 Merged #390. > Add a size limit for heap allocations when reading > -- > > Key: PARQUET-787 > URL: https://issues.apache.org/jira/browse/PARQUET-787 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.9.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 1.10.0 > > > [G1GC allocates humongous objects directly in the old > generation|https://www.infoq.com/articles/tuning-tips-G1-GC] to avoid > unnecessary copies, which means that these allocations aren't garbage > collected until a full GC runs. Humongous objects are objects that are 50% of > the region size or more. Region size is at most 32MB (see the table for > [region size from heap > size|http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection#Regions]). > Parquet currently allocates a huge buffer for each contiguous group of column > chunks, which in many cases is not garbage collected until a full GC. Adding > a size limit for the allocation size should allow users to break row groups > across multiple buffers so that buffers get collected when they have been > read. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
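A hedged sketch of the mitigation described above: cap each allocation (for example below half of a 32MB G1 region) and read a large byte range into a list of smaller buffers instead of one contiguous array. The names and the 8MB cap are illustrative, not parquet-mr's actual reader code.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative: read totalLength bytes into size-capped buffers so no single allocation
// becomes a G1 "humongous" object; each buffer can be collected once it has been consumed.
public class CappedReadSketch {
  static final int MAX_ALLOCATION = 8 * 1024 * 1024; // illustrative cap

  static List<ByteBuffer> readInChunks(InputStream in, long totalLength) throws IOException {
    List<ByteBuffer> buffers = new ArrayList<>();
    long remaining = totalLength;
    while (remaining > 0) {
      int size = (int) Math.min(remaining, MAX_ALLOCATION);
      byte[] chunk = new byte[size];
      int read = 0;
      while (read < size) {
        int n = in.read(chunk, read, size - read);
        if (n < 0) throw new IOException("unexpected end of stream");
        read += n;
      }
      buffers.add(ByteBuffer.wrap(chunk));
      remaining -= size;
    }
    return buffers;
  }
}
{code}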