[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276548#comment-17276548 ]

Ryan Blue commented on PARQUET-1968:


Thank you! I'm not sure why it was no longer on my calendar. I have the invite 
now and I plan to attend the sync on the 23rd. If you'd like, we can also set 
up a time to talk about this integration specifically, since it may take a 
while.

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support a native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654
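
A common workaround until a native In predicate exists is to expand the value 
set into a chain of or/eq predicates; the Spark conversion linked above follows 
a similar pattern. A minimal sketch with the current FilterApi (the helper and 
class names are hypothetical, not parquet-mr code):

{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.or;

import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.BinaryColumn;
import org.apache.parquet.io.api.Binary;

public class InPredicateSketch {
  // Emulate "column IN (v1, v2, ...)" by or-ing equality predicates.
  // A native In predicate could evaluate the whole set in one pass instead.
  public static FilterPredicate in(BinaryColumn column, Binary... values) {
    FilterPredicate result = eq(column, values[0]);
    for (int i = 1; i < values.length; i++) {
      result = or(result, eq(column, values[i]));
    }
    return result;
  }
}
{code}

The predicate tree grows linearly with the number of values, which is part of 
the motivation for a native In.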



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276526#comment-17276526 ]

Ryan Blue commented on PARQUET-1968:


I would really like to see a new Parquet API that can support some of the 
additional features we needed for Iceberg. I proposed adopting Iceberg's filter 
expressions a year or two ago, so I'm glad to see that the idea has some 
support from other PMC members. This is one reason why the API is in a separate 
module. I think we were planning to talk about this at the next Parquet sync, 
although I'm not sure when that will be.

FYI [~sha...@uber.com].

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support a native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-08-24 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183481#comment-17183481 ]

Ryan Blue commented on PARQUET-1901:


It isn't clear to me how a filter implementation would handle the filter itself 
being null. It could return a default value to accept/read, but that runs into 
issues when filters like {{not(null)}} are passed in. So I agree with Gabor 
that it makes sense for a null filter to be an exceptional case in the filter 
implementations themselves.

But I would expect a method like {{calculateRowRanges}} to correctly return the 
default {{RowRanges.createSingle(rowCount)}} if that method were passed a null 
value, since it is not actually processing the filter.

For Iceberg, I'm wondering if it wouldn't be easier to implement our own filter 
that produces row ranges and passes them in. That's how we filter row groups, 
and I think it has been much easier not to have to convert to Parquet filters, 
which are difficult to work with.
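
For reference, a minimal sketch of the null-safe behavior described above (a 
hypothetical wrapper around the internal API, assuming the 
RowRanges.createSingle factory mentioned in this thread is accessible; this is 
not the actual patch):

{code:java}
import java.util.Set;

import org.apache.parquet.filter2.compat.FilterCompat.Filter;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexStore;
import org.apache.parquet.internal.filter2.columnindex.RowRanges;

public class RowRangesSketch {
  // If no filter was set in ParquetReadOptions, every row in the row group
  // matches, so return the full range instead of dereferencing a null filter.
  static RowRanges rowRangesOrAll(Filter filter, ColumnIndexStore store,
      Set<ColumnPath> paths, long rowCount) {
    if (filter == null) {
      return RowRanges.createSingle(rowCount);
    }
    return ColumnIndexFilter.calculateRowRanges(filter, store, paths, rowCount);
  }
}
{code}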

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add a null check for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws an NPE when 
> ColumnIndex is enabled (the default) but no filter is set in the 
> ParquetReadOptions. The call stack is as below. 
> java.lang.NullPointerException
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>   at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
>   at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add it, the user might need to choose between calling 
> readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter 
> exists. 
> Thoughts?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1809) Add new APIs for nested predicate pushdown

2020-03-04 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051585#comment-17051585 ]

Ryan Blue commented on PARQUET-1809:


I think it should be fine to allow this. While there may be other problems when 
using `.` in names, the Spark PR that uses this shows that it works just fine 
to pass a string array instead of parsing a name.

>  Add new APIs for nested predicate pushdown
> ---
>
> Key: PARQUET-1809
> URL: https://issues.apache.org/jira/browse/PARQUET-1809
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: DB Tsai
>Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses a 
> *dot* to split a column name into the parts of a nested field. The drawback 
> is that this causes issues when a field name itself contains a *dot*.
> The new APIs will take an array of strings directly for the parts of a nested 
> field, so there is no confusion from using *dot* as a separator.  
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.
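
To make the ambiguity concrete, a small sketch (the array-taking overload is 
the proposal here, not an existing method):

{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.Operators.BinaryColumn;

public class NestedColumnSketch {
  public static void main(String[] args) {
    // Today this always means field "b" nested inside group "a"; there is no
    // way to refer to a top-level column whose name is literally "a.b".
    BinaryColumn nested = FilterApi.binaryColumn("a.b");

    // Proposed (hypothetical) overload taking the path parts directly:
    // BinaryColumn nested = FilterApi.binaryColumn("a", "b");            // nested field
    // BinaryColumn flat = FilterApi.binaryColumn(new String[] {"a.b"});  // literal name
    System.out.println(nested.getColumnPath());
  }
}
{code}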



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969493#comment-16969493 ]

Ryan Blue commented on PARQUET-1681:


Looks like it might be https://issues.apache.org/jira/browse/AVRO-2400.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet (1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read 
> throws the exception "XXX is not a group". Reading with parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           { "name": "phone_number", "type": ["null", "string"], "default": null }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below: 
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) incompatible: the name in the Avro 
> schema is 'phones_items', but the name in the Parquet schema is 'array'. It 
> therefore returns false, which caused the "phone_number" field in the above 
> schema to be treated as a group type, which it is not. The exception is then 
> thrown by .asGroupType(). 
> I didn't try whether writing via parquet 1.10.1 reproduces the same problem, 
> but it could, because the translation of an Avro schema to a Parquet schema 
> has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969491#comment-16969491 ]

Ryan Blue commented on PARQUET-1681:


I think we should be able to work around this instead of reverting PARQUET-651. 
If the compatibility check requires that the name matches, then we should be 
able to ensure that the name matches when converting the Parquet schema to Avro.
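
A minimal sketch of that workaround (a hypothetical helper, not the actual 
parquet-avro patch): rebuild the record converted from the Parquet schema under 
the expected record name so Avro's name-sensitive 
checkReaderWriterCompatibility() can pass.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;

public class RenameSketch {
  // Rebuild a record schema with a new name but the same fields; Avro fields
  // cannot be reused across schemas, so each one is copied.
  static Schema withName(Schema record, String newName) {
    List<Schema.Field> fields = new ArrayList<>();
    for (Schema.Field f : record.getFields()) {
      fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    return Schema.createRecord(newName, record.getDoc(), record.getNamespace(),
        record.isError(), fields);
  }
}
{code}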

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet (1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read 
> throws the exception "XXX is not a group". Reading with parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           { "name": "phone_number", "type": ["null", "string"], "default": null }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below: 
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) incompatible: the name in the Avro 
> schema is 'phones_items', but the name in the Parquet schema is 'array'. It 
> therefore returns false, which caused the "phone_number" field in the above 
> schema to be treated as a group type, which it is not. The exception is then 
> thrown by .asGroupType(). 
> I didn't try whether writing via parquet 1.10.1 reproduces the same problem, 
> but it could, because the translation of an Avro schema to a Parquet schema 
> has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2019-11-07 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969489#comment-16969489 ]

Ryan Blue commented on PARQUET-1681:


The Avro check should ignore record names if the record is the root. Has this 
check changed in Avro recently?

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet (1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read 
> throws the exception "XXX is not a group". Reading with parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           { "name": "phone_number", "type": ["null", "string"], "default": null }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below: 
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) incompatible: the name in the Avro 
> schema is 'phones_items', but the name in the Parquet schema is 'array'. It 
> therefore returns false, which caused the "phone_number" field in the above 
> schema to be treated as a group type, which it is not. The exception is then 
> thrown by .asGroupType(). 
> I didn't try whether writing via parquet 1.10.1 reproduces the same problem, 
> but it could, because the translation of an Avro schema to a Parquet schema 
> has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1685) Truncate the stored min and max for String statistics to reduce the footer size

2019-10-28 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961187#comment-16961187 ]

Ryan Blue commented on PARQUET-1685:


Looks like Gabor is right. The stats fields used for each column chunk (and 
page) are called min_value and max_value, so we should not truncate them. We 
will have to use the new indexes to add truncation. That's good because we want 
more people to look at that implementation and validate the work anyway.

Maybe we could add a flag for truncating the min and max values, as long as it 
is disabled by default and recorded in the file's key-value metadata.
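
For illustration, a sketch of truncation in the spirit of Iceberg's 
UnicodeUtil (hypothetical helpers, not parquet-mr code): a lower bound can be 
cut to a plain prefix, while an upper bound must be incremented after cutting 
so it still bounds the original value.

{code:java}
public class TruncateSketch {
  // Truncate a lower bound: a prefix is always <= the original value.
  static String truncateMin(String min, int maxCodePoints) {
    if (min.codePointCount(0, min.length()) <= maxCodePoints) {
      return min;
    }
    return min.substring(0, min.offsetByCodePoints(0, maxCodePoints));
  }

  // Truncate an upper bound: bump the last code point so the result is still
  // >= the original value. A real implementation must also handle overflow
  // when the last code point is Character.MAX_CODE_POINT.
  static String truncateMax(String max, int maxCodePoints) {
    String prefix = truncateMin(max, maxCodePoints);
    if (prefix.equals(max)) {
      return max;  // nothing was cut off, the bound is already exact
    }
    int last = prefix.codePointBefore(prefix.length());
    int cut = prefix.length() - Character.charCount(last);
    return prefix.substring(0, cut) + new String(Character.toChars(last + 1));
  }
}
{code}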

> Truncate the stored min and max for String statistics to reduce the footer 
> size 
> 
>
> Key: PARQUET-1685
> URL: https://issues.apache.org/jira/browse/PARQUET-1685
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min, max statistics to 
> minimize the metadata size. We can borrow to truncate them in Parquet also to 
> reduce the size of the footer, or even the page header. Here is the code in 
> IceBerg 
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-722) Building with JDK 8 fails over a maven bug

2019-08-20 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911797#comment-16911797 ]

Ryan Blue commented on PARQUET-722:
---

Looks like this was fixed when cascading3 support updated the 
maven-remote-resources-plugin: 
[https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397]

I've confirmed that copying that block into older versions also fixes the 
problem, so I'm going to mark this resolved.

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>Priority: Major
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in 
> Maven in combination with Java 8; see 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
> This bug has since been solved on the Maven side in maven-filtering 1.2:
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest 
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (PARQUET-722) Building with JDK 8 fails over a maven bug

2019-08-20 Thread Ryan Blue (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911797#comment-16911797 ]

Ryan Blue edited comment on PARQUET-722 at 8/20/19 10:59 PM:
-

Looks like this was fixed when cascading3 support updated the 
maven-remote-resources-plugin: 
[https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397]

I've confirmed that copying that block into older versions also fixes the 
problem.


was (Author: rdblue):
Looks like this was fixed when cascading3 support updated the 
maven-remote-resources-plugin: 
[https://github.com/apache/parquet-mr/blob/master/pom.xml#L390-L397]

I've confirmed that copying that block into older versions also fixes the 
problem, so I'm going to mark this resolved.

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>Priority: Major
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in 
> Maven in combination with Java 8; see 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
> This bug has since been solved on the Maven side in maven-filtering 1.2:
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest 
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (PARQUET-1434) Release parquet-mr 1.11.0

2019-07-23 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891462#comment-16891462 ]

Ryan Blue commented on PARQUET-1434:


My concern is that it has not been reviewed well enough to be confident that 
the write path implements the spec correctly. So there aren't specific issues 
to address.

I made suggestions on an integration test Zoltan wrote, but that wasn't 
committed to the Parquet repository. Getting that cleaned up and committed is 
the only thing I can think of for now.

> Release parquet-mr 1.11.0
> -
>
> Key: PARQUET-1434
> URL: https://issues.apache.org/jira/browse/PARQUET-1434
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Nandor Kollar
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (PARQUET-1488) UserDefinedPredicate throw NullPointerException

2019-07-12 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884204#comment-16884204 ]

Ryan Blue commented on PARQUET-1488:


We discussed this on SPARK-28371.

Previously, Parquet did not fail if a UserDefinedPredicate did not handle null 
values, so I think it is a regression that Parquet now causes previously 
working code to fail. I think it is correct for Parquet to call a UDP the way 
that it does, but Parquet should catch exceptions thrown by the predicate and 
should process the row group where the error was thrown. That way, Parquet 
keeps the optimization for columns that are all null, but it doesn't break 
existing code.

[~yumwang], would you like to submit a PR for this?
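
A minimal sketch of the behavior suggested above (a hypothetical wrapper, not 
the actual parquet-mr change): if a user-defined predicate throws while 
evaluating statistics, keep the row group instead of failing the read.

{code:java}
import org.apache.parquet.filter2.predicate.Statistics;
import org.apache.parquet.filter2.predicate.UserDefinedPredicate;

public class SafeCanDropSketch {
  // Returning false keeps the row group; dropping it on an error could
  // silently lose rows, and rethrowing breaks previously working predicates.
  static <T extends Comparable<T>> boolean canDropSafely(
      UserDefinedPredicate<T> udp, Statistics<T> stats) {
    try {
      return udp.canDrop(stats);
    } catch (RuntimeException e) {
      return false;
    }
  }
}
{code}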

> UserDefinedPredicate throw NullPointerException
> ---
>
> Key: PARQUET-1488
> URL: https://issues.apache.org/jira/browse/PARQUET-1488
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> It throws a {{NullPointerException}} after upgrading parquet to 1.11.0 when 
> using a {{UserDefinedPredicate}}.
> The 
> [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578]
>  is:
> {code:java}
> new UserDefinedPredicate[Binary] with Serializable {
>   private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
>   private val size = strToBinary.length
> 
>   override def canDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
>   }
> 
>   override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
>   }
> 
>   override def keep(value: Binary): Boolean = {
>     UTF8String.fromBytes(value.getBytes).startsWith(
>       UTF8String.fromBytes(strToBinary.getBytes))
>   }
> }
> {code}
> The stack trace is:
> {noformat}
> java.lang.NullPointerException
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
> {noformat}

[jira] [Assigned] (PARQUET-1488) UserDefinedPredicate throw NullPointerException

2019-07-12 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1488:
--

Assignee: Yuming Wang  (was: Gabor Szadovszky)

> UserDefinedPredicate throw NullPointerException
> ---
>
> Key: PARQUET-1488
> URL: https://issues.apache.org/jira/browse/PARQUET-1488
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> It throws a {{NullPointerException}} after upgrading parquet to 1.11.0 when 
> using a {{UserDefinedPredicate}}.
> The 
> [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578]
>  is:
> {code:java}
> new UserDefinedPredicate[Binary] with Serializable {
>   private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
>   private val size = strToBinary.length
> 
>   override def canDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
>   }
> 
>   override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
>   }
> 
>   override def keep(value: Binary): Boolean = {
>     UTF8String.fromBytes(value.getBytes).startsWith(
>       UTF8String.fromBytes(strToBinary.getBytes))
>   }
> }
> {code}
> The stack trace is:
> {noformat}
> java.lang.NullPointerException
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (PARQUET-1488) UserDefinedPredicate throw NullPointerException

2019-07-12 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reopened PARQUET-1488:


> UserDefinedPredicate throw NullPointerException
> ---
>
> Key: PARQUET-1488
> URL: https://issues.apache.org/jira/browse/PARQUET-1488
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Gabor Szadovszky
>Priority: Major
>
> It throws a {{NullPointerException}} after upgrading parquet to 1.11.0 when 
> using a {{UserDefinedPredicate}}.
> The 
> [UserDefinedPredicate|https://github.com/apache/spark/blob/faf73dcd33d04365c28c2846d3a1f845785f69df/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L548-L578]
>  is:
> {code:java}
> new UserDefinedPredicate[Binary] with Serializable {
>   private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
>   private val size = strToBinary.length
> 
>   override def canDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
>   }
> 
>   override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
>     val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
>     val max = statistics.getMax
>     val min = statistics.getMin
>     comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
>       comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
>   }
> 
>   override def keep(value: Binary): Boolean = {
>     UTF8String.fromBytes(value.getBytes).startsWith(
>       UTF8String.fromBytes(strToBinary.getBytes))
>   }
> }
> {code}
> The stack trace is:
> {noformat}
> java.lang.NullPointerException
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:573)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anon$1.keep(ParquetFilters.scala:552)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (PARQUET-1624) ParquetFileReader.open ignores Hadoop configuration options

2019-07-11 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1624:
--

 Summary: ParquetFileReader.open ignores Hadoop configuration 
options
 Key: PARQUET-1624
 URL: https://issues.apache.org/jira/browse/PARQUET-1624
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.0, 1.11.0
Reporter: Ryan Blue
Assignee: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (PARQUET-1142) Avoid leaking Hadoop API to downstream libraries

2019-02-22 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775448#comment-16775448 ]

Ryan Blue commented on PARQUET-1142:


The next steps for this are to get compression working without relying on 
Hadoop. After that, it is a matter of some fairly simple refactoring of the 
file writer. But that refactoring doesn't help much unless the compression 
implementations also don't depend on Hadoop.

> Avoid leaking Hadoop API to downstream libraries
> 
>
> Key: PARQUET-1142
> URL: https://issues.apache.org/jira/browse/PARQUET-1142
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> Parquet currently leaks the Hadoop API by requiring callers to pass {{Path}} 
> and {{Configuration}} instances, and by using Hadoop codecs. {{InputFile}} 
> and {{SeekableInputStream}} add alternatives to Hadoop classes in some parts 
> of the read path, but this needs to be extended to the write path and to 
> avoid passing options through {{Configuration}}.
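
For context, the read path already has the Hadoop-free abstraction; a minimal 
sketch (the file path is made up, and HadoopInputFile is used here only to 
obtain an InputFile; a custom InputFile implementation needs no Hadoop at all):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

public class InputFileSketch {
  public static void main(String[] args) throws Exception {
    // ParquetFileReader.open(InputFile) never sees a Hadoop Path directly.
    InputFile file = HadoopInputFile.fromPath(
        new Path("/tmp/data.parquet"), new Configuration());
    try (ParquetFileReader reader = ParquetFileReader.open(file)) {
      PageReadStore rowGroup = reader.readNextRowGroup();
      System.out.println("rows in first row group: " + rowGroup.getRowCount());
    }
  }
}
{code}

The write path has no equivalent yet, which is what the remaining refactoring 
(and Hadoop-free compression) would enable.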



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1281) Jackson dependency

2019-02-18 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1281.

Resolution: Not A Problem

> Jackson dependency
> --
>
> Key: PARQUET-1281
> URL: https://issues.apache.org/jira/browse/PARQUET-1281
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Qinghui Xu
>Priority: Major
>
> Currently we shade jackson in the parquet-jackson module (org.codehaus.jackson 
> --> shaded.parquet.org.codehaus.jackson), but in fact we do not use the 
> shaded jackson in parquet-hadoop code. Is that a mistake? (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L26)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1512) Release Parquet Java 1.10.1

2019-02-04 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1512.

Resolution: Fixed

> Release Parquet Java 1.10.1
> ---
>
> Key: PARQUET-1512
> URL: https://issues.apache.org/jira/browse/PARQUET-1512
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.1
>
>
> This is an umbrella issue to track the 1.10.1 release. Please link issues to 
> include in the release as blockers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas

2019-02-01 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-138:
-

Assignee: Nicolas Trinquier  (was: Ryan Blue)

> Parquet should allow a merge between required and optional schemas
> --
>
> Key: PARQUET-138
> URL: https://issues.apache.org/jira/browse/PARQUET-138
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Robert Justice
>Assignee: Nicolas Trinquier
>Priority: Major
>  Labels: pull-request-available
>
> In a discussion with Ryan, he felt we should be able to merge a required 
> binary into an optional binary, and the resulting schema would be optional:
> https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java
> {code:java}
> try {
>   t3.union(t4);
>   fail("moving from optional to required");
> } catch (IncompatibleSchemaModificationException e) {
>   assertEquals("repetition constraint is more restrictive: can not merge 
> type required binary a into optional binary a", e.getMessage());
> }
> {code}
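
The desired semantics, as a sketch (the expected behavior argued for above, not 
what parquet-mr currently does; today this union call throws):

{code:java}
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;

public class UnionSketch {
  public static void main(String[] args) {
    MessageType optional = Types.buildMessage().optional(BINARY).named("a").named("schema");
    MessageType required = Types.buildMessage().required(BINARY).named("a").named("schema");
    // Desired: merging a required field into an optional one relaxes the
    // result to optional instead of throwing
    // IncompatibleSchemaModificationException.
    MessageType merged = optional.union(required);
    System.out.println(merged);
  }
}
{code}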



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas

2019-02-01 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-138:
-

Assignee: Nicolas Trinquier  (was: Nicolas Trinquier)

> Parquet should allow a merge between required and optional schemas
> --
>
> Key: PARQUET-138
> URL: https://issues.apache.org/jira/browse/PARQUET-138
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Robert Justice
>Assignee: Nicolas Trinquier
>Priority: Major
>  Labels: pull-request-available
>
> In a discussion with Ryan, he felt we should be able to merge a required 
> binary into an optional binary, and the resulting schema would be optional:
> https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java
> {code:java}
> try {
>   t3.union(t4);
>   fail("moving from optional to required");
> } catch (IncompatibleSchemaModificationException e) {
>   assertEquals("repetition constraint is more restrictive: can not merge 
> type required binary a into optional binary a", e.getMessage());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-138) Parquet should allow a merge between required and optional schemas

2019-02-01 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-138:
-

Assignee: Ryan Blue

> Parquet should allow a merge between required and optional schemas
> --
>
> Key: PARQUET-138
> URL: https://issues.apache.org/jira/browse/PARQUET-138
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Robert Justice
>Assignee: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
>
> In a discussion with Ryan, he felt we should be able to merge a required 
> binary into an optional binary, and the resulting schema would be optional:
> https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/test/java/parquet/schema/TestMessageType.java
> {code:java}
> try {
>   t3.union(t4);
>   fail("moving from optional to required");
> } catch (IncompatibleSchemaModificationException e) {
>   assertEquals("repetition constraint is more restrictive: can not merge 
> type required binary a into optional binary a", e.getMessage());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1520) Update README to use correct build and version info

2019-01-31 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757679#comment-16757679 ]

Ryan Blue commented on PARQUET-1520:


Thanks for contributing!

> Update README to use correct build and version info
> ---
>
> Key: PARQUET-1520
> URL: https://issues.apache.org/jira/browse/PARQUET-1520
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1520) Update README to use correct build and version info

2019-01-31 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1520:
--

Assignee: Dongjoon Hyun

> Update README to use correct build and version info
> ---
>
> Key: PARQUET-1520
> URL: https://issues.apache.org/jira/browse/PARQUET-1520
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1520) Update README to use correct build and version info

2019-01-31 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1520.

   Resolution: Fixed
Fix Version/s: 1.10.2

> Update README to use correct build and version info
> ---
>
> Key: PARQUET-1520
> URL: https://issues.apache.org/jira/browse/PARQUET-1520
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-28 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1510.

Resolution: Fixed

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 1.11.0, 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.
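
A sketch of why the dictionary path drops the null row (an illustration of the 
bug, not parquet-mr code): dictionaries never contain null, so evaluating 
not-equals over dictionary values alone concludes that no row can match.

{code:java}
import java.util.Set;

import org.apache.parquet.io.api.Binary;

public class DictionaryNullSketch {
  public static void main(String[] args) {
    // The page's dictionary for Seq(Some("A"), Some("A"), None): just "A".
    Set<Binary> dictionary = Set.of(Binary.fromString("A"));
    boolean someValueDiffers = dictionary.stream()
        .anyMatch(v -> !v.equals(Binary.fromString("A")));  // false => block dropped
    // A correct evaluation must also keep the block when the column contains
    // nulls, e.g. when the column chunk statistics report getNumNulls() > 0.
    System.out.println("keep block based on dictionary alone: " + someValueDiffers);
  }
}
{code}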



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1509) Update Docs for Hive Deprecation

2019-01-27 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1509.

Resolution: Fixed

> Update Docs for Hive Deprecation
> 
>
> Key: PARQUET-1509
> URL: https://issues.apache.org/jira/browse/PARQUET-1509
> Project: Parquet
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
>  Labels: pull-request-available
>
> Update docs to state that Hive integration is now deprecated. [PARQUET-1447]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1509) Update Docs for Hive Deprecation

2019-01-27 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1509:
--

Assignee: BELUGA BEHR

> Update Docs for Hive Deprecation
> 
>
> Key: PARQUET-1509
> URL: https://issues.apache.org/jira/browse/PARQUET-1509
> Project: Parquet
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
>  Labels: pull-request-available
>
> Update docs to state that Hive integration is now deprecated. [PARQUET-1447]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1513) HiddenFileFilter Streamline

2019-01-27 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1513.

   Resolution: Fixed
Fix Version/s: 1.12.0

> HiddenFileFilter Streamline
> ---
>
> Key: PARQUET-1513
> URL: https://issues.apache.org/jira/browse/PARQUET-1513
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> {code:java}
>   public boolean accept(Path p) {
> return !p.getName().startsWith("_") && !p.getName().startsWith(".");
>   }
> {code}
> This can be streamlined a bit further.
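
One possible streamlining, as a sketch (not necessarily the committed change; 
it assumes path names are never empty):

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class HiddenFileFilterSketch implements PathFilter {
  // Inspect the first character once instead of calling startsWith() twice.
  @Override
  public boolean accept(Path p) {
    char first = p.getName().charAt(0);
    return first != '_' && first != '.';
  }
}
{code}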



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1513) HiddenFileFilter Streamline

2019-01-27 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1513:
--

Assignee: BELUGA BEHR

> HiddenFileFilter Streamline
> ---
>
> Key: PARQUET-1513
> URL: https://issues.apache.org/jira/browse/PARQUET-1513
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
>  Labels: pull-request-available
>
> {code:java}
>   public boolean accept(Path p) {
> return !p.getName().startsWith("_") && !p.getName().startsWith(".");
>   }
> {code}
> This can be streamlined a bit further.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-27 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1510:
--

Assignee: Ryan Blue

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 1.11.0, 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Issue Type: Bug  (was: Improvement)

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Affects Version/s: 1.9.1
   1.9.0
   1.10.0

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness, pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752641#comment-16752641 ]

Ryan Blue commented on PARQUET-1510:


Fixed metadata.

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 1.11.0, 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Labels: correctness pull-request-available  (was: pull-request-available)

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness, pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1512) Release Parquet Java 1.10.1

2019-01-25 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1512:
--

 Summary: Release Parquet Java 1.10.1
 Key: PARQUET-1512
 URL: https://issues.apache.org/jira/browse/PARQUET-1512
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.10.1


This is an umbrella issue to track the 1.10.1 release. Please link issues to 
include in the release as blockers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Priority: Blocker  (was: Major)

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Fix Version/s: 1.10.1

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Fix Version/s: 1.11.0

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.10.0, 1.9.1
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 1.11.0, 1.10.1
>
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
---
Component/s: parquet-mr

> Dictionary filter skips null values when evaluating not-equals.
> ---
>
> Key: PARQUET-1510
> URL: https://issues.apache.org/jira/browse/PARQUET-1510
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ryan Blue
>Priority: Major
>  Labels: pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1510) Dictionary filter skips null values when evaluating not-equals.

2019-01-25 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1510:
--

 Summary: Dictionary filter skips null values when evaluating 
not-equals.
 Key: PARQUET-1510
 URL: https://issues.apache.org/jira/browse/PARQUET-1510
 Project: Parquet
  Issue Type: Improvement
Reporter: Ryan Blue


This was discovered in Spark, see SPARK-26677. From the Spark PR:

{code}
// Repeat the values to get dictionary encoding.
Seq(Some("A"), Some("A"), 
None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
+-+
|value|
+-+
+-+
{code}

{code}
// Use plain encoding.
Seq(Some("A"), 
None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
+-+
|value|
+-+
| null|
+-+
{code}

This is a correctness issue.
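
To make the failure mode concrete: dictionaries never contain nulls, so a not-equals check that only consults the dictionary concludes that every row equals 'A' and drops the whole block, silently losing the null row. A minimal sketch of the corrected drop condition (illustrative names and inputs, not the actual DictionaryFilter code):

{code:java}
import java.util.Set;

final class NotEqDropCheck {
  // The block may only be skipped when every dictionary entry equals `value`
  // AND the chunk contains no nulls: a null row satisfies NOT (value <=> 'A'),
  // but nulls never appear in the dictionary. `dictValues` and `numNulls`
  // (from the chunk statistics) are illustrative inputs.
  static <T> boolean canDrop(T value, Set<T> dictValues, long numNulls) {
    return dictValues.size() == 1 && dictValues.contains(value) && numNulls == 0;
  }
}
{code}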



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1447) MapredParquetOutputFormat - Save Some Array Allocations

2019-01-08 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737333#comment-16737333
 ] 

Ryan Blue commented on PARQUET-1447:


I'd be happy to merge a PR!

> MapredParquetOutputFormat - Save Some Array Allocations
> ---
>
> Key: PARQUET-1447
> URL: https://issues.apache.org/jira/browse/PARQUET-1447
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: BELUGA BEHR
>Assignee: Ryan Blue
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1447) MapredParquetOutputFormat - Save Some Array Allocations

2019-01-07 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue reassigned PARQUET-1447:
--

Resolution: Won't Fix
  Assignee: Ryan Blue

I'm closing this because these classes are now maintained in Hive, not Parquet.

> MapredParquetOutputFormat - Save Some Array Allocations
> ---
>
> Key: PARQUET-1447
> URL: https://issues.apache.org/jira/browse/PARQUET-1447
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: BELUGA BEHR
>Assignee: Ryan Blue
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1465) CLONE - Add a way to append encoded blocks in ParquetFileWriter

2018-11-29 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1465.

Resolution: Fixed

See PARQUET-382.

> CLONE - Add a way to append encoded blocks in ParquetFileWriter
> ---
>
> Key: PARQUET-1465
> URL: https://issues.apache.org/jira/browse/PARQUET-1465
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.8.0
>Reporter: Steven Paster
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.8.2, 1.9.0
>
>
> Concatenating two files together currently requires reading the source files 
> and rewriting the content from scratch. This ends up taking a lot of memory, 
> even if the data is already encoded correctly and blocks just need to be 
> appended and have their metadata updated. Merging two files should be fast 
> and not take much memory.
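
For reference, PARQUET-382 (which resolves this clone) added exactly this capability to ParquetFileWriter. A hedged sketch of row-group-level concatenation using that API (signatures as I recall them from parquet-mr, so treat as approximate; all inputs must share the same schema):

{code:java}
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

final class AppendBlocksExample {
  // Concatenates already-encoded files by copying their row groups as-is,
  // avoiding the decode/re-encode pass (and the memory) described above.
  static void concat(Configuration conf, MessageType schema, Path out, Path... inputs)
      throws java.io.IOException {
    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, out);
    writer.start();
    for (Path in : inputs) {
      writer.appendFile(HadoopInputFile.fromPath(in, conf)); // copies encoded row groups
    }
    writer.end(Collections.<String, String>emptyMap());
  }
}
{code}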



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-19 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1407.

Resolution: Fixed
  Assignee: Nandor Kollar

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Assignee: Nandor Kollar
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
>   .<GenericData.Record>builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader<GenericRecord> reader = AvroParquetReader
> .<GenericRecord>builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in spark 
> unexpectedly.  (Spark 2.3.1 and Parquet 1.8.3).   I have not tried to 
> reproduce with parquet 1.9.0, but it's a bad enough bug that I would like a 
> 1.8.4 release that I can drop-in replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-15 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688796#comment-16688796
 ] 

Ryan Blue commented on PARQUET-1407:


[~scottcarey], [~jackytan], sorry for the delay. I didn't see this issue until 
now.

I've posted a PR that should fix it. I haven't written a test for it. If you 
want to pick that commit and submit a PR with a test, that would be a great way 
to contribute! If not, I'll get it done sometime soon and this can be fixed in 
1.11.0. Thanks!
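
In the meantime, the reporter's ByteBuffer.duplicate() clue below suggests a reader-side work-around; a minimal sketch (not the parquet-avro fix itself):

{code:java}
import java.nio.ByteBuffer;
import org.apache.avro.generic.GenericRecord;

final class SafeByteRead {
  // Reads the "value" field through a duplicate, so the position/limit of
  // the (possibly recycled) backing ByteBuffer are never mutated and later
  // reads of the same buffer are not left drained.
  static byte[] valueBytes(GenericRecord rec) {
    ByteBuffer view = ((ByteBuffer) rec.get("value")).duplicate(); // same bytes, independent position
    byte[] bytes = new byte[view.remaining()];
    view.get(bytes); // does not advance the shared buffer
    return bytes;
  }
}
{code}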

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>  Labels: pull-request-available
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
>   .<GenericData.Record>builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader<GenericRecord> reader = AvroParquetReader
> .<GenericRecord>builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in spark 
> unexpectedly.  (Spark 2.3.1 and Parquet 1.8.3).   I have not tried to 
> reproduce with parquet 1.9.0, but it's a bad enough bug that I would like a 
> 1.8.4 release that I can drop-in replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1407) Data loss on duplicate values with AvroParquetWriter/Reader

2018-11-15 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1407:
---
Affects Version/s: 1.10.0

> Data loss on duplicate values with AvroParquetWriter/Reader
> ---
>
> Key: PARQUET-1407
> URL: https://issues.apache.org/jira/browse/PARQUET-1407
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.9.0, 1.10.0, 1.8.3
>Reporter: Scott Carey
>Priority: Critical
>
> {code:java}
> public class Blah {
>   private static Path parquetFile = new Path("oops");
>   private static Schema schema = SchemaBuilder.record("spark_schema")
>   .fields().optionalBytes("value").endRecord();
>   private static GenericData.Record recordFor(String value) {
> return new GenericRecordBuilder(schema)
> .set("value", value.getBytes()).build();
>   }
>   public static void main(String ... args) throws IOException {
> try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
>   .<GenericData.Record>builder(parquetFile)
>   .withSchema(schema)
>   .build()) {
>   writer.write(recordFor("one"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("three"));
>   writer.write(recordFor("two"));
>   writer.write(recordFor("one"));
>   writer.write(recordFor("zero"));
> }
> try (ParquetReader<GenericRecord> reader = AvroParquetReader
> .<GenericRecord>builder(parquetFile)
> .withConf(new Configuration()).build()) {
>   GenericRecord rec;
>   int i = 0;
>   while ((rec = reader.read()) != null) {
> ByteBuffer buf = (ByteBuffer) rec.get("value");
> byte[] bytes = new byte[buf.remaining()];
> buf.get(bytes);
> System.out.println("rec " + i++ + ": " + new String(bytes));
>   }
> }
>   }
> }
> {code}
> Expected output:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: three
> rec 4: two
> rec 5: one
> rec 6: zero{noformat}
> Actual:
> {noformat}
> rec 0: one
> rec 1: two
> rec 2: three
> rec 3: 
> rec 4: 
> rec 5: 
> rec 6: zero{noformat}
>  
> This was found when we started getting empty byte[] values back in spark 
> unexpectedly.  (Spark 2.3.1 and Parquet 1.8.3).   I have not tried to 
> reproduce with parquet 1.9.0, but it's a bad enough bug that I would like a 
> 1.8.4 release that I can drop-in replace 1.8.3 without any binary 
> compatibility issues.
>  Duplicate byte[] values are lost.
>  
> A few clues: 
> If I do not call ByteBuffer.get, the size of ByteBuffer.remaining does not go 
> to zero.  I suspect a ByteBuffer is being recycled, but the call to 
> ByteBuffer.get mutates it.  I wonder if an appropriately placed 
> ByteBuffer.duplicate() would fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1457) Data set integrity tool

2018-11-12 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684275#comment-16684275
 ] 

Ryan Blue commented on PARQUET-1457:


[~gershinsky], this sounds like a reasonable extension to a table format and 
not really something that I think Parquet should be doing.

What do you think about coming up with a proposal for snapshot integrity for 
[Iceberg|https://github.com/Netflix/iceberg]?

> Data set integrity tool
> ---
>
> Key: PARQUET-1457
> URL: https://issues.apache.org/jira/browse/PARQUET-1457
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Parquet encryption protects integrity of individual files. However, data sets 
> (such as tables) are often written as a collection of files, say
> "/path/to/dataset"/part0.parquet.encrypted
> ..
> "/path/to/dataset"/partN.parquet.encrypted
>  
> In untrusted storage, removal of one or more files will go unnoticed. 
> Replacement of one file's contents with another's will go unnoticed, unless a 
> user has provided unique AAD prefixes for each file.
>  
> The data set integrity tool solves these problems. While it doesn't 
> necessarily belong in Parquet itself (which focuses on individual files), it 
> will assist higher-level frameworks that use Parquet to cryptographically 
> protect the integrity of data sets comprised of multiple files.
> The use of this tool is not obligatory, as frameworks can use other means to 
> verify table (file collection) integrity.
>  
> The tool works by creating a small file, that can be stored as say
> "/path/to/dataset"/.dataset.signature
>  
> that contains the dataset unique name (URI) and the number of files (N). The 
> file contents are either encrypted with AES-GCM (authenticated, encrypted) or 
> hashed and signed (authenticated, plaintext). A private key is issued for 
> each dataset.
>  
> On the writer side, the tool creates AAD prefixes for every data file, and 
> creates the signature file itself. The input is the dataset URI, N and the 
> encryption/signature key.
>  
> On the reader side, the tool parses and verifies the signature file, and 
> provides the framework with the verified dataset name, number of files that 
> must be accounted for, and the AAD prefix for each file. The input is the 
> expected dataset URI and the encryption/signature key.
>  
>  
>  
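
Purely as an illustration of the proposal's encrypted variant (this tool was never part of Parquet; names and file layout here are assumptions): sealing the dataset URI and file count with AES-GCM makes removal or replacement of files detectable on the reader side.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class DatasetSignature {
  // Seals "datasetURI|numFiles" so that tampering with either is detected
  // when the reader decrypts and verifies the GCM tag.
  static byte[] seal(SecretKey key, String datasetUri, int numFiles) throws Exception {
    byte[] plaintext = (datasetUri + "|" + numFiles).getBytes(StandardCharsets.UTF_8);
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);
    Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
    gcm.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] sealed = gcm.doFinal(plaintext);
    byte[] out = new byte[iv.length + sealed.length];
    System.arraycopy(iv, 0, out, 0, iv.length);
    System.arraycopy(sealed, 0, out, iv.length, sealed.length);
    return out; // would be written to "/path/to/dataset"/.dataset.signature
  }
}
{code}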



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1414) Limit page size based on maximum row count

2018-10-17 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653995#comment-16653995
 ] 

Ryan Blue commented on PARQUET-1414:


[~gszadovszky], can you add a link to your benchmarks to this issue?

I think the conclusion we came to while discussing was between 10k and 20k, 
with 20k being the better choice for overall file size. Is 20k the planned 
default now?

> Limit page size based on maximum row count
> --
>
> Key: PARQUET-1414
> URL: https://issues.apache.org/jira/browse/PARQUET-1414
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.0
>
>
> For column-index-based filtering it is important to have enough pages per 
> column. When an encoding matches the data perfectly, all of the values may 
> fit in a single page (e.g. a column holding an ascending counter).
> With this improvement we would be able to limit pages by the maximum number 
> of rows written to each one, so that every column has enough pages. A good 
> default value should be benchmarked; initially, we can use 10k.
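
For context, the limit described above ended up as a writer-side knob. A hedged sketch assuming the 1.11.0 builder API (withPageRowCountLimit; the matching Hadoop property is, as far as I can tell, parquet.page.row.count.limit) and the 20k value discussed in the comment:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

final class PageRowCountLimitExample {
  static ParquetWriter<Group> open(Path file) throws java.io.IOException {
    // An ascending counter is the worst case named above: delta encoding
    // could otherwise pack the whole chunk into a single page.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message m { required int64 counter; }");
    return ExampleParquetWriter.builder(file)
        .withType(schema)
        .withPageRowCountLimit(20_000) // cap rows per page so each column gets multiple pages
        .build();
  }
}
{code}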



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1432) ACID support

2018-10-01 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634298#comment-16634298
 ] 

Ryan Blue commented on PARQUET-1432:


[~yumwang], ACID guarantees are a feature of the table layout, not the file 
format. I don't think Parquet needs to do anything differently to support this. 
What are you proposing to change in Parquet?

> ACID support
> 
>
> Key: PARQUET-1432
> URL: https://issues.apache.org/jira/browse/PARQUET-1432
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Affects Versions: 1.10.1
>Reporter: Yuming Wang
>Priority: Major
>
> https://orc.apache.org/docs/acid.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1201) Column indexes

2018-09-27 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631130#comment-16631130
 ] 

Ryan Blue commented on PARQUET-1201:


[~gszadovszky], where is the branch for page skipping? Is it this one? 
https://github.com/apache/parquet-mr/tree/column-indexes

I just went to review it, but I don't see a PR. Could you open one against 
master?

> Column indexes
> --
>
> Key: PARQUET-1201
> URL: https://issues.apache.org/jira/browse/PARQUET-1201
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>
> Write the column indexes described in PARQUET-922.
>  This is the first phase of implementing the whole feature. The 
> implementation is done in the following steps:
>  * Utility to read/write indexes in parquet-format
>  * Writing indexes in the parquet file
>  * Extend parquet-tools and parquet-cli to show the indexes
>  * Limit index size based on parquet properties
>  * Trim min/max values where possible based on parquet properties
>  * Filtering based on column indexes
> The work is done on the feature branch {{column-indexes}}. This JIRA will be 
> resolved after the branch has been merged to {{master}}.
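
Once the branch merged, the indexes became readable through ParquetFileReader. A hedged sketch, assuming the 1.11.0 API names (readColumnIndex/readOffsetIndex) rather than quoting the feature branch:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

final class ColumnIndexDump {
  static void dump(Path file, Configuration conf) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData chunk : block.getColumns()) {
          ColumnIndex ci = reader.readColumnIndex(chunk);  // per-page min/max and null counts
          OffsetIndex oi = reader.readOffsetIndex(chunk);  // page locations, enabling row skipping
          System.out.println(chunk.getPath()
              + ": column index " + (ci != null ? "present" : "absent")
              + ", offset index " + (oi != null ? "present" : "absent"));
        }
      }
    }
  }
}
{code}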



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR

2018-08-15 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581275#comment-16581275
 ] 

Ryan Blue commented on PARQUET-632:
---

[~pkgajulapalli], can you go ahead and post the stack trace? I thought you said 
you were using 2.2.0. These classes definitely changed.

> Parquet file in invalid state while writing to S3 from EMR
> --
>
> Key: PARQUET-632
> URL: https://issues.apache.org/jira/browse/PARQUET-632
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Peter Halliday
>Priority: Blocker
>
> I'm writing parquet to S3 from Spark 1.6.1 on EMR.  And when it got to the 
> last few files to write to S3, I received this stacktrace in the log with no 
> other errors before or after it.  It's very consistent.  This particular 
> batch keeps erroring the same way.
> {noformat}
> [2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager 
> [task-result-getter-2] - Lost task 3737.0 in stage 2.0 (TID 10585, 
> ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed 
> while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid 
> state. Probably caused by an error thrown previously. Current state: COLUMN
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
>   ... 8 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR

2018-08-14 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579966#comment-16579966
 ] 

Ryan Blue commented on PARQUET-632:
---

[~pkgajulapalli], there isn't enough information here to know what's happening. 
Can you post the schema of the dataframe you're writing and a stack trace?

> Parquet file in invalid state while writing to S3 from EMR
> --
>
> Key: PARQUET-632
> URL: https://issues.apache.org/jira/browse/PARQUET-632
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Peter Halliday
>Priority: Blocker
>
> I'm writing parquet to S3 from Spark 1.6.1 on EMR.  And when it got to the 
> last few files to write to S3, I received this stacktrace in the log with no 
> other errors before or after it.  It's very consistent.  This particular 
> batch keeps erroring the same way.
> {noformat}
> [2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager 
> [task-result-getter-2] - Lost task 3737.0 in stage 2.0 (TID 10585, 
> ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed 
> while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid 
> state. Probably caused by an error thrown previously. Current state: COLUMN
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
>   ... 8 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1341) Null count is suppressed when columns have no min or max and use unsigned sort order

2018-06-28 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1341:
--

 Summary: Null count is suppressed when columns have no min or max 
and use unsigned sort order
 Key: PARQUET-1341
 URL: https://issues.apache.org/jira/browse/PARQUET-1341
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.10.1






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-381) It should be possible to merge summary files, and control which files are generated

2018-05-25 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-381:
--
Fix Version/s: (was: 2.0.0)
   1.9.0

> It should be possible to merge summary files, and control which files are 
> generated
> ---
>
> Key: PARQUET-381
> URL: https://issues.apache.org/jira/browse/PARQUET-381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Alex Levenson
>Assignee: Alex Levenson
>Priority: Major
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-381) It should be possible to merge summary files, and control which files are generated

2018-05-25 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491350#comment-16491350
 ] 

Ryan Blue commented on PARQUET-381:
---

Fixed. Thanks for pointing this out.

> It should be possible to merge summary files, and control which files are 
> generated
> ---
>
> Key: PARQUET-381
> URL: https://issues.apache.org/jira/browse/PARQUET-381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Alex Levenson
>Assignee: Alex Levenson
>Priority: Major
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties

2018-05-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1309:
---
Description: In SPARK-24251, we found that the changes to use 
HadoopReadOptions accidentally switched the [properties that enable stats and 
dictionary 
filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
 Both are enabled by default so it is unlikely that anyone will need to turn 
them off and there is an easy work-around, but we should fix the properties for 
1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 
1.8.x).  (was: In SPARK-24251, we found that the changes to use 
HadoopReadOptions accidentally switched the [properties that enable stats and 
dictionary 
filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
 Both are enabled by default so it is unlikely that anyone will need to turn 
them off and there is an easy work-around, but we should fix the properties for 
1.10.0. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 
1.8.x).)

> Parquet Java uses incorrect stats and dictionary filter properties
> --
>
> Key: PARQUET-1309
> URL: https://issues.apache.org/jira/browse/PARQUET-1309
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 1.10.1
>
>
> In SPARK-24251, we found that the changes to use HadoopReadOptions 
> accidentally switched the [properties that enable stats and dictionary 
> filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
>  Both are enabled by default so it is unlikely that anyone will need to turn 
> them off and there is an easy work-around, but we should fix the properties 
> for 1.10.1. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is 
> on 1.8.x).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1309) Parquet Java uses incorrect stats and dictionary filter properties

2018-05-24 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1309:
--

 Summary: Parquet Java uses incorrect stats and dictionary filter 
properties
 Key: PARQUET-1309
 URL: https://issues.apache.org/jira/browse/PARQUET-1309
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Ryan Blue
 Fix For: 1.10.1


In SPARK-24251, we found that the changes to use HadoopReadOptions accidentally 
switched the [properties that enable stats and dictionary 
filters|https://github.com/apache/parquet-mr/blob/8bbc6cb95fd9b4b9e86c924ca1e40fd555ecac1d/parquet-hadoop/src/main/java/org/apache/parquet/HadoopReadOptions.java#L83].
 Both are enabled by default so it is unlikely that anyone will need to turn 
them off and there is an easy work-around, but we should fix the properties for 
1.10.0. This doesn't affect the 1.8.x or 1.9.x releases (Spark 2.3.x is on 
1.8.x).
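
A sketch of the easy work-around mentioned above: since the two property names were swapped when building read options, setting both keys to the same value makes the mix-up harmless (constants from ParquetInputFormat; shown as I understand the work-around, not as the fix).

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetInputFormat;

final class FilterPropsWorkaround {
  static Configuration withFiltersEnabled(Configuration conf) {
    // Both flags default to true; pinning both explicitly to the same value
    // side-steps the swapped lookup in HadoopReadOptions.
    conf.setBoolean(ParquetInputFormat.STATS_FILTERING_ENABLED, true);
    conf.setBoolean(ParquetInputFormat.DICTIONARY_FILTERING_ENABLED, true);
    return conf;
  }
}
{code}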



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1295) Parquet libraries do not follow proper semantic versioning

2018-05-21 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483153#comment-16483153
 ] 

Ryan Blue commented on PARQUET-1295:


Since there is not a well-defined public API, I understand how it is annoying 
to find out that some classes are internal. But the APIs referenced here are 
definitely something that we've always considered internal.

We use 1.7.0 for semver checks because that's the oldest release that we want 
public API compatibility with (even though "public API" is not well defined). 
We only add exclusions for private classes when they change, so growing this 
list is essentially marking APIs private as we make changes. I think that's 
worth keeping rather than needing to add to the list when we make changes to a 
private class each release.

> Parquet libraries do not follow proper semantic versioning
> --
>
> Key: PARQUET-1295
> URL: https://issues.apache.org/jira/browse/PARQUET-1295
> Project: Parquet
>  Issue Type: Bug
>Reporter: Vlad Rozov
>Priority: Major
>
> There are changes between 1.8.0 and 1.10.0 that break API compatibility. A 
> minor version change is supposed to be backward compatible with 1.9.0 and 
> 1.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1189) Release Parquet Java 1.10

2018-04-20 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1189.

Resolution: Fixed

> Release Parquet Java 1.10
> -
>
> Key: PARQUET-1189
> URL: https://issues.apache.org/jira/browse/PARQUET-1189
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> Please link needed issues as blockers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1264) Update Javadoc for Java 1.8

2018-04-05 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1264.

Resolution: Fixed

> Update Javadoc for Java 1.8
> ---
>
> Key: PARQUET-1264
> URL: https://issues.apache.org/jira/browse/PARQUET-1264
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> After moving the build to Java 1.8, the release procedure no longer works 
> because Javadoc generation fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-04 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425994#comment-16425994
 ] 

Ryan Blue commented on PARQUET-1253:


Not including the UUID logical type in that union is probably an accident.

MAP_KEY_VALUE is no longer used. It is noted in [backward compatibility 
rules|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1],
 but is not required for any types.

The [comment "only valid for 
primitives"|https://github.com/apache/parquet-format/blob/apache-parquet-format-2.5.0/src/main/thrift/parquet.thrift#L384]
 is incorrect. I think we can remove it. I'm not sure why the comment was there.

> Support for new logical type representation
> ---
>
> Key: PARQUET-1253
> URL: https://issues.apache.org/jira/browse/PARQUET-1253
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Latest parquet-format 
> [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252]
>  a new representation for logical types. As of now this is not yet supported 
> in parquet-mr, thus there's no way to use parametrized UTC normalized 
> timestamp data types. When reading and writing Parquet files, besides 
> 'converted_type' parquet-mr should use the new 'logicalType' field in 
> SchemaElement to tell the current logical type annotation. To maintain 
> backward compatibility, the semantics of converted_type shouldn't change.
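
For a sense of what the new representation enables once supported, a hedged sketch using the logical-type API as it later landed in parquet-mr (treat exact names and availability as assumptions): a parametrized, UTC-normalized timestamp that converted_type alone cannot express.

{code:java}
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Types;

final class NewLogicalTypeExample {
  static PrimitiveType utcMicrosTimestamp() {
    // isAdjustedToUTC=true and the MICROS unit are parameters carried by the
    // new logicalType field in SchemaElement, not by converted_type.
    return Types.required(INT64)
        .as(LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS))
        .named("event_time");
  }
}
{code}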



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1264) Update Javadoc for Java 1.8

2018-03-30 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1264:
--

 Summary: Update Javadoc for Java 1.8
 Key: PARQUET-1264
 URL: https://issues.apache.org/jira/browse/PARQUET-1264
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.9.0
Reporter: Ryan Blue
Assignee: Ryan Blue
 Fix For: 1.10.0


After moving the build to Java 1.8, the release procedure no longer works 
because Javadoc generation fails.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1263.

Resolution: Fixed
  Assignee: Ryan Blue

Merged #464.

> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, ParquetHadoopOptions should be based 
> on that configuration instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1183) AvroParquetWriter needs OutputFile based Builder

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1183.

Resolution: Fixed
  Assignee: Ryan Blue

Merged #460. Thanks [~zi] for reviewing!

> AvroParquetWriter needs OutputFile based Builder
> 
>
> Key: PARQUET-1183
> URL: https://issues.apache.org/jira/browse/PARQUET-1183
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> The ParquetWriter got a new Builder(OutputFile). 
> But it cannot be used by the AvroParquetWriter as there is no matching 
> Builder/Constructor.
> Changes are quite simple:
> public static <T> Builder<T> builder(OutputFile file) {
>   return new Builder<T>(file);
> }
> and in the static Builder class below
> private Builder(OutputFile file) {
>   super(file);
> }
> Note: I am not good enough with builds, maven and git to create a pull 
> request yet. Sorry. Will try to get better here.
> See: https://issues.apache.org/jira/browse/PARQUET-1142
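
A hedged sketch of the requested builder as it shipped in 1.10.0 (the file path and Avro schema here are illustrative):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.io.OutputFile;

final class OutputFileBuilderExample {
  static ParquetWriter<GenericRecord> open(Configuration conf) throws java.io.IOException {
    Schema schema = SchemaBuilder.record("r").fields().optionalBytes("value").endRecord();
    // The OutputFile overload decouples the writer from Hadoop's Path-only API.
    OutputFile out = HadoopOutputFile.fromPath(new Path("/tmp/example.parquet"), conf);
    return AvroParquetWriter.<GenericRecord>builder(out)
        .withSchema(schema)
        .build();
  }
}
{code}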



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)
Ryan Blue created PARQUET-1263:
--

 Summary: ParquetReader's builder should use Configuration from the 
InputFile
 Key: PARQUET-1263
 URL: https://issues.apache.org/jira/browse/PARQUET-1263
 Project: Parquet
  Issue Type: Improvement
Reporter: Ryan Blue


ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
and have a Configuration. If it is, ParquetHadoopOptions should be based on 
that configuration instance.
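
A sketch of the intended behavior (builder overload names per the current parquet-avro API; treat exact availability as an assumption): the reader should pick up the Configuration carried by the HadoopInputFile rather than constructing its options from a fresh default one.

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

final class InputFileConfExample {
  static ParquetReader<GenericRecord> open(Path path, Configuration conf)
      throws java.io.IOException {
    // HadoopInputFile carries `conf`; after this fix the builder should use
    // it instead of `new Configuration()`.
    HadoopInputFile in = HadoopInputFile.fromPath(path, conf);
    return AvroParquetReader.<GenericRecord>builder(in).build();
  }
}
{code}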



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1263) ParquetReader's builder should use Configuration from the InputFile

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1263:
---
Fix Version/s: 1.10.0

> ParquetReader's builder should use Configuration from the InputFile
> ---
>
> Key: PARQUET-1263
> URL: https://issues.apache.org/jira/browse/PARQUET-1263
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> ParquetReader can be built using an InputFile, which may be a HadoopInputFile 
> and have a Configuration. If it is, ParquetHadoopOptions should be based 
> on that configuration instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1184.

   Resolution: Won't Fix
Fix Version/s: (was: 1.10.0)

> Make DelegatingPositionOutputStream a concrete class
> 
>
> Key: PARQUET-1184
> URL: https://issues.apache.org/jira/browse/PARQUET-1184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Priority: Major
>
> I fail to understand why this is an abstract class. In my example I want to 
> write the Parquet file to a java.io.FileOutputStream, hence I have to extend 
> DelegatingPositionOutputStream, store the position, increase it 
> in all write(..) methods, and return its value in getPos().
> Doable of course, but useful? Previously yes, but now, with the OutputFile 
> changes that further decouple it from Hadoop, I believe not.
> related to: https://issues.apache.org/jira/browse/PARQUET-1142



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1184) Make DelegatingPositionOutputStream a concrete class

2018-03-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420982#comment-16420982
 ] 

Ryan Blue commented on PARQUET-1184:


The reason why this is an abstract class is so that you can use it to wrap 
implementations that provide a position, like Hadoop's FsOutputStream. It would 
not be correct to assume that the position is at the current number of bytes 
written to the underlying stream. An implementation could wrap RandomAccessFile 
and expose its seek method, which would invalidate the delegating stream's 
position.

The delegating class is present for convenience only. You don't have to use it 
and can implement your own logic as long as you implement PositionOutputStream.
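
For completeness, a minimal sketch of what the reporter describes, valid only under the caveat above: the wrapped stream must never be seeked, so counting written bytes is the position.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import org.apache.parquet.io.PositionOutputStream;

// Minimal byte-counting PositionOutputStream over any OutputStream, e.g. a
// java.io.FileOutputStream (a sketch, not part of parquet-mr).
class CountingPositionOutputStream extends PositionOutputStream {
  private final OutputStream out;
  private long pos = 0L;

  CountingPositionOutputStream(OutputStream out) {
    this.out = out;
  }

  @Override
  public long getPos() {
    return pos; // correct only because the wrapped stream is never seeked
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    pos += len;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}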

> Make DelegatingPositionOutputStream a concrete class
> 
>
> Key: PARQUET-1184
> URL: https://issues.apache.org/jira/browse/PARQUET-1184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.9.1
>Reporter: Werner Daehn
>Priority: Major
> Fix For: 1.10.0
>
>
> I fail to understand why this is an abstract class. In my example I want to 
> write the Parquet file to a java.io.FileOutputStream, hence I have to extend 
> DelegatingPositionOutputStream, store the position, increase it 
> in all write(..) methods, and return its value in getPos().
> Doable of course, but useful? Previously yes, but now, with the OutputFile 
> changes that further decouple it from Hadoop, I believe not.
> related to: https://issues.apache.org/jira/browse/PARQUET-1142



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1028:
---
Fix Version/s: 1.10.0

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1028.

Resolution: Fixed
  Assignee: Zoltan Ivanfi

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Assignee: Zoltan Ivanfi
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420962#comment-16420962
 ] 

Ryan Blue commented on PARQUET-1028:


This was fixed by PARQUET-1065. The expected sort order for INT96 is now 
UNKNOWN, so stats are discarded.
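
An illustrative sketch of the missing guard (not the actual CorruptStatistics code): INT96 min/max must be ignored regardless of how the writer-version check turns out.

{code:java}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

final class Int96StatsGuard {
  // `writerVersionCorrupt` stands in for the existing version-based check
  // that used to short-circuit before the type was ever inspected.
  static boolean shouldIgnoreStatistics(PrimitiveTypeName type, boolean writerVersionCorrupt) {
    if (type == PrimitiveTypeName.INT96) {
      return true; // sort order is UNKNOWN, so min/max are unusable
    }
    return writerVersionCorrupt;
  }
}
{code}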

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
> Fix For: 1.10.0
>
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1055) Improve the creation of ExecutorService when reading footers

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1055:
---
Fix Version/s: (was: 1.9.1)

> Improve the creation of ExecutorService when reading footers
> 
>
> Key: PARQUET-1055
> URL: https://issues.apache.org/jira/browse/PARQUET-1055
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Benoit Lacelle
>Priority: Minor
>
> Doing some benchmarks loading a large set of parquet files (3000+) from the 
> local FS, we observed some inefficiencies in the number of created threads 
> when reading footers.
> When reading footers, parquet-mr reads the configured parallelism from the 
> Hadoop configuration (default 5) and allocates 2 ExecutorServices with 5 
> threads each. This is especially inefficient if there are fewer Callables to 
> handle than the configured parallelism.
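
One way to avoid the waste described above, as a sketch rather than the eventual patch: size the pool by the number of footer tasks actually queued.

{code:java}
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FooterPoolSizing {
  // Never allocate more threads than there are footers to read; a single
  // right-sized pool replaces the two fixed 5-thread pools.
  static ExecutorService poolFor(List<Callable<?>> footerTasks, int configuredParallelism) {
    int threads = Math.max(1, Math.min(configuredParallelism, footerTasks.size()));
    return Executors.newFixedThreadPool(threads);
  }
}
{code}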



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1028) [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1028:
---
Fix Version/s: (was: 1.9.1)

> [JAVA] When reading old Spark-generated files with INT96, stats are reported 
> as valid when they aren't 
> ---
>
> Key: PARQUET-1028
> URL: https://issues.apache.org/jira/browse/PARQUET-1028
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jacques Nadeau
>Priority: Major
>
> Found that the condition 
> [here|https://github.com/apache/parquet-mr/blob/9d58b6a83aa79dcad01c3bcc2ec0a7db74ba83b1/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L55]
>  is missing a check for INT96. Since INT96 stats are also corrupt with old 
> versions of Parquet, the code here shouldn't short-circuit return.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1174) Concurrent read micro benchmarks

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1174:
---
Fix Version/s: (was: 1.9.1)

> Concurrent read micro benchmarks
> 
>
> Key: PARQUET-1174
> URL: https://issues.apache.org/jira/browse/PARQUET-1174
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Takeshi Yoshimura
>Priority: Minor
>
> parquet-benchmarks only contain read and write benchmarks with a single 
> thread.
> I add concurrent Parquet file scans like typical data-parallel computing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-796:
--
Fix Version/s: (was: 1.9.1)

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Critical
>
> The current code doesn't allow using both Delta Encoding and Dictionary 
> Encoding. If I instantiate ParquetWriter like this: 
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, true /* enableDictionary */, true /* validating */, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code: 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768
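
A hedged sketch of what the linked factory lines do, as I read them: with the dictionary enabled, the inferred writer is wired in only as the overflow fallback, so delta encoding is never the primary choice, which matches the reported behavior.

{code:java}
import org.apache.parquet.column.values.ValuesWriter;
import org.apache.parquet.column.values.dictionary.DictionaryValuesWriter;
import org.apache.parquet.column.values.fallback.FallbackValuesWriter;

final class WriterSelectionSketch {
  // `dict` is tried first; `inferred` (e.g. the delta writer) is reached
  // only after the dictionary grows past its size limit and falls back.
  static ValuesWriter select(DictionaryValuesWriter dict, ValuesWriter inferred) {
    return FallbackValuesWriter.of(dict, inferred);
  }
}
{code}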



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1153) Parquet-thrift doesn't compile with Thrift 0.10.0

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1153:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Parquet-thrift doesn't compile with Thrift 0.10.0
> -
>
> Key: PARQUET-1153
> URL: https://issues.apache.org/jira/browse/PARQUET-1153
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> Parquet-thrift doesn't compile with Thrift 0.10.0 due to THRIFT-2263. The 
> default generator parameter used for the {{--gen}} argument by the Thrift 
> Maven plugin is no longer supported; this can be fixed with an additional 
> {{java}} parameter to the Thrift Maven plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1135:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.10.0
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-777) Add new Parquet CLI tools

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-777.
---
Resolution: Fixed

> Add new Parquet CLI tools
> -
>
> Key: PARQUET-777
> URL: https://issues.apache.org/jira/browse/PARQUET-777
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.9.1
>
>
> This issue tracks adding parquet-cli from 
> [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1152) Parquet-thrift doesn't compile with Thrift 0.9.3

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1152:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Parquet-thrift doesn't compile with Thrift 0.9.3
> 
>
> Key: PARQUET-1152
> URL: https://issues.apache.org/jira/browse/PARQUET-1152
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> Parquet-thrift doesn't compile with Thrift 0.9.3, because 
> TBinaryProtocol#setReadLength method was removed.
> PARQUET-180 already addressed the problem, but only in runtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-777) Add new Parquet CLI tools

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-777:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Add new Parquet CLI tools
> -
>
> Key: PARQUET-777
> URL: https://issues.apache.org/jira/browse/PARQUET-777
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> This issue tracks adding parquet-cli from 
> [rdblue/parquet-cli|https://github.com/rdblue/parquet-cli].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1115:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Warn users when misusing parquet-tools merge
> 
>
> Key: PARQUET-1115
> URL: https://issues.apache.org/jira/browse/PARQUET-1115
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Zoltan Ivanfi
>Assignee: Nandor Kollar
>Priority: Major
> Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its 
> use is not practical, we should describe its limitations in the help text of 
> this command. Additionally, we should add a warning to the output of the 
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality, 
> because they want to achieve good performance and historically that has been 
> associated with large Parquet files. However, in practice Hive performance 
> won't change significantly after using {{parquet-tools merge}}, but Impala 
> performance will be much worse. The reason for that is that good performance 
> is not a result of large files but of large row groups (up to the HDFS 
> block size).
> However, {{parquet-tools merge}} does not merge rowgroups, it just places 
> them one after the other. It was intended to be used for Parquet files that 
> are already arranged in row groups of the desired size. When used to merge 
> many small files, the resulting file will still contain small row groups and 
> one loses most of the advantages of larger files (the only one that remains 
> is that it takes a single HDFS operation to read them).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1149) Upgrade Avro dependency to 1.8.2

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1149:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Upgrade Avro dependency to 1.8.2
> 
>
> Key: PARQUET-1149
> URL: https://issues.apache.org/jira/browse/PARQUET-1149
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 1.10.0
>
>
> I would like to update the Avro dependency to 1.8.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1141) IDs are dropped in metadata conversion

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1141:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> IDs are dropped in metadata conversion
> --
>
> Key: PARQUET-1141
> URL: https://issues.apache.org/jira/browse/PARQUET-1141
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1025:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Support new min-max statistics in parquet-mr
> 
>
> Key: PARQUET-1025
> URL: https://issues.apache.org/jira/browse/PARQUET-1025
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.9.1
>Reporter: Zoltan Ivanfi
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.10.0
>
>
> Impala started using new min-max statistics that got specified as part of 
> PARQUET-686. Support for these should be added to parquet-mr as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1077) [MR] Switch to long key ids in KEYs file

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1077:
---
Fix Version/s: (was: 1.9.1)

> [MR] Switch to long key ids in KEYs file
> 
>
> Key: PARQUET-1077
> URL: https://issues.apache.org/jira/browse/PARQUET-1077
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Lars Volker
>Assignee: Lars Volker
>Priority: Major
> Fix For: 2.0.0, 1.10.0
>
>
> PGP key ids should be longer than 32 bits, as outlined on https://evil32.com/. 
> We should fix the KEYS file in parquet-mr. I will push a PR shortly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-791) Predicate pushing down on missing columns should work on UserDefinedPredicate too

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-791:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Predicate pushing down on missing columns should work on UserDefinedPredicate 
> too
> -
>
> Key: PARQUET-791
> URL: https://issues.apache.org/jira/browse/PARQUET-791
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 1.10.0
>
>
> This is related to PARQUET-389, which fixed predicate push-down on missing 
> columns, but that fix does not cover UserDefinedPredicate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1024:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> allow for case insensitive parquet-xxx prefix in PR title
> -
>
> Key: PARQUET-1024
> URL: https://issues.apache.org/jira/browse/PARQUET-1024
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1005) Fix DumpCommand parsing to allow column projection

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1005:
---
Fix Version/s: (was: 1.9.1)
   1.10.0

> Fix DumpCommand parsing to allow column projection
> --
>
> Key: PARQUET-1005
> URL: https://issues.apache.org/jira/browse/PARQUET-1005
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.8.0, 1.8.1, 1.9.0, 2.0.0
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Fix For: 1.10.0
>
>
> The DumpCommand option -c is declared with hasArgs(), which accepts an 
> unlimited number of arguments following -c. The option's own description 
> shows that the real intent was hasArg(), so that multiple columns are 
> specified as '-c c1 -c c2 ...'. With hasArgs(), the input path is parsed as 
> an argument of -c instead of as an argument of the command itself.
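A small sketch of the difference, assuming Apache Commons CLI (the parser parquet-tools uses for its commands); the option name mirrors the report, everything else is illustrative:

{code:java}
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

public class HasArgDemo {
  public static void main(String[] args) throws Exception {
    // hasArg(): each -c consumes exactly one value, so the path stays positional
    Options options = new Options()
        .addOption(Option.builder("c").hasArg().desc("column to dump").build());

    CommandLine cmd = new DefaultParser()
        .parse(options, new String[] {"-c", "c1", "-c", "c2", "file.parquet"});

    // prints [c1, c2]
    System.out.println(java.util.Arrays.toString(cmd.getOptionValues("c")));
    // prints [file.parquet] -- with hasArgs() the path would be swallowed by -c
    System.out.println(cmd.getArgList());
  }
}
{code}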



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-801) Allow UserDefinedPredicates in DictionaryFilter

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-801:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Allow UserDefinedPredicates in DictionaryFilter
> ---
>
> Key: PARQUET-801
> URL: https://issues.apache.org/jira/browse/PARQUET-801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Patrick Woody
>Assignee: Patrick Woody
>Priority: Major
> Fix For: 1.10.0
>
>
> UserDefinedPredicate is not implemented for dictionary filtering.
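For context, a minimal sketch of a UserDefinedPredicate as FilterApi accepts one (the predicate itself is hypothetical); dictionary filtering would need to evaluate such predicates against dictionary pages the way it already does for the built-in predicates:

{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.Statistics;
import org.apache.parquet.filter2.predicate.UserDefinedPredicate;

// Hypothetical predicate: keep records where column x > 100
public class GreaterThan100 extends UserDefinedPredicate<Integer> {
  @Override
  public boolean keep(Integer value) {
    return value != null && value > 100;
  }

  @Override
  public boolean canDrop(Statistics<Integer> statistics) {
    // the chunk can be skipped if every value is <= 100
    return statistics.getMax() <= 100;
  }

  @Override
  public boolean inverseCanDrop(Statistics<Integer> statistics) {
    // not(this) can drop the chunk if every value is > 100
    return statistics.getMin() > 100;
  }
}

// usage: FilterApi.userDefined(FilterApi.intColumn("x"), GreaterThan100.class)
{code}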



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-321) Set the HDFS padding default to 8MB

2018-03-30 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-321:
--
Fix Version/s: (was: 1.9.1)
   1.10.0

> Set the HDFS padding default to 8MB
> ---
>
> Key: PARQUET-321
> URL: https://issues.apache.org/jira/browse/PARQUET-321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS 
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
> remaining space in the block or target a row group for the remaining size.
> The padding maximum controls the threshold of the amount of padding that will 
> be used. If the space left is under this threshold, it is padded. If it is 
> greater than this threshold, then the next row group is fit into the 
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we 
> want this number to be small enough that it won't prevent the library from 
> writing reasonable row groups, but larger than the minimum size row group we 
> would want to write. 8MB is 1/16th of the row group default, so I think it is 
> reasonable: we don't want a row group to be smaller than 8 MB.
> We also want this to be large enough that a few row groups in a block don't 
> cause a tiny row group to be written in the excess space. 8MB accounts for 4 
> row groups that are each 2MB undersized. In addition, it is reasonable not to 
> allow row groups under 8MB.
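A sketch of the decision this threshold controls (simplified from the description above; the names are illustrative, not ParquetFileWriter's actual fields):

{code:java}
// Illustrative only: decide between padding and fitting a smaller row group
// when the writer approaches an HDFS block boundary.
static long nextRowGroupSize(long currentPos, long dfsBlockSize,
                             long rowGroupSize, long maxPaddingSize) {
  long remaining = dfsBlockSize - (currentPos % dfsBlockSize);
  if (remaining <= maxPaddingSize) {
    // pad out the block (at most maxPaddingSize wasted) and start fresh
    return rowGroupSize;
  }
  // otherwise, target a row group that fits the remaining space
  return Math.min(rowGroupSize, remaining);
}
// with the proposed default: maxPaddingSize = 8 * 1024 * 1024
{code}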



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-03-23 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412154#comment-16412154
 ] 

Ryan Blue commented on PARQUET-1222:


I think Jim is right. IEEE-754 numbers are ordered correctly if you flip the 
sign bit and use unsigned, byte-wise comparison. I wrote a spec for encoding 
HBase keys that used this a while ago.

The reason why rule 3 works is that for normal floating point numbers, the 
significand must start with a 1. Conceptually, this means that 0.0001 and 
0.1 cannot be represented with the same exponent. Because the exponent 
basically encodes where the first set bit is in the number, it can be used for 
sorting. There is also support for very small numbers whose significand 
doesn't start with 1 (subnormals), but those must use the smallest possible 
exponent, so sorting still works.
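A quick sketch of that transform (the scheme HBase's OrderedBytes encoding uses; shown for illustration, it is not a Parquet API). Note the full trick also flips the remaining bits for negative values, which reverses their magnitude order:

{code:java}
// Map IEEE-754 double bits so that unsigned comparison (or byte-wise
// comparison of the big-endian bytes) yields a total order:
//   non-negative values: flip only the sign bit
//   negative values:     flip all bits
static long sortableBits(double value) {
  long bits = Double.doubleToLongBits(value);
  return bits ^ ((bits >> 63) | Long.MIN_VALUE);
}

// usage: Long.compareUnsigned(sortableBits(a), sortableBits(b))
// orders -Inf < ... < -0.0 < +0.0 < ... < +Inf < NaN
{code}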

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-03-12 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395711#comment-16395711
 ] 

Ryan Blue commented on PARQUET-1241:


Does anyone know what the Hadoop compression codec produces? That's what we're 
using in the Java implementation, so that's what the current LZ4 codec name 
indicates. I didn't realize there were multiple formats.

> Use LZ4 frame format
> 
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not interoperable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
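To make the incompatibility concrete, a sketch of the two container formats (assuming Apache Commons Compress >= 1.14 on the classpath; neither stream here is Parquet's own codec path):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

import org.apache.commons.compress.compressors.lz4.BlockLZ4CompressorOutputStream;
import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorOutputStream;

public class Lz4Formats {
  public static void main(String[] args) throws Exception {
    byte[] data = "example payload".getBytes("UTF-8");

    // Frame format: magic number + frame descriptor; self-describing,
    // optionally checksummed, decodable without out-of-band metadata.
    ByteArrayOutputStream framed = new ByteArrayOutputStream();
    try (OutputStream out = new FramedLZ4CompressorOutputStream(framed)) {
      out.write(data);
    }

    // Block format: a raw compressed block with no header; the reader must
    // know the framing (and often the uncompressed length) out of band.
    ByteArrayOutputStream block = new ByteArrayOutputStream();
    try (OutputStream out = new BlockLZ4CompressorOutputStream(block)) {
      out.write(data);
    }

    // The two byte streams are not interchangeable between decompressors.
    System.out.println(framed.size() + " framed vs " + block.size() + " block");
  }
}
{code}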



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1238) Invalid links found in parquet site document page

2018-02-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379687#comment-16379687
 ] 

Ryan Blue commented on PARQUET-1238:


I didn't realize the patch was for the SVN site. Thanks, I'll take a look and 
should be able to commit it as is.

> Invalid links found in parquet site document page
> -
>
> Key: PARQUET-1238
> URL: https://issues.apache.org/jira/browse/PARQUET-1238
> Project: Parquet
>  Issue Type: Bug
>Reporter: xuchuanyin
>Priority: Trivial
> Attachments: PARQUET-1238_fixed_invalid_links_in_latest_html_md.patch
>
>
> Links to pictures on the documentation page are invalid, for example in the 
> ‘File Format’ and ‘Metadata’ sections.
>  
> Links to external documents on the documentation page are invalid, for 
> example in the 'Motivation', 'Logical Types' and 'Data Pages' sections.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1238) Invalid links found in parquet site document page

2018-02-27 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378969#comment-16378969
 ] 

Ryan Blue commented on PARQUET-1238:


[~xuchuanyin], thanks for fixing this. Could you post your patch as a pull 
request on github?

> Invalid links found in parquet site document page
> -
>
> Key: PARQUET-1238
> URL: https://issues.apache.org/jira/browse/PARQUET-1238
> Project: Parquet
>  Issue Type: Bug
>Reporter: xuchuanyin
>Priority: Trivial
> Attachments: PARQUET-1238_fixed_invalid_links_in_latest_html_md.patch
>
>
> Links to pictures on the documentation page are invalid, for example in the 
> ‘File Format’ and ‘Metadata’ sections.
>  
> Links to external documents on the documentation page are invalid, for 
> example in the 'Motivation', 'Logical Types' and 'Data Pages' sections.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2018-02-26 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377382#comment-16377382
 ] 

Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM:


I don't recommend using the delta long encoding because I think we need to 
update to better encodings (specifically, the zig-zag-encoding ones in [this 
branch|https://github.com/rdblue/parquet-mr/commits/encoders]).
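(For reference, the zig-zag mapping those encoders use is the Protocol Buffers scheme; a minimal sketch, not the branch's actual writer code:)

{code:java}
// Zig-zag maps small-magnitude signed longs to small unsigned values,
// so they encode compactly as varints: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
static long zigZagEncode(long n) {
  return (n << 1) ^ (n >> 63);
}

static long zigZagDecode(long z) {
  return (z >>> 1) ^ -(z & 1);
}
{code}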

We could definitely use a better fallback, but I don't think the solution is to 
turn off dictionary encoding. If you can use dictionary encoding to get a 
smaller size, you should. The problem is when dictionary encoding needs to test 
whether another encoding would be better. It currently tests against plain and 
uses plain. We should have it test against a delta encoding and use one.

This kind of improvement is why we added PARQUET-601. We want to be able to 
test out different ways of choosing an encoding at write time. But we do not 
want to make it so that users must specify their own encodings because we want 
Parquet to select them automatically and get the choice right. PARQUET-601 is 
about testing out strategies that we release as the defaults.


was (Author: rdblue):
I don't recommend using the delta long encoding because I think we need to 
update to better encodings (specifically, the zig-zag-encoding ones in this 
branch).

We could definitely use a better fallback, but I don't think the solution is to 
turn off dictionary encoding. If you can use dictionary encoding to get a 
smaller size, you should. The problem is when dictionary encoding needs to test 
whether another encoding would be better. It currently tests against plain and 
uses plain. We should have it test against a delta encoding and use one.

This kind of improvement is why we added PARQUET-601. We want to be able to 
test out different ways of choosing an encoding at write time. But we do not 
want to make it so that users must specify their own encodings because we want 
Parquet to select them automatically and get the choice right. PARQUET-601 is 
about testing out strategies that we release as the defaults.

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Critical
> Fix For: 1.9.1
>
>
> Current code doesn't enable using both Delta Encoding and Dictionary 
> Encoding. If I instantiate ParquetWriter like this : 
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code : 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1234) Release Parquet format 2.5.0

2018-02-21 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371749#comment-16371749
 ] 

Ryan Blue commented on PARQUET-1234:


Are we going to release a 2.4.1 with the changes for column index structures? 
I'd rather not wait on a resolution to PARQUET-1222 to get that out.

> Release Parquet format 2.5.0
> 
>
> Key: PARQUET-1234
> URL: https://issues.apache.org/jira/browse/PARQUET-1234
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-787) Add a size limit for heap allocations when reading

2018-02-21 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-787.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Merged #390.

> Add a size limit for heap allocations when reading
> --
>
> Key: PARQUET-787
> URL: https://issues.apache.org/jira/browse/PARQUET-787
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> [G1GC allocates humongous objects directly in the old 
> generation|https://www.infoq.com/articles/tuning-tips-G1-GC] to avoid 
> unnecessary copies, which means that these allocations aren't garbage 
> collected until a full GC runs. Humongous objects are objects that are 50% of 
> the region size or more. Region size is at most 32MB (see the table for 
> [region size from heap 
> size|http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection#Regions]).
> Parquet currently allocates a huge buffer for each contiguous group of column 
> chunks, which in many cases is not garbage collected until a full GC. Adding 
> a size limit for the allocation size should allow users to break row groups 
> across multiple buffers so that buffers get collected when they have been 
> read.
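For context, a sketch of the idea (an illustrative helper, not the patch itself; SeekableInputStream is parquet-mr's input abstraction):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

import org.apache.parquet.io.SeekableInputStream;

// Sketch: read totalLen bytes as several buffers capped at maxAlloc each,
// keeping every allocation below the G1 "humongous" threshold (half a region).
public class ChunkedRead {
  public static List<ByteBuffer> readInChunks(
      SeekableInputStream in, long totalLen, int maxAlloc) throws IOException {
    List<ByteBuffer> buffers = new ArrayList<>();
    long remaining = totalLen;
    while (remaining > 0) {
      ByteBuffer buf = ByteBuffer.allocate((int) Math.min(remaining, maxAlloc));
      in.readFully(buf);
      buf.flip();
      buffers.add(buf);
      remaining -= buf.limit();
    }
    return buffers;
  }
}
{code}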



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-860) ParquetWriter.getDataSize NullPointerException after closed

2018-02-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370458#comment-16370458
 ] 

Ryan Blue commented on PARQUET-860:
---

The S3 file system implementation should retry and recover if it is a transient 
error. In general I'm skeptical that Parquet can reliably provide what you want.

Parquet makes no guarantee of durability until the close operation returns. As 
a consequence, you should not discard incoming records until then. This is why 
Parquet works better with systems like Kafka that have a long window of time 
where records can be replayed. In general, I would not recommend Parquet as an 
output format for other streaming systems like Flume. This works fine with MR or 
Spark where the framework itself will retry a write task when an output writer 
throws an exception.

> ParquetWriter.getDataSize NullPointerException after closed
> ---
>
> Key: PARQUET-860
> URL: https://issues.apache.org/jira/browse/PARQUET-860
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
> Environment: Linux prim 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 
> 07:24:34 CET 2016 x86_64 GNU/Linux
> openjdk version "1.8.0_112"
> OpenJDK Runtime Environment (build 1.8.0_112-b15)
> OpenJDK 64-Bit Server VM (build 25.112-b15, mixed mode)
>Reporter: Mike Mintz
>Priority: Major
>
> When I run {{ParquetWriter.getDataSize()}}, it works normally. But after I 
> call {{ParquetWriter.close()}}, subsequent calls to ParquetWriter.getDataSize 
> result in a NullPointerException.
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.getDataSize(InternalParquetRecordWriter.java:132)
>   at 
> org.apache.parquet.hadoop.ParquetWriter.getDataSize(ParquetWriter.java:314)
>   at FileBufferState.getFileSizeInBytes(FileBufferState.scala:83)
> {noformat}
> The reason for the NPE appears to be in 
> {{InternalParquetRecordWriter.getDataSize}}, where it assumes that 
> {{columnStore}} is not null.
> But the {{close()}} method calls {{flushRowGroupToStore()}} which sets 
> {{columnStore = null}}.
> I'm guessing that once the file is closed, we can just return 
> {{lastRowGroupEndPos}} since there should be no more buffered data, but I 
> don't fully understand how this class works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-860) ParquetWriter.getDataSize NullPointerException after closed

2018-02-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370317#comment-16370317
 ] 

Ryan Blue commented on PARQUET-860:
---

[~e.birukov], this issue is not related to the problem you're hitting. If you'd 
like, please open another issue for that and we can look into whether it is 
worth fixing. Most of the time, we assume that an exception in close is not 
recoverable and the entire file needs to be rewritten. You're only guaranteed 
durability when close returns successfully, so this is not causing data loss. 
Data loss is only a problem if you've already discarded the input data, but 
that is a problem with the writing application and not with Parquet.

[~mikemintz], I hadn't seen this issue before now. We can probably fix this by 
adding logic that checks whether the file was closed and saves the file 
position just after writing the footer. We've also recently added an accessor 
for the footer that is available once the file is closed, so you could also use 
that to get stats and other info if that's what you're after.
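Something along these lines would work (a sketch of the suggested check, not a committed patch; the field names follow the description above):

{code:java}
// Sketch for InternalParquetRecordWriter.getDataSize(): after close() the
// column store is flushed away, so fall back to the last known file position
// instead of dereferencing the null columnStore.
public long getDataSize() {
  if (closed) {
    return lastRowGroupEndPos;
  }
  return lastRowGroupEndPos + columnStore.getBufferedSize();
}
{code}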

> ParquetWriter.getDataSize NullPointerException after closed
> ---
>
> Key: PARQUET-860
> URL: https://issues.apache.org/jira/browse/PARQUET-860
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
> Environment: Linux prim 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 
> 07:24:34 CET 2016 x86_64 GNU/Linux
> openjdk version "1.8.0_112"
> OpenJDK Runtime Environment (build 1.8.0_112-b15)
> OpenJDK 64-Bit Server VM (build 25.112-b15, mixed mode)
>Reporter: Mike Mintz
>Priority: Major
>
> When I run {{ParquetWriter.getDataSize()}}, it works normally. But after I 
> call {{ParquetWriter.close()}}, subsequent calls to ParquetWriter.getDataSize 
> result in a NullPointerException.
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.getDataSize(InternalParquetRecordWriter.java:132)
>   at 
> org.apache.parquet.hadoop.ParquetWriter.getDataSize(ParquetWriter.java:314)
>   at FileBufferState.getFileSizeInBytes(FileBufferState.scala:83)
> {noformat}
> The reason for the NPE appears to be in 
> {{InternalParquetRecordWriter.getDataSize}}, where it assumes that 
> {{columnStore}} is not null.
> But the {{close()}} method calls {{flushRowGroupToStore()}} which sets 
> {{columnStore = null}}.
> I'm guessing that once the file is closed, we can just return 
> {{lastRowGroupEndPos}} since there should be no more buffered data, but I 
> don't fully understand how this class works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1215) Add accessor for footer after a file is closed

2018-02-15 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1215.

Resolution: Fixed

Merged #457. Thanks to [~zi] and [~gszadovszky] for the reviews!

> Add accessor for footer after a file is closed
> --
>
> Key: PARQUET-1215
> URL: https://issues.apache.org/jira/browse/PARQUET-1215
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 1.10.0
>
>
> I'm storing metrics along with Parquet files in Iceberg and need to get the 
> metrics from the Parquet footer just after writing a file. Parquet should be 
> able to return the footer after closing a file so the caller doesn't have to 
> open the file just after writing.
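For illustration, the intended usage looks roughly like this (a sketch assuming the accessor added by the merged PR; ExampleParquetWriter is parquet-mr's example writer, and path, conf, schema, and records stand in for application values):

{code:java}
ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
    .withConf(conf)
    .withType(schema)
    .build();
for (Group record : records) {
  writer.write(record);
}
writer.close();

// After close(), the footer is available without reopening the file,
// e.g. to collect per-column statistics for an external metadata store.
ParquetMetadata footer = writer.getFooter();
{code}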



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

