[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-22 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447455#comment-17447455
 ] 

Gabor Szadovszky commented on PARQUET-2103:
---

I think we need to update 
[ParquetMetadata.toJSON|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L67-L71].
 Jackson should be configurable to look at the private fields instead of the 
getter methods. I am not sure if it is a good idea or if it will work in 
every environment. Another option would be to refactor 
EncryptedColumnChunkMetaData so that "decrypt" is not called from a getter, but 
it might not be worth the effort. The easiest way would be to simply detect 
whether the metadata contains encrypted data and not log anything in that case. 
I don't know how important it might be to log the metadata for debugging.
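
For reference, a minimal sketch of that Jackson configuration, using the standard 
Jackson 2 API (parquet-mr would use the same calls on its shaded classes under 
{{shaded.parquet.com.fasterxml.jackson}}):
{code:java}
import com.fasterxml.jackson.annotation.JsonAutoDetect;
import com.fasterxml.jackson.annotation.PropertyAccessor;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper objectMapper = new ObjectMapper();
// serialize from private fields only, so no getter (and thus no decryptIfNeeded()) is ever invoked
objectMapper.setVisibility(PropertyAccessor.ALL, JsonAutoDetect.Visibility.NONE);
objectMapper.setVisibility(PropertyAccessor.FIELD, JsonAutoDetect.Visibility.ANY);
{code}
Whether this behaves identically across all the Jackson versions we may meet on 
the classpath would need to be verified.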

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
> }
> {code}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
> _*for unencrypted files*_ 
> triggers an exception:
>  
> {code}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor
>     at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>     at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
>     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
> {code}

[jira] [Resolved] (PARQUET-2101) Fix wrong descriptions about the default block size

2021-11-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2101.
---
Resolution: Fixed

> Fix wrong descriptions about the default block size
> ---
>
> Key: PARQUET-2101
> URL: https://issues.apache.org/jira/browse/PARQUET-2101
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr, parquet-protobuf
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Trivial
>
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80
> These javadocs say the default block size is 50 MB, but it's actually 128 MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2021-12-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2106:
-

Assignee: Alexey Kudinkin

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects: it accounts for up to *16%* of all 
> allocations in our benchmarks, putting substantial pressure on the garbage 
> collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We propose adjusting the lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot path of every Parquet 
> write and therefore amplifies the churn substantially.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2021-12-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2106:
--
Issue Type: Improvement  (was: Task)

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects: it accounts for up to *16%* of all 
> allocations in our benchmarks, putting substantial pressure on the garbage 
> collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We propose adjusting the lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot path of every Parquet 
> write and therefore amplifies the churn substantially.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2107) Travis failures

2021-12-08 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2107.
---
Resolution: Fixed

> Travis failures
> ---
>
> Key: PARQUET-2107
> URL: https://issues.apache.org/jira/browse/PARQUET-2107
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> There have been Travis failures in our PRs for a while. See e.g. 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2107) Travis failures

2021-12-07 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2107:
-

 Summary: Travis failures
 Key: PARQUET-2107
 URL: https://issues.apache.org/jira/browse/PARQUET-2107
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


There have been Travis failures in our PRs for a while. See e.g. 
https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381910#comment-17381910
 ] 

Gabor Szadovszky commented on PARQUET-2065:
---

I've checked this with 1.11.0 and it is reproducible, so it is not a regression in 1.12.0.

The problem is that multiple parquet-cli jars are generated in target. One is a 
slim jar (parquet-cli-1.12.0.jar) and another is a fat jar 
(parquet-cli-1.12.0-runtime.jar) that contains the avro dependency shaded. 
If all of these jars are put on the classpath (target/*), things can get mixed up. 
So I would suggest using one specific jar file from the listed ones instead of 
putting all jars from target on the classpath. The other dependency jars are 
required.
For example:
For example:
{code}
java -cp target/parquet-cli-1.12.0.jar:target/dependency/* 
org.apache.parquet.cli.Main head 
{code}

> parquet-cli not working in release 1.12.0
> -
>
> Key: PARQUET-2065
> URL: https://issues.apache.org/jira/browse/PARQUET-2065
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.0
>Reporter: Akshay Sundarraj
>Priority: Major
>
> When I run parquet-cli, I get a java.lang.NoSuchMethodError.
> Steps to reproduce:
>  # Download parquet-mr 1.12.0 from 
> [https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.12.0.tar.gz]
>  # Build and install using mvn clean install
>  # cd parquet-cli
>  # {{mvn dependency:copy-dependencies}}
>  # {{java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head}}
>  # Got the exception below:
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/amsundar/hgroot/parquet-mr-apache-parquet-1.12.0/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar)
>  to method sun.security.krb5.Config.getInstance()
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release
>  Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.parquet.avro.AvroSchemaConverter.convert(Lorg/apache/parquet/schema/MessageType;)Lorg/apache/avro/Schema;
>  at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
>  at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
>  at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
>  at org.apache.parquet.cli.Main.run(Main.java:155)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>  at org.apache.parquet.cli.Main.main(Main.java:185)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-10-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-758.
--
Resolution: Fixed

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Anja Boskovic
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-10-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-758:


Assignee: Anja Boskovic

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Anja Boskovic
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2340) appendRowGroup will loose pageIndex

2023-08-22 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757396#comment-17757396
 ] 

Gabor Szadovszky commented on PARQUET-2340:
---

[~NathanKan], I don't think these methods are used anymore. {{parquet-cli}} has 
a different concept for merging files, and that one supports column indexes AFAIK. 
[~wgtmac], could you confirm this? Maybe we can close this jira?

> appendRowGroup will loose pageIndex
> ---
>
> Key: PARQUET-2340
> URL: https://issues.apache.org/jira/browse/PARQUET-2340
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: GANHONGNAN
>Priority: Major
>
> Currently, 
> org.apache.parquet.hadoop.ParquetFileWriter#appendFile(org.apache.parquet.io.InputFile)
>  uses the appendRowGroup method to concatenate parquet row groups. However, the 
> appendRowGroup method *loses* the column index.
> {code:java}
> // code placeholder
>   public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup,
>                              boolean dropColumns) throws IOException {
>   
>       // TODO: column/offset indexes are not copied
>       // (it would require seeking to the end of the file for each row groups)
>       currentColumnIndexes.add(null);
>       currentOffsetIndexes.add(null);
>   } {code}
>  
> [https://github.com/apache/parquet-mr/blob/f8465a274b42e0a96996c76f3be0b50cf85ecf15/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1033C19-L1033C19]
>  
> Looking forward to functionality that supports appending with the page index.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2182) Handle unknown logical types

2022-08-30 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2182:
-

 Summary: Handle unknown logical types
 Key: PARQUET-2182
 URL: https://issues.apache.org/jira/browse/PARQUET-2182
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky


New logical types introduced in parquet-format shall be properly handled by 
parquet-mr releases that are not aware of the new type. In this case we shall 
read the data as if only the primitive type were defined (without a logical 
type), with one exception: we shall not use min/max based statistics (including 
column indexes), since we don't know the proper ordering of that type.
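
A minimal sketch of this rule (the names below are made up for illustration; 
this is not the parquet-mr API):
{code:java}
import java.util.Set;

public class UnknownLogicalTypeFallback {
  static boolean mayUseMinMaxStatistics(String logicalType, Set<String> knownLogicalTypes) {
    // no logical type, or a known one: the ordering is defined, so statistics are usable
    if (logicalType == null || knownLogicalTypes.contains(logicalType)) {
      return true;
    }
    // unknown logical type: the values can still be read as the bare primitive type,
    // but the proper ordering is unknown, so min/max statistics and column indexes
    // must not be used for filtering
    return false;
  }
}
{code}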




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2020) Remove deprecated modules

2022-10-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616825#comment-17616825
 ] 

Gabor Szadovszky commented on PARQUET-2020:
---

[~Unsta], the module {{parquet-cli}} is meant to substitute the functionality 
of {{parquet-tools}}, so {{parquet-cli}} might have the functionality you need. 
However, neither of them was designed to have its classes used publicly. (There 
are no guarantees the changes will be backward compatible.)
I don't think {{parquet-format-structures}} would be a good fit for such 
functionality either. This module is for reading/writing the footer and is 
also not designed to be used by our clients. 
The question is whether you need this json representation for production use or for 
debugging purposes. In the latter case we might want to create a new 
module inside parquet-mr for tools to be used from the java API. We might 
factor out some existing implementation from {{parquet-cli}} and maybe bring 
back something from {{parquet-tools}} if required.
If, however, you need reading to json (and maybe writing from it) for 
production use, I would suggest a new binding for json just like we have 
{{parquet-avro}}, {{parquet-protobuf}}, {{parquet-thrift}} etc.
Unfortunately, I won't have time to guide you through any of these choices. I 
would suggest bringing up this topic on [mailto:dev@parquet.apache.org] to reach a 
broader audience.


> Remove deprecated modules
> -
>
> Key: PARQUET-2020
> URL: https://issues.apache.org/jira/browse/PARQUET-2020
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cascading
>Affects Versions: 1.12.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.13.0
>
>
> Removes: 
>  * parquet-tools-deprecated
>  * parquet-scrooge-deprecated
>  * parquet-cascading-common23-deprecated
>  * parquet-cascading-deprecated
>  * parquet-cascading3-deprecated



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-10-10 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614907#comment-17614907
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~emkornfield],

There are a couple of docs in the parquet-format repo. The related ones are the 
one [about logical 
types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
and the main one that contains the description of the [primitive 
types|https://github.com/apache/parquet-format/blob/master/README.md#types]. 
Unfortunately, the latter does not contain anything about sorting order.
So, I think, we need to do the following:
* Define the sorting order for the primitive types or reference the logical 
types description for it. (In most cases it would be a reference, since the 
ordering depends on the related logical type, e.g. signed/unsigned sorting of 
integral types.)
* After defining the sorting order of the primitive floating point numbers 
based on what we've discussed above, reference it from the new half-precision FP 
logical type.

(Another unfortunate thing is that we have some specification-like docs at the 
[parquet site|https://parquet.apache.org] as well. I think we should propagate 
the parquet-format docs there automatically or simply link them from the site. 
But that is clearly a different topic.)

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611398#comment-17611398
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~emkornfield], I think we do not need to handle NaN values with a boolean to 
fix this issue. NaN is kind of similar to null values, so we may even count 
them instead of having a boolean, but this question is not tightly related to 
this topic.
What do you think about elevating the current suggestion in the thrift file to 
specification level for writing/reading FP min/max values?
{quote}Because the sorting order is not specified properly for floating point 
values (relations vs. total ordering) the following compatibility rules should 
be applied when reading statistics:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.{quote}
For writing we shall skip NaN values and use -0 for min and +0 for max any time 
a 0 is to be taken into account.
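
For illustration, a minimal sketch of applying the read-side rules above when 
deciding whether a row group can be dropped ({{Double.compare}} is used because 
it distinguishes -0.0 from +0.0, unlike the primitive operators):
{code:java}
// illustrative only, not the parquet-mr implementation
static boolean canDropRowGroup(double min, double max, double searched) {
  if (Double.isNaN(searched)) {
    return false; // when looking for NaN values, min and max should be ignored
  }
  if (Double.isNaN(min) || Double.isNaN(max)) {
    return false; // NaN statistics should be ignored
  }
  if (Double.compare(min, +0.0) == 0) {
    min = -0.0; // a +0 min may hide -0 values in the row group
  }
  if (Double.compare(max, -0.0) == 0) {
    max = +0.0; // a -0 max may hide +0 values in the row group
  }
  return Double.compare(searched, min) < 0 || Double.compare(searched, max) > 0;
}
{code}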

With this solution we cannot do anything clever when searching for a NaN, 
but that can be fixed separately. And we also need to double-check whether we 
really ignore the min/max stats when searching for a NaN.

I think it is a good idea to discuss such topics on the mailing list. However, 
we should also time-box the discussion and go forward with a proposed solution 
if there is no interest on the mailing list. (Personally, I do not follow the 
dev list anymore.)


> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2220) Parquet Filter predicate storing nested string causing OOM's

2022-12-31 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653313#comment-17653313
 ] 

Gabor Szadovszky commented on PARQUET-2220:
---

[~abhiSumo304], I agree that eagerly storing the toString value is not a good idea. 
I don't think it has a proper use case either: toString should be used for 
debugging purposes anyway, so eagerly storing the value does not really make 
sense. Unfortunately, I don't work on the Parquet code base actively anymore. 
Feel free to put up a PR to fix this and I'll try to review it in time.
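
For illustration, a minimal sketch of such a fix, computing the string form on 
demand instead of caching it in a field (illustrative only, not an actual patch):
{code:java}
import java.util.Locale;
import java.util.Objects;

abstract class LazyToStringPredicate<T> {
  private final String columnPath; // stands in for column.getColumnPath().toDotString()
  private final T value;

  protected LazyToStringPredicate(String columnPath, T value) {
    this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
    this.value = value;
  }

  @Override
  public String toString() {
    // built only when actually needed, e.g. for debug logging
    String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
    return name + "(" + columnPath + ", " + value + ")";
  }
}
{code}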

> Parquet Filter predicate storing nested string causing OOM's
> 
>
> Key: PARQUET-2220
> URL: https://issues.apache.org/jira/browse/PARQUET-2220
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Abhishek Jain
>Priority: Critical
>
> Each instance of ColumnFilterPredicate eagerly stores the filter values in the 
> toString variable, which is not useful:
> {code:java}
> static abstract class ColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
>   private final Column<T> column;
>   private final T value;
>   private final String toString;
>
>   protected ColumnFilterPredicate(Column<T> column, T value) {
>     this.column = Objects.requireNonNull(column, "column cannot be null");
>     // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not, so they guard against
>     // null in their own constructors.
>     this.value = value;
>     String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
>     this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + value + ")";
>   }
> }{code}
>  
>  
> If your filter predicate is long or deeply nested, this can take a lot of memory 
> while creating the filter.
> We have seen this grow to 4 GB of space in our production systems while opening 
> multiple parquet readers.
> The same thing is replicated in BinaryLogicalFilterPredicate, where toString is 
> eagerly calculated and stored, and a lot of duplication happens 
> while building And/Or filters.
> I did not find a use case for storing it so eagerly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1980) Build and test Apache Parquet on ARM64 CPU architecture

2023-01-10 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17656754#comment-17656754
 ] 

Gabor Szadovszky commented on PARQUET-1980:
---

Perfect. Thank you, [~mgrigorov]!

> Build and test Apache Parquet on ARM64 CPU architecture
> ---
>
> Key: PARQUET-1980
> URL: https://issues.apache.org/jira/browse/PARQUET-1980
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-format
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> More and more deployments are being done on ARM64 machines.
> It would be good to make sure the Parquet MR project builds fine on it.
> The project moved from TravisCI to GitHub Actions recently (PARQUET-1969), but 
> .travis.yml could be re-introduced for ARM64 until GitHub Actions provides 
> aarch64 nodes!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2159) Parquet bit-packing de/encode optimization

2022-11-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2159:
-

Assignee: Fang-Xie

> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses Parquet-mr as its parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization of the Parquet bit-packing en/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Since the Vector API has been part of OpenJDK since version 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* 
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and tested 
> the parquet batch reader ability via Spark's VectorizedParquetRecordReader, 
> which gets parquet column data in a batched way. We constructed parquet files 
> with different row and column counts; the column data type is Int32, the 
> maximum int value is 127 (which satisfies bit-pack encoding with bit width=7), 
> the row count ranges from 10k to 100 million and the column count from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2226) Support merge Bloom Filter

2023-01-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2226.
---
Resolution: Fixed

> Support merge Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: miracle
>Priority: Major
>
> We need to collect Parquet's bloom filters from multiple files and then 
> synthesize a more comprehensive bloom filter for common use. 
> Guava supports similar API operations:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252
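
For illustration, a minimal sketch of such a merge using Guava's own BloomFilter 
(this is Guava's API, not Parquet's):
{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

BloomFilter<Long> a = BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.01);
BloomFilter<Long> b = BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.01);
a.put(1L);
b.put(2L);
if (a.isCompatible(b)) { // same funnel, size and hash strategy required
  a.putAll(b);           // 'a' now covers values inserted into either filter
}
assert a.mightContain(2L);
{code}
A Parquet-level merge would need the same kind of compatibility checks on the 
column's bloom filter parameters.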



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter

2023-01-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2226:
-

Assignee: miracle

> Support merge Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: miracle
>Priority: Major
>
> We need to collect Parquet's bloom filters from multiple files and then 
> synthesize a more comprehensive bloom filter for common use. 
> Guava supports similar API operations:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter

2023-01-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2226:
-

Assignee: (was: miracle)

> Support merge Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect Parquet's bloom filters from multiple files and then 
> synthesize a more comprehensive bloom filter for common use. 
> Guava supports similar API operations:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2226) Support merge Bloom Filter

2023-01-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2226:
-

Assignee: miracle

> Support merge Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: miracle
>Priority: Major
>
> We need to collect Parquet's bloom filters from multiple files and then 
> synthesize a more comprehensive bloom filter for common use. 
> Guava supports similar API operations:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (PARQUET-1980) Build and test Apache Parquet on ARM64 CPU architecture

2023-01-08 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reopened PARQUET-1980:
---

[~mgrigorov],

The PMC just got a note from Apache IT that they are about to "move away from 
Travis at the beginning of 2023". I don't know if GitHub Actions is now 
suitable for ARM64 or if there are any other solutions for this. If you have time, 
could you please take a look?

> Build and test Apache Parquet on ARM64 CPU architecture
> ---
>
> Key: PARQUET-1980
> URL: https://issues.apache.org/jira/browse/PARQUET-1980
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-format
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> More and more deployments are being done on ARM64 machines.
> It would be good to make sure the Parquet MR project builds fine on it.
> The project moved from TravisCI to GitHub Actions recently (PARQUET-1969), but 
> .travis.yml could be re-introduced for ARM64 until GitHub Actions provides 
> aarch64 nodes!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2254:
-

Assignee: Mars

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios it is actually not known how many distinct values there are. 
> If the BloomFilter can be automatically sized according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user can specify a maximum BloomFilter size; we then build 
> multiple BloomFilters at the same time and use the largest 
> BloomFilter as a counting tool (if there is no hit when inserting a value, 
> the counter is incremented by 1; of course this may be imprecise but it is enough).
> Then at the end of the write, we choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions, thank you.
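
For illustration, a rough sketch of the proposed approach using Guava's 
BloomFilter (the candidate sizes and the counting rule below are assumptions 
based on the description):
{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.ArrayList;
import java.util.List;

public class AdaptiveBloomFilterSketch {
  public static void main(String[] args) {
    int[] candidateNdv = {1_000, 10_000, 100_000}; // assumed sizes up to the user-set maximum
    List<BloomFilter<Long>> candidates = new ArrayList<>();
    for (int ndv : candidateNdv) {
      candidates.add(BloomFilter.create(Funnels.longFunnel(), ndv, 0.01));
    }
    BloomFilter<Long> largest = candidates.get(candidates.size() - 1);
    long approxNdv = 0;
    for (long value : new long[] {1, 2, 2, 3, 3, 3}) {
      if (!largest.mightContain(value)) {
        approxNdv++; // no hit in the largest filter: likely a new distinct value (imprecise but enough)
      }
      for (BloomFilter<Long> candidate : candidates) {
        candidate.put(value);
      }
    }
    // when the file is finally written, keep the smallest candidate sized for the observed NDV
    int chosen = 0;
    while (chosen < candidateNdv.length - 1 && candidateNdv[chosen] < approxNdv) {
      chosen++;
    }
    System.out.println("approx NDV = " + approxNdv + ", chosen candidate #" + chosen);
  }
}
{code}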



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697301#comment-17697301
 ] 

Gabor Szadovszky commented on PARQUET-2254:
---

I think this is a good idea. Meanwhile, it would increase the memory footprint 
of the writer. However, if you plan to keep the current logic where the user 
decides which columns bloom filters are generated for, it should be 
acceptable.
However, I think we need to take one step back and investigate/synchronize the 
efforts around row group filtering. Or maybe it is only me for whom the 
following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than 
reading the bloom filter? Bloom filters should usually be smaller than 
dictionaries and faster to scan for a value.
* Based on the previous one: if we decide that it might be worth reading the bloom 
filter before the dictionary, it also questions the logic of not writing bloom 
filters when the whole column chunk is dictionary encoded.
* Meanwhile, if the whole column chunk is dictionary encoded but the dictionary 
is still small (the redundancy is high), then it might not be worth writing a bloom 
filter, since checking the dictionary might be cheaper.
What do you think?

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios it is actually not known how many distinct values there are. 
> If the BloomFilter can be automatically sized according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user can specify a maximum BloomFilter size; we then build 
> multiple BloomFilters at the same time and use the largest 
> BloomFilter as a counting tool (if there is no hit when inserting a value, 
> the counter is incremented by 1; of course this may be imprecise but it is enough).
> Then at the end of the write, we choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions, thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure

2023-03-17 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701568#comment-17701568
 ] 

Gabor Szadovszky commented on PARQUET-2258:
---

Thanks for fixing this, [~abstractdog]!
As far as I understood, this is not a serious issue, so I don't think we need to 
include it in a patch release. If you agree, please update the version number to 
{{1.13.0}}. (I usually don't bother selecting version numbers that are targeted 
by {{master}}. We'll set them in a bulk update based on the changelog.)

> Storing toString fields in FilterPredicate instances can lead to memory 
> pressure
> 
>
> Key: PARQUET-2258
> URL: https://issues.apache.org/jira/browse/PARQUET-2258
> Project: Parquet
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
> Fix For: 1.12.3
>
> Attachments: Parquet_Predicate_toString_memory.png, 
> image-2023-03-14-13-27-54-008.png
>
>
> It happens with Hive (HiveServer2): a certain number of predicate instances 
> can make HiveServer2 OOM. According to the heap dump and background 
> information, the predicates must have been simplified a bit, but still, 
> storing toString in the objects looks very weird.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701575#comment-17701575
 ] 

Gabor Szadovszky commented on PARQUET-2256:
---

[~mwish], would you mind doing some investigation before this update? Let's 
get the binary data of a mentioned 2 MB bloom filter and compress it with some 
codecs to see the gain. If the ratio is good, it might be worth adding this 
feature. It is also worth mentioning that compressing the bloom filter might 
hurt filtering performance.
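
For illustration, a minimal sketch of that experiment using the JDK's built-in 
DEFLATE (a real probe should feed actual bloom filter bits; at a high fill 
factor they are close to random, which is roughly the worst case modeled here):
{code:java}
import java.util.Random;
import java.util.zip.Deflater;

public class BloomFilterCompressionProbe {
  public static void main(String[] args) {
    byte[] bitset = new byte[2 * 1024 * 1024]; // stand-in for a 2 MB bloom filter
    new Random(42).nextBytes(bitset);          // random bits barely compress
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(bitset);
    deflater.finish();
    byte[] out = new byte[bitset.length + 1024];
    int compressed = 0;
    while (!deflater.finished()) {
      compressed += deflater.deflate(out, compressed, out.length - compressed);
    }
    System.out.printf("compressed %d -> %d bytes (ratio %.3f)%n",
        bitset.length, compressed, (double) compressed / bitset.length);
  }
}
{code}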

> Adding Compression for BloomFilter
> --
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.9.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>
> In current Parquet implementations, if the BloomFilter's ndv is not set, most 
> implementations will guess 1M as the ndv and use it with the fpp. So, if the fpp is 
> 0.01, the BloomFilter size may grow to 2 MB for each column, which is really 
> huge. Should we support compression for the BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2256:
-

Assignee: Xuwei Fu

> Adding Compression for BloomFilter
> --
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.9.0
>Reporter: Xuwei Fu
>Assignee: Xuwei Fu
>Priority: Major
>
> In current Parquet implementations, if the BloomFilter's ndv is not set, most 
> implementations will guess 1M as the ndv and use it with the fpp. So, if the fpp is 
> 0.01, the BloomFilter size may grow to 2 MB for each column, which is really 
> huge. Should we support compression for the BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701561#comment-17701561
 ] 

Gabor Szadovszky commented on PARQUET-1690:
---

[~humanoid], I don't know/remember the background of this issue and the closed 
PRs. I think it would be best to start over with a new PR.
[~sha...@uber.com], do you remember why the last PR was closed and not 
reviewed/submitted?

> Integer Overflow of BinaryStatistics#isSmallerThan()
> 
>
> Key: PARQUET-1690
> URL: https://issues.apache.org/jira/browse/PARQUET-1690
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
>
> "(min.length() + max.length()) < size" didn't handle integer overflow 
> [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L103]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

2023-03-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697510#comment-17697510
 ] 

Gabor Szadovszky commented on PARQUET-2254:
---

1) I think, for creating bloom filters, we have the statistics to decide how 
much space the bloom filter shall occupy (we have the actual data). What we 
don't know is whether the bloom filter in itself will be useful or not. (Would there 
be filtering on the related column, and would it use Eq/NotEq/IsIn etc. 
predicates?) That shall be decided by the client via the already introduced 
properties. We do not write bloom filters by default anyway.
2) Of course it is hard to be smart for PPD, since we don't know the actual data 
(we are just before reading it). But there is an actual order of checking the 
row group filters: statistics, dictionary, bloom filter. Checking the 
statistics first is obviously correct. What I am not sure about is whether we want 
to check the dictionary first and then the bloom filter, or the other way around. 
Because of that question I am also unsure whether it is good practice to not write 
bloom filters if the whole column chunk is dictionary encoded.

> Build a BloomFilter with a more precise size
> 
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Assignee: Mars
>Priority: Major
>
> Now the usage is to specify the size and then build the BloomFilter. In general 
> scenarios it is actually not known how many distinct values there are. 
> If the BloomFilter can be automatically sized according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user can specify a maximum BloomFilter size; we then build 
> multiple BloomFilters at the same time and use the largest 
> BloomFilter as a counting tool (if there is no hit when inserting a value, 
> the counter is incremented by 1; of course this may be imprecise but it is enough).
> Then at the end of the write, we choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions, thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2255) BloomFilter and float point is ambiguous

2023-03-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699732#comment-17699732
 ] 

Gabor Szadovszky commented on PARQUET-2255:
---

But we don't build the dictionary for filtering but for encoding. We should not 
add anything other than what we have in the pages, so anything extra should 
happen on the read path.

Maybe we do not need to handle +0.0 and -0.0 differently from the other values. 
(We needed to handle them separately for min/max values because the comparison 
is not trivial and there were actual issues.) If someone deals with FP numbers 
they should know about the difference between +0.0 and -0.0. 

Because the FP spec allows multiple NaN values (even though Java uses one 
actual bit pattern for it), we need to avoid using the Bloom filter in this case. 
The dictionary is a different thing, because we deserialize it to Java Double/Float 
values in a Set, so we will have one NaN value that is the very same one we are 
searching for. (It is more for the other implementations to deal with NaN if 
the language has several NaN values.)
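
For illustration, how Java collapses NaN payloads while the raw bit patterns differ:
{code:java}
public class NanBits {
  public static void main(String[] args) {
    double canonical = Double.longBitsToDouble(0x7ff8000000000000L); // Java's canonical NaN
    double other = Double.longBitsToDouble(0x7ff8000000000001L);     // a NaN with a different payload
    // doubleToLongBits collapses every NaN to the canonical bit pattern...
    System.out.println(Double.doubleToLongBits(canonical) == Double.doubleToLongBits(other)); // true
    // ...while the raw bit patterns can differ, which is what a bloom filter hashing raw bytes would see
    System.out.println(Double.doubleToRawLongBits(canonical) == Double.doubleToRawLongBits(other)); // false (typically)
  }
}
{code}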

> BloomFilter and float point is ambiguous
> 
>
> Key: PARQUET-2255
> URL: https://issues.apache.org/jira/browse/PARQUET-2255
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Xuwei Fu
>Priority: Major
> Fix For: format-2.9.0
>
>
> Currently, our Parquet can use a BloomFilter for any physical type. However, 
> when a BloomFilter is applied to floats:
>  # What do +0 and -0 mean? Are they equal?
>  # Should qNaN and sNaN be written to the BloomFilter? Are they equal?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2255) BloomFilter and float point is ambiguous

2023-03-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17699712#comment-17699712
 ] 

Gabor Szadovszky commented on PARQUET-2255:
---

Bloom filters are for searching for exact values. Exact checking of floating 
point numbers is usually a code smell: checking whether the difference is 
below an epsilon value is suggested over using exact equality. I am wondering 
if there is a real use case for searching for an exact floating point number. 
Maybe disabling bloom filters completely for FP numbers is the simplest choice 
and probably won't bother anyone.
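
For illustration, the usual epsilon comparison versus exact equality:
{code:java}
public class EpsilonCompare {
  static boolean nearlyEqual(double a, double b, double eps) {
    return Math.abs(a - b) <= eps;
  }

  public static void main(String[] args) {
    double x = 0.1 + 0.2;
    System.out.println(x == 0.3);                  // false: exact equality is brittle
    System.out.println(nearlyEqual(x, 0.3, 1e-9)); // true: epsilon comparison
  }
}
{code}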

If we still want to handle FP bloom filters, I agree with [~wgtmac]'s proposal. 
(It is similar to the approach we implemented for min/max values.) Keep in mind that 
we need to handle the case when someone wants to filter on a NaN.



> BloomFilter and float point is ambiguous
> 
>
> Key: PARQUET-2255
> URL: https://issues.apache.org/jira/browse/PARQUET-2255
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Xuwei Fu
>Priority: Major
> Fix For: format-2.9.0
>
>
> Currently, our Parquet can use BloomFilter for any physical types. However, 
> when BloomFilter apply on float:
>  # What does +0 -0 means? Are they equal?
>  # Should qNaN sNaN written in BloomFilter? Are they equal?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2247.
---
Resolution: Fixed

> Fail-fast if CapacityByteArrayOutputStream write overflow
> -
>
> Key: PARQUET-2247
> URL: https://issues.apache.org/jira/browse/PARQUET-2247
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
>
> The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some 
> large byte data, resulting in parquet file write corruption.
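
For illustration, a minimal sketch of the kind of fail-fast guard this issue 
asks for (illustrative only, not the actual patch):
{code:java}
class OverflowCheckedCounter {
  private int bytesUsed;

  void add(int len) {
    try {
      // fail fast instead of silently wrapping around and corrupting the file
      bytesUsed = Math.addExact(bytesUsed, len);
    } catch (ArithmeticException e) {
      throw new IllegalStateException(
          "CapacityByteArrayOutputStream overflow: " + bytesUsed + " + " + len, e);
    }
  }
}
{code}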



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2247:
-

Assignee: dzcxzl  (was: Gabor Szadovszky)

> Fail-fast if CapacityByteArrayOutputStream write overflow
> -
>
> Key: PARQUET-2247
> URL: https://issues.apache.org/jira/browse/PARQUET-2247
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
>
> The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some 
> large byte data, resulting in parquet file write corruption.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2247:
-

Assignee: Gabor Szadovszky

> Fail-fast if CapacityByteArrayOutputStream write overflow
> -
>
> Key: PARQUET-2247
> URL: https://issues.apache.org/jira/browse/PARQUET-2247
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: dzcxzl
>Assignee: Gabor Szadovszky
>Priority: Critical
>
> The bytesUsed of CapacityByteArrayOutputStream may overflow when writing some 
> large byte data, resulting in parquet file write corruption.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-21 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2241.
---
Resolution: Fixed

> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Affects Versions: format-2.8.0
>Reporter: Xuwei Fu
>Assignee: Gang Wu
>Priority: Major
>
>  
> This problem is shown in this issue: 
> [https://github.com/apache/arrow/issues/15173]
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, 
> but uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding, for DATA_PAGE_V2 the decoder can know the num_values and num_nulls in 
> the page; however, in DATA_PAGE_V1, without statistics, we have to read the 
> def-levels and rep-levels to get the real number of values. And without the 
> number of values, we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.
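
For illustration, a minimal sketch of BYTE_STREAM_SPLIT encoding for floats, 
which shows why the decoder must know the exact value count (it is the stride 
between the byte streams):
{code:java}
// illustrative only; byte b of each value goes into stream b, streams laid out back to back
static byte[] encodeByteStreamSplit(float[] values) {
  byte[] out = new byte[values.length * 4];
  for (int i = 0; i < values.length; i++) {
    int bits = Float.floatToIntBits(values[i]);
    for (int b = 0; b < 4; b++) {
      out[b * values.length + i] = (byte) (bits >>> (8 * b)); // stream b starts at b * values.length
    }
  }
  return out;
}
{code}
With nulls in a DATA_PAGE_V1, values.length is unknown without decoding the 
levels, so the stream boundaries cannot be located.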



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2243) Support zstd-jni in DirectCodecFactory

2023-02-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2243.
---
Resolution: Fixed

> Support zstd-jni in DirectCodecFactory
> --
>
> Key: PARQUET-2243
> URL: https://issues.apache.org/jira/browse/PARQUET-2243
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> During the switch to zstd-jni (from the Hadoop native zstd codec) we missed 
> adding proper implementations for {{DirectCodecFactory}}. Currently, an NPE occurs 
> when {{DirectCodecFactory}} is used while reading/writing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2246) Add short circuit logic to column index filter

2023-02-23 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2246:
-

Assignee: Yujiang Zhong

> Add short circuit logic to column index filter
> --
>
> Key: PARQUET-2246
> URL: https://issues.apache.org/jira/browse/PARQUET-2246
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Yujiang Zhong
>Assignee: Yujiang Zhong
>Priority: Minor
>
> ColumnIndexFilter can be optimized by adding short-circuit logic to `AND` and 
> `OR` operations. It's not necessary to evaluate the right node in some 
> cases (see the sketch below):
>  * If the left result row ranges of `AND` are empty
>  * If the left result row ranges of `OR` are the full range of the row group
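
For illustration, a minimal sketch of that short-circuit logic, with row ranges 
simplified to a BitSet over row indices (not the actual RowRanges type):
{code:java}
import java.util.BitSet;
import java.util.function.Supplier;

public class ShortCircuitColumnIndexFilter {
  // AND: if the left side already matches no rows, the right node need not be evaluated
  static BitSet and(Supplier<BitSet> left, Supplier<BitSet> right) {
    BitSet l = left.get();
    if (l.isEmpty()) {
      return l;
    }
    l.and(right.get());
    return l;
  }

  // OR: if the left side already matches the whole row group, skip the right node
  static BitSet or(int rowCount, Supplier<BitSet> left, Supplier<BitSet> right) {
    BitSet l = left.get();
    if (l.cardinality() == rowCount) {
      return l;
    }
    l.or(right.get());
    return l;
  }
}
{code}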



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2246) Add short circuit logic to column index filter

2023-02-23 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2246.
---
Resolution: Fixed

> Add short circuit logic to column index filter
> --
>
> Key: PARQUET-2246
> URL: https://issues.apache.org/jira/browse/PARQUET-2246
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Yujiang Zhong
>Assignee: Yujiang Zhong
>Priority: Minor
>
> ColumnIndexFilter can be optimized by adding short-circuit logic to `AND` and 
> `OR` operations. It is not necessary to evaluate the right node in some 
> cases:
>  * If the left result row ranges of an `AND` are empty
>  * If the left result row ranges of an `OR` are the full range of the row-group



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-21 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2228.
---
Resolution: Fixed

> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and have the rewriter merge them into 
> a single file, with or without other rewrite options specified.
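> A rough usage sketch under the assumption of a list-based RewriteOptions 
> builder in parquet-mr; the exact signatures may differ between releases.
> {code:java}
> Configuration conf = new Configuration();
> List<Path> inputs = Arrays.asList(
>     new Path("part-0.parquet"), new Path("part-1.parquet"));
> RewriteOptions options =
>     new RewriteOptions.Builder(conf, inputs, new Path("merged.parquet")).build();
> 
> ParquetRewriter rewriter = new ParquetRewriter(options);
> rewriter.processBlocks(); // copies/merges the row groups of all inputs
> rewriter.close();
> {code}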



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-04-18 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713635#comment-17713635
 ] 

Gabor Szadovszky commented on PARQUET-2276:
---

I think it is fine to drop support for older systems from time to time. It is 
unfortunate, though, that it was not properly advertised in PARQUET-2158 that 
we did not simply upgrade the Hadoop version in our build but made it 
incompatible with hadoop2. 
Meanwhile, I think it is fine to re-add support for hadoop2 if it is 
practically feasible and won't break the hadoop3 support. 

> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Priority: Major
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].
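> A hedged illustration of one possible workaround (plain JDK reflection, not 
> the actual fix): detect whether the running Hadoop version provides the 
> {{hasCapability}} API before calling it, and fall back to the plain stream 
> wrapper otherwise.
> {code:java}
> import org.apache.hadoop.fs.FSDataInputStream;
> 
> static boolean supportsHasCapability(FSDataInputStream stream) {
>   try {
>     stream.getClass().getMethod("hasCapability", String.class);
>     return true;
>   } catch (NoSuchMethodException e) {
>     return false; // Hadoop 2.8.x and older: the API does not exist
>   }
> }
> {code}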



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-14 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688363#comment-17688363
 ] 

Gabor Szadovszky edited comment on PARQUET-2241 at 2/14/23 8:37 AM:


[~wgtmac], related to your question about production: I haven't seen any usage 
of BYTE_STREAM_SPLIT in prod. The production environments I have been working 
on were stuck with Parquet v1 encodings.


was (Author: gszadovszky):
[~wgtmac], related to your question about production: I haven't seen any usage 
of BYTE_STREAM_SPLIT in prod. The production environments I have been working 
on was stuck with Parquet v1 encodings.

> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Affects Versions: format-2.8.0
>Reporter: Xuwei Fu
>Assignee: Gang Wu
>Priority: Major
>
>  
> This problem is shown in this issue: 
> https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for 
> BYTE_STREAM_SPLIT, but uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the decoder can know the num_values and 
> num_nulls from the page; however, for a DATA_PAGE_V1 without statistics, we 
> have to read the def-levels and rep-levels to get the real number of values. 
> Without that count, we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-14 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688363#comment-17688363
 ] 

Gabor Szadovszky commented on PARQUET-2241:
---

[~wgtmac], related to your question about production: I haven't seen any usage 
of BYTE_STREAM_SPLIT in prod. The production environments I have been working 
on were stuck with Parquet v1 encodings.

> ByteStreamSplitDecoder broken in presence of nulls
> --
>
> Key: PARQUET-2241
> URL: https://issues.apache.org/jira/browse/PARQUET-2241
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Affects Versions: format-2.8.0
>Reporter: Xuwei Fu
>Assignee: Gang Wu
>Priority: Major
>
>  
> This problem is shown in this issue: 
> https://github.com/apache/arrow/issues/15173
> Let me talk about it briefly:
> * The encoder doesn't write "num_values" in the page payload for 
> BYTE_STREAM_SPLIT, but uses "num_values" as the stride in BYTE_STREAM_SPLIT.
> * When decoding a DATA_PAGE_V2, the decoder can know the num_values and 
> num_nulls from the page; however, for a DATA_PAGE_V1 without statistics, we 
> have to read the def-levels and rep-levels to get the real number of values. 
> Without that count, we aren't able to decode BYTE_STREAM_SPLIT correctly.
>  
> The bug-reproducing code is in the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2243) Support zstd-jni in DirectCodecFactory

2023-02-14 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2243:
-

 Summary: Support zstd-jni in DirectCodecFactory
 Key: PARQUET-2243
 URL: https://issues.apache.org/jira/browse/PARQUET-2243
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


When switching to zstd-jni (from the Hadoop native zstd codec) we missed 
adding proper implementations for {{DirectCodecFactory}}. Currently, an NPE 
occurs when {{DirectCodecFactory}} is used while reading/writing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2244:
-

Assignee: Yujiang Zhong

> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> ---
>
> Key: PARQUET-2244
> URL: https://issues.apache.org/jira/browse/PARQUET-2244
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Yujiang Zhong
>Assignee: Yujiang Zhong
>Priority: Major
>
> Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on 
> optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dict encoded; `c1` has 
> exactly two distinct values: ['foo', null], and the predicate is `c1 not 
> in ('foo', 'bar')`. 
> Now the dictionary filter may skip a row-group that actually should not be 
> skipped, because there are nulls in the column.
>  
> This is a bug similar to #1510.
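> A hypothetical reproduction sketch using the filter API (assuming 
> {{FilterApi.notIn}} as added by PARQUET-1968); with nulls present in `c1`, 
> the dictionary filter must keep this row-group:
> {code:java}
> import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
> import static org.apache.parquet.filter2.predicate.FilterApi.notIn;
> 
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.parquet.filter2.predicate.FilterPredicate;
> import org.apache.parquet.io.api.Binary;
> 
> Set<Binary> values = new HashSet<>();
> values.add(Binary.fromString("foo"));
> values.add(Binary.fromString("bar"));
> // rows where c1 is null must survive row-group pruning
> FilterPredicate predicate = notIn(binaryColumn("c1"), values);
> {code}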



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-15 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2244.
---
Resolution: Fixed

> Dictionary filter may skip row-groups incorrectly when evaluating notIn
> ---
>
> Key: PARQUET-2244
> URL: https://issues.apache.org/jira/browse/PARQUET-2244
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Yujiang Zhong
>Assignee: Yujiang Zhong
>Priority: Major
>
> Dictionary filter may skip row-groups incorrectly when evaluating `notIn` on 
> optional columns with null values. Here is an example:
> Say there is an optional column `c1` with all pages dict encoded; `c1` has 
> exactly two distinct values: ['foo', null], and the predicate is `c1 not 
> in ('foo', 'bar')`. 
> Now the dictionary filter may skip a row-group that actually should not be 
> skipped, because there are nulls in the column.
>  
> This is a bug similar to #1510.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-06-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729214#comment-17729214
 ] 

Gabor Szadovszky commented on PARQUET-758:
--

Hey everyone who is interested in the half-float type,

When I reviewed the format change, it seemed obvious to me to use the "2-byte 
IEEE little-endian format". Now I have come across another approach to 
encoding 2-byte FP numbers: 
[bfloat16|https://en.wikipedia.org/wiki/Bfloat16_floating-point_format]. Since 
neither Java nor C++ supports 2-byte FP numbers natively, we will probably 
need to convert the encoded numbers to {{float}}. For {{bfloat16}} it would be 
more performant to do so (see the sketch below).
It might be worth adding {{bfloat16}} to the format as well and adding 
implementations for it in the same round. WDYT?
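An illustration of the conversion-cost difference (a sketch, not a proposed 
parquet-mr API): {{bfloat16}} is simply the top 16 bits of an IEEE float32, so 
widening it is a single shift, while IEEE float16 requires remapping the 
exponent and mantissa.
{code:java}
static float bfloat16ToFloat(short bits) {
  // reattach the 16 truncated low-order mantissa bits as zeros
  return Float.intBitsToFloat((bits & 0xFFFF) << 16);
}

static short floatToBfloat16(float value) {
  // plain truncation; a production version would round to nearest even
  return (short) (Float.floatToIntBits(value) >>> 16);
}
{code}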

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2222:
-

Assignee: Gang Wu

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.
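> A small sketch of the corrected behavior (the helper names here are 
> hypothetical): a reader must take the length from the prepended 4 bytes for 
> V1 data pages only, and from the page header for V2 data pages.
> {code:java}
> int encodedLength;
> if (pageIsV1) {
>   encodedLength = readLittleEndianInt32(in); // <length> prefix exists only in V1
> } else {
>   encodedLength = lengthFromPageHeader;      // V2: no prepended length
> }
> {code}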



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730904#comment-17730904
 ] 

Gabor Szadovszky commented on PARQUET-2222:
---

[~apitrou], [~wgtmac],

It seems my review was not deep enough, sorry for that. So, parquet-mr does 
not use RLE encoding for boolean values in the V1 case, only bit packing: 
* 
[V1|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L53]
 -> ... -> [Bit 
packing|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPackingValuesWriter.java]
 (encoding written to page header: PLAIN)
* 
[V2|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L57]
 -> ... -> 
[RLE|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridValuesWriter.java]
 (encoding written to page header: RLE)

[~apitrou], could you please confirm that is the same for parquet cpp?

So the table we added in this PR about prepending the length is misleading. 
Also, the link in the PLAIN encoding for boolean is dead and misleading; it 
should point to BIT_PACKED. The definition of BIT_PACKED also wrongly states 
that it is valid only for RL/DL. I think the deprecation is valid, since the 
"BIT_PACKED" encoding should not be written anywhere, but the actual encoding 
is still used under PLAIN for boolean.
Would you guys like to work on this? We probably want to add this to the 
current format release.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730988#comment-17730988
 ] 

Gabor Szadovszky edited comment on PARQUET-2222 at 6/9/23 2:40 PM:
---

[~mwish], -This is specifically about BOOLEAN values (data pages), not rl/dl. 
(In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and 
v2 settings.)-
Sorry, I misread your comment. So parquet-cpp does not write BOOLEAN data 
pages using RLE in any case?


was (Author: gszadovszky):
[~mwish], This is specifically about BOOLEAN values (data pages), not rl/dl. 
(In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and 
v2 settings.)

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730988#comment-17730988
 ] 

Gabor Szadovszky commented on PARQUET-2222:
---

[~mwish], This is specifically about BOOLEAN values (data pages), not rl/dl. 
(In parquet-mr we write rl/dl and dictionary indices using RLE for both v1 and 
v2 settings.)

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-2222
> URL: https://issues.apache.org/jira/browse/PARQUET-2222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-758) [Format] HALF precision FLOAT Logical type

2023-06-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729630#comment-17729630
 ] 

Gabor Szadovszky commented on PARQUET-758:
--

Thanks for your reply, [~anjakefala]!

I mentioned {{bfloat16}} only because of the ease of converting it back and 
forth to the Java/C++ {{float}}, a conversion we will probably need to 
implement for {{IEEE Float16}} as well. But I agree, we should not block the 
format release because of additional discussions about this separate topic.

> [Format] HALF precision FLOAT Logical type
> --
>
> Key: PARQUET-758
> URL: https://issues.apache.org/jira/browse/PARQUET-758
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2317) parquet-format and parquet-format-structures define Util with inconsistent methods provided

2023-06-25 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736990#comment-17736990
 ] 

Gabor Szadovszky commented on PARQUET-2317:
---

[~wgtmac], let me summarize the history of this. parquet-format contains all 
the specification docs and the parquet.thrift file itself, which is a kind of 
source code and spec at the same time. It is good to have all of these 
separated from the implementations. Meanwhile, since the thrift file is there, 
it was natural to have the Thrift code generation and the Util class there as 
well. But it was not a good choice, since we only had the Java code there. For 
some new features we had to extend Util, which is clearly related to 
parquet-mr. So we decided to deprecate all of the Java-related stuff in 
parquet-format and moved it to parquet-format-structures under parquet-mr.
So it would be good to remove not only Util but all the other Java classes, 
including the Thrift-generated ones, from the jar.
The catch is that we still need some mechanism that validates the thrift file 
so we won't introduce invalid changes. Also, the distribution should be 
changed, because providing a jar file without Java classes would not make 
sense. I think we should release a tarball instead that contains all the specs 
and the thrift file as well. Of course, we would need to update parquet-mr 
(and maybe other affected implementations) to download that tarball instead of 
the jar file.

> parquet-format and parquet-format-structures define Util with inconsistent 
> methods provided
> ---
>
> Key: PARQUET-2317
> URL: https://issues.apache.org/jira/browse/PARQUET-2317
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.12.0, 1.13.0
>Reporter: Joey Pereira
>Priority: Major
>
> I have been running into a bug due to {{parquet-format}} and 
> {{parquet-format-structures}} both defining the 
> {{org.apache.parquet.format.Util}} class but doing so inconsistently.
> Examples of this are several methods that include a {{BlockCipher}} 
> parameter and are defined in {{parquet-format-structures}} but not in 
> {{parquet-format}}. When invoking code that happens to use these, such as 
> {{org.apache.parquet.hadoop.ParquetFileReader.readFooter}}, the call will 
> fail if {{parquet-format}} happens to be loaded first on the classpath.
> Here is an example stack trace for a Scala Spark application.
> {code:java}
> Caused by: java.lang.NoSuchMethodError: 
> 'org.apache.parquet.format.FileMetaData 
> org.apache.parquet.format.Util.readFileMetaData(java.io.InputStream, 
> org.apache.parquet.format.BlockCipher$Decryptor, byte[])'
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1441)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1438)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter$NoFilter.accept(ParquetMetadataConverter.java:1173)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1438)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:591)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:536)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:530)
>  ~[parquet_hadoop.jar:1.13.1]
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:478)
>  ~[parquet_hadoop.jar:1.13.1]
> ... (my application code invoking the above)
> {code}
> Because of issues external to Parquet that I have yet to figure out (a 
> complex Spark and dependency setup), my classpaths are not deterministically 
> ordered and I am unable to pin {{parquet-format-structures}} ahead, which is 
> why I'm chiming in about this.
> Even if that weren't the case, this is a fairly prickly edge to run into, as 
> both modules define overlapping classes. {{Util}} is not the only class that 
> appears to be defined by both; it is just the one I have been focusing on due 
> to this bug.
> It appears these methods were introduced in at least 1.12: 
> [https://github.com/apache/parquet-mr/commit/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7#diff-852341c99dcae06c8fa2b764bcf3d9e6860e40442d0ab1cf5b935df80a9cacb7]
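> A hedged debugging aid (plain JDK, no Parquet-specific API) to confirm which 
> jar actually supplied the conflicting class on a given classpath:
> {code:java}
> Class<?> util = Class.forName("org.apache.parquet.format.Util");
> System.out.println(util.getProtectionDomain().getCodeSource().getLocation());
> {code}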



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2318) Implement a tool to list page headers

2023-06-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2318.
---
Resolution: Fixed

> Implement a tool to list page headers
> -
>
> Key: PARQUET-2318
> URL: https://issues.apache.org/jira/browse/PARQUET-2318
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cli
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> We need a tool that lists the page headers in a Parquet file for debugging 
> purposes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2318) Implement a tool to list page headers

2023-06-27 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2318:
-

 Summary: Implement a tool to list page headers
 Key: PARQUET-2318
 URL: https://issues.apache.org/jira/browse/PARQUET-2318
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cli
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


We need a tool that lists the page headers in a Parquet file for debugging 
purposes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

