[jira] [Updated] (PARQUET-1672) [DOC] Broken link to "How To Contribute" section in Parquet-MR project

2019-12-11 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1672:
--
Fix Version/s: format-2.8.0

> [DOC] Broken link to "How To Contribute" section in Parquet-MR project
> --
>
> Key: PARQUET-1672
> URL: https://issues.apache.org/jira/browse/PARQUET-1672
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Tarek Allam
>Assignee: Tarek Allam
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: format-2.8.0
>
>
> The link to the "How To Contribute" section in the Parquet-MR project returns a 404 error 
> because the URL is expanded incorrectly.
> A small change to point it at [https://github.com/apache/parquet-mr#how-to-contribute] 
> should correct this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1703) Update API compatibility check

2019-12-12 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1703:
-

Assignee: Gabor Szadovszky

> Update API compatibility check
> --
>
> Key: PARQUET-1703
> URL: https://issues.apache.org/jira/browse/PARQUET-1703
> Project: Parquet
>  Issue Type: Task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> The current API compatibility check compares the current version of 
> parquet-mr to the 1.7.0 release. This is not correct because several changes 
> have been made to the public API since then that are not verified. Also, many 
> packages that are part of the public API are excluded from the check. The 
> semver plugin is also out of date and no longer maintained. The following 
> tasks are to be done:
> * Find a good tool to check API compatibility (a toy sketch of such a check 
> is given below)
> * Always compare to the previous minor release on master (e.g. 1.11.0 before 
> releasing 1.12.0)
> * Exclude only packages/classes that are clearly not used by our 
> clients
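
For illustration only, the toy sketch below shows the kind of comparison such a check performs: it loads two parquet-mr jars (the paths and class name are placeholders) and reports public methods that disappeared between them. It is a sketch of the idea, not the tooling that was eventually adopted for the build.

{code:java}
import java.io.File;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/**
 * Toy semver-style check: every public method of the old jar must still exist in the new jar.
 * Usage (jar paths are placeholders):
 *   java ApiDiff parquet-hadoop-1.11.0.jar parquet-hadoop-1.12.0-SNAPSHOT.jar
 */
public class ApiDiff {

  public static void main(String[] args) throws Exception {
    Set<String> oldApi = publicMethodSignatures(args[0]);
    Set<String> newApi = publicMethodSignatures(args[1]);
    oldApi.removeAll(newApi);
    oldApi.forEach(signature -> System.out.println("BREAKING: removed " + signature));
  }

  static Set<String> publicMethodSignatures(String jarPath) throws Exception {
    Set<String> signatures = new TreeSet<>();
    try (JarFile jar = new JarFile(jarPath);
         URLClassLoader loader = new URLClassLoader(new URL[] {new File(jarPath).toURI().toURL()})) {
      for (Enumeration<JarEntry> entries = jar.entries(); entries.hasMoreElements(); ) {
        String entry = entries.nextElement().getName();
        // Naive exclusion rule: only skip classes under "internal" packages, nothing else.
        if (!entry.endsWith(".class") || entry.contains("/internal/")) {
          continue;
        }
        String className = entry.replace('/', '.').substring(0, entry.length() - ".class".length());
        try {
          Class<?> cls = loader.loadClass(className);
          if (!Modifier.isPublic(cls.getModifiers())) {
            continue;
          }
          for (Method method : cls.getDeclaredMethods()) {
            if (Modifier.isPublic(method.getModifiers())) {
              signatures.add(method.toString());
            }
          }
        } catch (Throwable missingDependency) {
          // In this toy setup the jar's dependencies are not on the classpath, so some
          // classes cannot be inspected; a real tool works on bytecode and has no such limit.
        }
      }
    }
    return signatures;
  }
}
{code}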



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1622) Adding an encoding for FP data

2019-12-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1622:
--
Issue Type: New Feature  (was: Wish)

> Adding an encoding for FP data
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data, and the 
> available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases there is also an improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1622) Adding an encoding for FP data

2019-12-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1622:
--
Fix Version/s: format-2.8.0

> Adding an encoding for FP data
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data, and the 
> available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases there is also an improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2019-12-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1622:
--
Summary: Add BYTE_STREAM_SPLIT encoding  (was: Adding an encoding for FP 
data)

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data, and the 
> available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases there is also an improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
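
To make the transformation concrete, here is a minimal Java sketch of the splitting and joining steps for 32-bit floats (K = 4). It only illustrates the idea described above; it is not the parquet-mr implementation of the BYTE_STREAM_SPLIT encoding. The point of the split is that similar bytes (e.g. the exponent bytes) end up adjacent in a stream, which general-purpose compressors handle much better.

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Minimal sketch of "byte stream splitting" for floats: K = 4 streams of length N. */
public class ByteStreamSplitSketch {

  /** Scatter the bytes of each float into 4 separate streams. */
  static byte[][] split(float[] values) {
    byte[][] streams = new byte[4][values.length];
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < values.length; i++) {
      buf.clear();
      buf.putFloat(values[i]);
      for (int k = 0; k < 4; k++) {
        streams[k][i] = buf.get(k);   // byte k of value i goes into stream k
      }
    }
    return streams;
  }

  /** Inverse transformation: gather the i-th byte of every stream back into a float. */
  static float[] join(byte[][] streams) {
    int n = streams[0].length;
    float[] values = new float[n];
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < n; i++) {
      buf.clear();
      for (int k = 0; k < 4; k++) {
        buf.put(streams[k][i]);
      }
      values[i] = buf.getFloat(0);
    }
    return values;
  }

  public static void main(String[] args) {
    float[] values = {1.0f, 1.5f, 2.25f};
    float[] roundTrip = join(split(values));
    System.out.println(java.util.Arrays.toString(roundTrip)); // [1.0, 1.5, 2.25]
  }
}
{code}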



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1703) Update API compatibility check

2020-01-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1703.
---
Resolution: Fixed

> Update API compatibility check
> --
>
> Key: PARQUET-1703
> URL: https://issues.apache.org/jira/browse/PARQUET-1703
> Project: Parquet
>  Issue Type: Task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> The current API compatibility check compares the current version of 
> parquet-mr to the 1.7.0 release. This is not correct because several changes 
> have been made to the public API since then that are not verified. Also, many 
> packages that are part of the public API are excluded from the check. The 
> semver plugin is also out of date and no longer maintained. The following 
> tasks are to be done:
> * Find a good tool to check API compatibility
> * Always compare to the previous minor release on master (e.g. 1.11.0 before 
> releasing 1.12.0)
> * Exclude only packages/classes that are clearly not used by our 
> clients



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1739) Make Spark SQL support Column indexes

2020-01-08 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1739:
--
Fix Version/s: 1.11.1

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 1.11.1
>
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1744) Some filters throw ArrayIndexOutOfBoundsException

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1744:
-

Assignee: Gabor Szadovszky

> Some filters throw ArrayIndexOutOfBoundsException
> --
>
> Key: PARQUET-1744
> URL: https://issues.apache.org/jira/browse/PARQUET-1744
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Gabor Szadovszky
>Priority: Major
>
> How to reproduce:
> * Build Spark
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1744
> git checkout PARQUET-1744
> build/sbt  package
> bin/spark-shell
> {code}
> * Prepare data:
> {code:scala}
> spark.sql("create table t1(a int, b int, c int) using parquet")
> spark.sql("insert into t1 values(1,0,0)")
> spark.sql("insert into t1 values(2,0,1)")
> spark.sql("insert into t1 values(3,1,0)")
> spark.sql("insert into t1 values(4,1,1)")
> spark.sql("insert into t1 values(5,null,0)")
> spark.sql("insert into t1 values(6,null,1)")
> spark.sql("insert into t1 values(7,null,null)")
> {code}
> * Run test 1
> {code:scala}
> scala> spark.sql("select a+120 from t1 where b<10 OR c=1").show
> java.lang.reflect.InvocationTargetException
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:155)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:486)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:339)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds 
> for length 0
>   at 
> org.apache.parquet.internal.column.columnindex.IntColumnIndexBuilder$IntColumnIndex$1.compareValueToMin(IntColumnIndexBuilder.java:74)
>   at 
> org.apache.parquet.internal.column.columnindex.BoundaryOrder$2.lt(BoundaryOrder.java:123)
>   at 
> org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder$ColumnIndexBase.visit(ColumnIndexBuilder.java:2

[jira] [Commented] (PARQUET-1744) Some filters throw ArrayIndexOutOfBoundsException

2020-01-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011792#comment-17011792
 ] 

Gabor Szadovszky commented on PARQUET-1744:
---

Thanks for creating this issue.
The problem is that ColumnIndex does not handle the case properly when all the 
pages are null pages and the boundary order is ASCENDING/DESCENDING. Let me fix 
this.
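
To illustrate the failure mode in isolation, here is a small standalone sketch; the class and method names are made up for this note, and it is not the actual ColumnIndexBuilder/BoundaryOrder code. When every page is a null page there are no page minimums recorded at all, so an ascending-order binary search hands back -1 as a "page index" and any unguarded array access fails exactly like the stack trace above.

{code:java}
import java.util.Arrays;

/** Simplified illustration only: all-null pages leave nothing to binary-search. */
public class NullPagesSketch {

  /** Index of the last page whose minimum is still below value, assuming ascending minimums. */
  static int lastPageBelow(int[] pageMinimums, int value) {
    int idx = Arrays.binarySearch(pageMinimums, value);
    int insertionPoint = idx >= 0 ? idx : -idx - 1;
    return insertionPoint - 1;   // -1 when no page qualifies, e.g. when there are no pages at all
  }

  public static void main(String[] args) {
    int[] noMinimums = new int[0];               // every page is a null page: no min values exist
    int page = lastPageBelow(noMinimums, 10);    // -1
    // Without an explicit "all pages are null" check this is where things blow up:
    System.out.println(noMinimums[page]);        // ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
  }
}
{code}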

> Some filters throw ArrayIndexOutOfBoundsException
> --
>
> Key: PARQUET-1744
> URL: https://issues.apache.org/jira/browse/PARQUET-1744
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Gabor Szadovszky
>Priority: Major
>
> How to reproduce:
> * Build Spark
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1744
> git checkout PARQUET-1744
> build/sbt  package
> bin/spark-shell
> {code}
> * Prepare data:
> {code:scala}
> spark.sql("create table t1(a int, b int, c int) using parquet")
> spark.sql("insert into t1 values(1,0,0)")
> spark.sql("insert into t1 values(2,0,1)")
> spark.sql("insert into t1 values(3,1,0)")
> spark.sql("insert into t1 values(4,1,1)")
> spark.sql("insert into t1 values(5,null,0)")
> spark.sql("insert into t1 values(6,null,1)")
> spark.sql("insert into t1 values(7,null,null)")
> {code}
> * Run test 1
> {code:scala}
> scala> spark.sql("select a+120 from t1 where b<10 OR c=1").show
> java.lang.reflect.InvocationTargetException
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:155)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:486)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:339)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds 
> for length 0
>   at 
> org.apache.parquet.internal.column.columnindex.IntColumnIndexBuilder$IntColumnIndex$1.compareValueToMin(IntColumnIndexBuilder.java:74)
>   at 

[jira] [Updated] (PARQUET-1744) Some filters throw ArrayIndexOutOfBoundsException

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1744:
--
Fix Version/s: 1.11.1

> Some filters throw ArrayIndexOutOfBoundsException
> --
>
> Key: PARQUET-1744
> URL: https://issues.apache.org/jira/browse/PARQUET-1744
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.1
>
>
> How to reproduce:
> * Build Spark
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1744
> git checkout PARQUET-1744
> build/sbt  package
> bin/spark-shell
> {code}
> * Prepare data:
> {code:scala}
> spark.sql("create table t1(a int, b int, c int) using parquet")
> spark.sql("insert into t1 values(1,0,0)")
> spark.sql("insert into t1 values(2,0,1)")
> spark.sql("insert into t1 values(3,1,0)")
> spark.sql("insert into t1 values(4,1,1)")
> spark.sql("insert into t1 values(5,null,0)")
> spark.sql("insert into t1 values(6,null,1)")
> spark.sql("insert into t1 values(7,null,null)")
> {code}
> * Run test 1
> {code:scala}
> scala> spark.sql("select a+120 from t1 where b<10 OR c=1").show
> java.lang.reflect.InvocationTargetException
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:155)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:486)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:339)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds 
> for length 0
>   at 
> org.apache.parquet.internal.column.columnindex.IntColumnIndexBuilder$IntColumnIndex$1.compareValueToMin(IntColumnIndexBuilder.java:74)
>   at 
> org.apache.parquet.internal.column.columnindex.BoundaryOrder$2.lt(BoundaryOrder.java:123)
>   at 
> org.apache.parquet.internal.column.columnindex.ColumnIndexBuilder$ColumnIndexBase.visit(Colum

[jira] [Updated] (PARQUET-1740) Make ParquetFileReader.getFilteredRecordCount public

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1740:
--
Fix Version/s: 1.11.1

> Make ParquetFileReader.getFilteredRecordCount public
> 
>
> Key: PARQUET-1740
> URL: https://issues.apache.org/jira/browse/PARQUET-1740
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.1
>
>
> Please see  
> [https://github.com/apache/spark/pull/26804/commits/4756e67dddbbf891c445efb78b202706e133cb46]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1740) Make ParquetFileReader.getFilteredRecordCount public

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1740.
---
Resolution: Fixed

> Make ParquetFileReader.getFilteredRecordCount public
> 
>
> Key: PARQUET-1740
> URL: https://issues.apache.org/jira/browse/PARQUET-1740
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.1
>
>
> Please see  
> [https://github.com/apache/spark/pull/26804/commits/4756e67dddbbf891c445efb78b202706e133cb46]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1740) Make ParquetFileReader.getFilteredRecordCount public

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1740:
-

Assignee: Yuming Wang

> Make ParquetFileReader.getFilteredRecordCount public
> 
>
> Key: PARQUET-1740
> URL: https://issues.apache.org/jira/browse/PARQUET-1740
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.1
>
>
> Please see  
> [https://github.com/apache/spark/pull/26804/commits/4756e67dddbbf891c445efb78b202706e133cb46]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1745) No result for partition key included in Parquet file

2020-01-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011868#comment-17011868
 ] 

Gabor Szadovszky commented on PARQUET-1745:
---

Unfortunately, I don't understand what exactly is missing from the parquet 
file. Do you store the partition key in the parquet file's key-value metadata?
Usually, partitioning logic is implemented above the parquet file format and 
has nothing to do with it. Could you debug/explain the issue in more detail 
from the parquet point of view?

> No result for partition key included in Parquet file
> 
>
> Key: PARQUET-1745
> URL: https://issues.apache.org/jira/browse/PARQUET-1745
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1745
> git checkout PARQUET-1745
> build/sbt "sql/test-only *ParquetV2PartitionDiscoverySuite"
> {code}
> output:
> {noformat}
> [info] - read partitioned table - partition key included in Parquet file *** 
> FAILED *** (1 second, 57 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [*]
> [info]   +- 'Filter ('pi = 1)
> [info]  +- 'UnresolvedRelation [t]
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   intField: int, stringField: string, pi: int, ps: string
> [info]   Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- Filter (pi#1790 = 1)
> [info]  +- SubqueryAlias `t`
> [info] +- RelationV2[intField#1788, stringField#1789, pi#1790, 
> ps#1791] parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]   +- RelationV2[intField#1788, stringField#1789, pi#1790, ps#1791] 
> parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Physical Plan ==
> [info]   *(1) Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- *(1) Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]  +- *(1) ColumnarToRow
> [info] +- BatchScan[intField#1788, stringField#1789, pi#1790, 
> ps#1791] ParquetScan Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f...,
>  ReadSchema: struct, PushedFilters: 
> [IsNotNull(pi), EqualTo(pi,1)]
> [info]
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 20 ==   == Spark Answer - 0 ==
> [info]struct<>struct<>
> [info]   ![1,1,1,bar]
> [info]   ![1,1,1,foo]
> [info]   ![10,10,1,bar]
> [info]   ![10,10,1,foo]
> [info]   ![2,2,1,bar]
> [info]   ![2,2,1,foo]
> [info]   ![3,3,1,bar]
> [info]   ![3,3,1,foo]
> [info]   ![4,4,1,bar]
> [info]   ![4,4,1,foo]
> [info]   ![5,5,1,bar]
> [info]   ![5,5,1,foo]
> [info]   ![6,6,1,bar]
> [info]   ![6,6,1,foo]
> [info]   ![7,7,1,bar]
> [info]   ![7,7,1,foo]
> [info]   ![8,8,1,bar]
> [info]   ![8,8,1,foo]
> [info]   ![9,9,1,bar]
> [info]   ![9,9,1,foo] (QueryTest.scala:248)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.apache.spark.sql.QueryTest$.newAssertionFailedException(QueryTest.scala:238)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:1091)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
> [info]   at org.apache.spark.sql.QueryTest$.fail(QueryTest.scala:238)
> [info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:248)
> [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetV2PartitionDiscoverySuite.$anonfun$new$194(ParquetPartitionDiscoverySuite.scala:1232)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>

[jira] [Commented] (PARQUET-1746) Changed the data order after DataFrame reuse

2020-01-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011872#comment-17011872
 ] 

Gabor Szadovszky commented on PARQUET-1746:
---

What exactly is reordered here? If it is a list in the parquet schema then the 
order shall not change, and that would indeed be a serious issue. However, I 
cannot see how it could happen. Could you explain in more detail from the 
parquet point of view?

> Changed the data order after DataFrame reuse
> 
>
> Key: PARQUET-1746
> URL: https://issues.apache.org/jira/browse/PARQUET-1746
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1746
> git checkout PARQUET-1746
> build/sbt "sql/test-only *StreamSuite"
> {code}
> output:
> {noformat}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Decoded objects do not match expected objects:
> expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> actual:   WrappedArray(0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 2)
> assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long"))
> +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long")
>+- getcolumnbyordinal(0, LongType)
>  
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions.fail(Assertions.scala:1091)
>   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
>   at org.apache.spark.sql.QueryTest.checkDataset(QueryTest.scala:73)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22(StreamSuite.scala:215)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22$adapted(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21$adapted(StreamSuite.scala:207)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.assertDF$1(StreamSuite.scala:207)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$25(StreamSuite.scala:226)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:52)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:36)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:231)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:229)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withSQLConf(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24(StreamSuite.scala:225)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24$adapted(StreamSuite.scala:224)
>   at scala.collection.immutable.List.foreach(List.sc

[jira] [Resolved] (PARQUET-1744) Some filters throw ArrayIndexOutOfBoundsException

2020-01-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1744.
---
Resolution: Fixed

> Some filters throw ArrayIndexOutOfBoundsException
> --
>
> Key: PARQUET-1744
> URL: https://issues.apache.org/jira/browse/PARQUET-1744
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.1
>
>
> How to reproduce:
> * Build Spark
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1744
> git checkout PARQUET-1744
> build/sbt  package
> bin/spark-shell
> {code}
> * Prepare data:
> {code:scala}
> spark.sql("create table t1(a int, b int, c int) using parquet")
> spark.sql("insert into t1 values(1,0,0)")
> spark.sql("insert into t1 values(2,0,1)")
> spark.sql("insert into t1 values(3,1,0)")
> spark.sql("insert into t1 values(4,1,1)")
> spark.sql("insert into t1 values(5,null,0)")
> spark.sql("insert into t1 values(6,null,1)")
> spark.sql("insert into t1 values(7,null,null)")
> {code}
> * Run test 1
> {code:scala}
> scala> spark.sql("select a+120 from t1 where b<10 OR c=1").show
> java.lang.reflect.InvocationTargetException
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:155)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:131)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:319)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:486)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:339)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds 
> for length 0
>   at 
> org.apache.parquet.internal.column.columnindex.IntColumnIndexBuilder$IntColumnIndex$1.compareValueToMin(IntColumnIndexBuilder.java:74)
>   at 
> org.apache.parquet.internal.column.columnindex.BoundaryOrder$2.lt(BoundaryOrder.java:123)
>   at 
> org.apache.parquet.internal.column.columnindex.Co

[jira] [Commented] (PARQUET-1745) No result for partition key included in Parquet file

2020-01-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014309#comment-17014309
 ] 

Gabor Szadovszky commented on PARQUET-1745:
---

The problem here is that Spark sets a projection to the columns {{intField}} and 
{{stringField}}. Because of this projection the column index filter handles the 
columns {{pi}} and {{ps}} as if they were not saved in the parquet file, i.e. as 
if all of their values were {{null}}. From this point of view the empty result 
retrieved is correct.
I think, even if it looks like a regression, it is a bug in Spark: the 
projection shall always contain the columns referenced by the filter for the 
filtering to work correctly.
To reach a consensus on this topic I've started a thread on the [parquet dev 
list|https://lists.apache.org/thread.html/r57d710b5ae674da658e6dc5ab019b31ba4e8d84af2e7b992e7409a14%40%3Cdev.parquet.apache.org%3E].
 Feel free to comment in the thread.
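
For readers hitting this from plain parquet-mr, here is a minimal sketch of what "the projection shall contain the filter columns" means in practice. It uses the example Group API; the file path and the projection schema string are placeholders and have to match the actual file schema, so treat it as an illustration rather than a drop-in snippet.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

public class ProjectionWithFilterColumns {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The projection must keep the filtered column (pi) even if the caller only wants
    // intField and stringField back; otherwise the column-index filter treats pi as an
    // all-null column and every page gets pruned, yielding an empty result.
    conf.set(ReadSupport.PARQUET_READ_SCHEMA,
        "message projection { optional int32 intField; optional binary stringField; optional int32 pi; }");

    try (ParquetReader<Group> reader = ParquetReader
        .builder(new GroupReadSupport(), new Path(args[0]))
        .withConf(conf)
        .withFilter(FilterCompat.get(eq(intColumn("pi"), 1)))
        .build()) {
      for (Group record = reader.read(); record != null; record = reader.read()) {
        System.out.println(record);
      }
    }
  }
}
{code}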
 

> No result for partition key included in Parquet file
> 
>
> Key: PARQUET-1745
> URL: https://issues.apache.org/jira/browse/PARQUET-1745
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: FilterByColumnIndex.png
>
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1745
> git checkout PARQUET-1745
> build/sbt "sql/test-only *ParquetV2PartitionDiscoverySuite"
> {code}
> output:
> {noformat}
> [info] - read partitioned table - partition key included in Parquet file *** 
> FAILED *** (1 second, 57 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [*]
> [info]   +- 'Filter ('pi = 1)
> [info]  +- 'UnresolvedRelation [t]
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   intField: int, stringField: string, pi: int, ps: string
> [info]   Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- Filter (pi#1790 = 1)
> [info]  +- SubqueryAlias `t`
> [info] +- RelationV2[intField#1788, stringField#1789, pi#1790, 
> ps#1791] parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]   +- RelationV2[intField#1788, stringField#1789, pi#1790, ps#1791] 
> parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Physical Plan ==
> [info]   *(1) Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- *(1) Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]  +- *(1) ColumnarToRow
> [info] +- BatchScan[intField#1788, stringField#1789, pi#1790, 
> ps#1791] ParquetScan Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f...,
>  ReadSchema: struct, PushedFilters: 
> [IsNotNull(pi), EqualTo(pi,1)]
> [info]
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 20 ==   == Spark Answer - 0 ==
> [info]struct<>struct<>
> [info]   ![1,1,1,bar]
> [info]   ![1,1,1,foo]
> [info]   ![10,10,1,bar]
> [info]   ![10,10,1,foo]
> [info]   ![2,2,1,bar]
> [info]   ![2,2,1,foo]
> [info]   ![3,3,1,bar]
> [info]   ![3,3,1,foo]
> [info]   ![4,4,1,bar]
> [info]   ![4,4,1,foo]
> [info]   ![5,5,1,bar]
> [info]   ![5,5,1,foo]
> [info]   ![6,6,1,bar]
> [info]   ![6,6,1,foo]
> [info]   ![7,7,1,bar]
> [info]   ![7,7,1,foo]
> [info]   ![8,8,1,bar]
> [info]   ![8,8,1,foo]
> [info]   ![9,9,1,bar]
> [info]   ![9,9,1,foo] (QueryTest.scala:248)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.apache.spark.sql.QueryTest$.newAssertionFailedException(QueryTest.scala:238)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:1091)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
> [info]   at org.apache.spark.sql.QueryTest$.fail(QueryTest.scala:238)
> [info]   at org.apa

[jira] [Created] (PARQUET-1765) Invalid filteredRowCount in InternalParquetRecordReader

2020-01-13 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1765:
-

 Summary: Invalid filteredRowCount in InternalParquetRecordReader
 Key: PARQUET-1765
 URL: https://issues.apache.org/jira/browse/PARQUET-1765
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky
 Fix For: 1.11.1


The [record 
count|https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L185]
 is retrieved before setting the [projection 
schema|https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L188]
 so the value might be invalid if the projection impacts the filter.

In normal cases it does not cause any issue because the record filter still 
filters correctly; the only difference is that the records are filtered one-by-one instead of 
dropping the related pages.
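
At the ParquetFileReader level the correct ordering looks roughly like the sketch below. It is a minimal example assuming a parquet-mr version where PARQUET-1740 has made getFilteredRecordCount public; the file path, projection schema, and filter are placeholders. The point is simply that the projection is applied before the filtered row count is requested.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.HadoopReadOptions;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

public class FilteredRowCountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ParquetReadOptions options = HadoopReadOptions.builder(conf)
        .withRecordFilter(FilterCompat.get(eq(intColumn("pi"), 1)))
        .build();
    MessageType projection = MessageTypeParser.parseMessageType(
        "message projection { optional int32 intField; optional int32 pi; }");

    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(new Path(args[0]), conf), options)) {
      // Apply the projection *before* asking for the filtered row count; doing it the
      // other way around is exactly the ordering problem described in this issue.
      reader.setRequestedSchema(projection);
      long filteredRows = reader.getFilteredRecordCount();
      System.out.println("rows surviving column-index filtering: " + filteredRows);
    }
  }
}
{code}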



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1746) Changed the data order after DataFrame reuse

2020-01-15 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015736#comment-17015736
 ] 

Gabor Szadovszky commented on PARQUET-1746:
---

For me the issue is reproducible with the current parquet-mr master 
(1.12.0-SNAPSHOT).

> Changed the data order after DataFrame reuse
> 
>
> Key: PARQUET-1746
> URL: https://issues.apache.org/jira/browse/PARQUET-1746
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1746
> git checkout PARQUET-1746
> build/sbt "sql/test-only *StreamSuite"
> {code}
> output:
> {noformat}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Decoded objects do not match expected objects:
> expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> actual:   WrappedArray(0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 2)
> assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long"))
> +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long")
>+- getcolumnbyordinal(0, LongType)
>  
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions.fail(Assertions.scala:1091)
>   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
>   at org.apache.spark.sql.QueryTest.checkDataset(QueryTest.scala:73)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22(StreamSuite.scala:215)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22$adapted(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21$adapted(StreamSuite.scala:207)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.assertDF$1(StreamSuite.scala:207)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$25(StreamSuite.scala:226)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:52)
>   at 
> org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:36)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:231)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:229)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withSQLConf(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24(StreamSuite.scala:225)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$24$adapted(StreamSuite.scala:224)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$20(StreamSuite.scala:224)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunct

[jira] [Resolved] (PARQUET-1765) Invalid filteredRowCount in InternalParquetRecordReader

2020-01-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1765.
---
Resolution: Fixed

> Invalid filteredRowCount in InternalParquetRecordReader
> ---
>
> Key: PARQUET-1765
> URL: https://issues.apache.org/jira/browse/PARQUET-1765
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.1
>
>
> The [record 
> count|https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L185]
>  is retrieved before setting the [projection 
> schema|https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L188]
>  so the value might be invalid if the projection impacts the filter.
> In normal cases it does not cause any issue because the record filter still 
> filters correctly; the only difference is that the records are filtered 
> one-by-one instead of dropping the related pages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1746) Changed the data order after DataFrame reuse

2020-01-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1746.
---
Resolution: Not A Problem

The related Spark test generates 22 parquet files. The first 11 are empty, 
meaning they contain no data. (I am not sure if they are even valid this way.)

The last 11 contain only one value each:
{noformat}
$> ls *.parquet| while read file; do echo "$file"; parquet-tools cat $file 
2>/dev/null; done
part-0-19f5b358-410b-4dd4-b167-4016984ac6ef-c000.snappy.parquet
part-0-212d052b-d03a-413b-98f3-1348c2d06855-c000.snappy.parquet
part-0-311f4442-4225-47f1-aaf1-c7a8e38a875f-c000.snappy.parquet
part-0-459612f9-d564-43a9-bf31-2d174c996fa6-c000.snappy.parquet
part-0-5e20cfa6-a5d0-4d5f-a382-741907a74874-c000.snappy.parquet
part-0-62881d28-7226-4a78-9fe7-2ed41b895e1c-c000.snappy.parquet
part-0-9aaa784f-080a-43ae-9296-20bd033aa300-c000.snappy.parquet
part-0-a01e81ab-a987-4929-991d-60f01acab1ca-c000.snappy.parquet
part-0-add0de8e-26eb-406b-bf02-702924f89f1a-c000.snappy.parquet
part-0-e8dd315d-b97e-4257-917c-34696d0a866c-c000.snappy.parquet
part-0-ed8be0d2-508f-4666-b66f-93182413472e-c000.snappy.parquet
part-1-20b63b66-8f9a-4e3b-893c-4acb106ddac1-c000.snappy.parquet
a = 7

part-1-227ff83d-5341-48be-97be-00cde92cb303-c000.snappy.parquet
a = 1

part-1-38e186bb-ca67-4e3d-87fe-780585f25c84-c000.snappy.parquet
a = 0

part-1-3b06880b-6d57-49d7-bb63-4220092ef1ae-c000.snappy.parquet
a = 4

part-1-449026a6-f486-4fca-81fa-b7cdeaddfa3b-c000.snappy.parquet
a = 5

part-1-567ed849-b1e9-494f-b33f-495592826b28-c000.snappy.parquet
a = 2

part-1-70fa8c7e-9b45-4103-a99e-5b0f61b6062a-c000.snappy.parquet
a = 10

part-1-7399d477-c393-481b-b76f-1289deb72bc0-c000.snappy.parquet
a = 3

part-1-93678ef9-27d4-4a5d-aaa1-58492de248e7-c000.snappy.parquet
a = 6

part-1-c1b934d8-0058-40e0-87f9-40ee7eca52ed-c000.snappy.parquet
a = 8

part-1-c599dd4d-32c8-4032-935a-b1d45bc508e1-c000.snappy.parquet
a = 9
{noformat}

Therefore, the parquet-mr library has nothing to do with the ordering of these 
values.

> Changed the data order after DataFrame reuse
> 
>
> Key: PARQUET-1746
> URL: https://issues.apache.org/jira/browse/PARQUET-1746
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1746
> git checkout PARQUET-1746
> build/sbt "sql/test-only *StreamSuite"
> {code}
> output:
> {noformat}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Decoded objects do not match expected objects:
> expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> actual:   WrappedArray(0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 2)
> assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long"))
> +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: 
> "scala.Long")
>+- getcolumnbyordinal(0, LongType)
>  
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions.fail(Assertions.scala:1091)
>   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
>   at org.apache.spark.sql.QueryTest.checkDataset(QueryTest.scala:73)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22(StreamSuite.scala:215)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$22$adapted(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:76)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:75)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:75)
>   at 
> org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:74)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.withTempDir(StreamSuite.scala:51)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21(StreamSuite.scala:208)
>   at 
> org.apache.spark.sql.streaming.StreamSuite.$anonfun$new$21$adapted(StreamSuite.scala:207)
>   at 
> org.apache.spark.sql.test.SQLTestUt

[jira] [Resolved] (PARQUET-1745) No result for partition key included in Parquet file

2020-01-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1745.
---
Resolution: Not A Bug

Closing this issue as "Not a Bug". See my previous comment and the referenced 
mail thread.

> No result for partition key included in Parquet file
> 
>
> Key: PARQUET-1745
> URL: https://issues.apache.org/jira/browse/PARQUET-1745
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: FilterByColumnIndex.png
>
>
> How to reproduce:
> {code:sh}
> git clone https://github.com/apache/spark.git && cd spark
> git fetch origin pull/26804/head:PARQUET-1745
> git checkout PARQUET-1745
> build/sbt "sql/test-only *ParquetV2PartitionDiscoverySuite"
> {code}
> output:
> {noformat}
> [info] - read partitioned table - partition key included in Parquet file *** 
> FAILED *** (1 second, 57 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [*]
> [info]   +- 'Filter ('pi = 1)
> [info]  +- 'UnresolvedRelation [t]
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   intField: int, stringField: string, pi: int, ps: string
> [info]   Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- Filter (pi#1790 = 1)
> [info]  +- SubqueryAlias `t`
> [info] +- RelationV2[intField#1788, stringField#1789, pi#1790, 
> ps#1791] parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]   +- RelationV2[intField#1788, stringField#1789, pi#1790, ps#1791] 
> parquet 
> file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f48be3b74a
> [info]
> [info]   == Physical Plan ==
> [info]   *(1) Project [intField#1788, stringField#1789, pi#1790, ps#1791]
> [info]   +- *(1) Filter (isnotnull(pi#1790) AND (pi#1790 = 1))
> [info]  +- *(1) ColumnarToRow
> [info] +- BatchScan[intField#1788, stringField#1789, pi#1790, 
> ps#1791] ParquetScan Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/target/tmp/spark-c7e85130-3e1f-4137-ac7c-32f...,
>  ReadSchema: struct, PushedFilters: 
> [IsNotNull(pi), EqualTo(pi,1)]
> [info]
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 20 ==   == Spark Answer - 0 ==
> [info]struct<>struct<>
> [info]   ![1,1,1,bar]
> [info]   ![1,1,1,foo]
> [info]   ![10,10,1,bar]
> [info]   ![10,10,1,foo]
> [info]   ![2,2,1,bar]
> [info]   ![2,2,1,foo]
> [info]   ![3,3,1,bar]
> [info]   ![3,3,1,foo]
> [info]   ![4,4,1,bar]
> [info]   ![4,4,1,foo]
> [info]   ![5,5,1,bar]
> [info]   ![5,5,1,foo]
> [info]   ![6,6,1,bar]
> [info]   ![6,6,1,foo]
> [info]   ![7,7,1,bar]
> [info]   ![7,7,1,foo]
> [info]   ![8,8,1,bar]
> [info]   ![8,8,1,foo]
> [info]   ![9,9,1,bar]
> [info]   ![9,9,1,foo] (QueryTest.scala:248)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.apache.spark.sql.QueryTest$.newAssertionFailedException(QueryTest.scala:238)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:1091)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:1087)
> [info]   at org.apache.spark.sql.QueryTest$.fail(QueryTest.scala:238)
> [info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:248)
> [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetV2PartitionDiscoverySuite.$anonfun$new$194(ParquetPartitionDiscoverySuite.scala:1232)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withTempView(SQLTestUtils.scala:260)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtilsBase.withTempView$(SQLTestUtils.scala:258)
> [info]   at 
> org.apache.spark.sql.

[jira] [Created] (PARQUET-1774) Release parquet 1.11.1

2020-01-22 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1774:
-

 Summary: Release parquet 1.11.1
 Key: PARQUET-1774
 URL: https://issues.apache.org/jira/browse/PARQUET-1774
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky
 Fix For: 1.11.1


Some issues were discovered during the migration to the parquet-mr release 
1.11.0 in Spark. These issues are to be fixed and released in the minor release 
1.11.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1784) Column-wise configuration

2020-02-05 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1784:
-

 Summary: Column-wise configuration
 Key: PARQUET-1784
 URL: https://issues.apache.org/jira/browse/PARQUET-1784
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}{column path or column index in the projection}
For example: {{parquet.enable.dictionary#column.path.col_1}} or 
{{parquet.enable.dictionary#3}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.
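
A minimal sketch of how such column-wise keys might be set through the usual 
Hadoop {{Configuration}}, assuming the suffix syntax above; which keys actually 
honour the suffix depends on the implementation:
{code:java}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Global default: dictionary encoding stays enabled for every column.
conf.setBoolean("parquet.enable.dictionary", true);
// Column-wise override: disable dictionary encoding only for this column path.
conf.setBoolean("parquet.enable.dictionary#column.path.col_1", false);
{code}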



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1784) Column-wise configuration

2020-02-05 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1784:
--
Description: 
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}\{column path or column index in the 
projection}
For example: {{parquet.enable.dictionary#column.path.col_1}} or 
{{parquet.enable.dictionary#3}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.

  was:
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}{column path or column index in the projection}
For example: {{parquet.enable.dictionary#column.path.col_1}} or 
{{parquet.enable.dictionary#3}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.


> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path or column index in the 
> projection}
> For example: {{parquet.enable.dictionary#column.path.col_1}} or 
> {{parquet.enable.dictionary#3}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1784) Column-wise configuration

2020-02-05 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1784:
--
Description: 
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}{column path in the file schema}
 For example: {{parquet.enable.dictionary#column.path.col_1}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.

  was:
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}\{column path or column index in the 
projection}
For example: {{parquet.enable.dictionary#column.path.col_1}} or 
{{parquet.enable.dictionary#3}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.


> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1784) Column-wise configuration

2020-02-05 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1784:
--
Description: 
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}\{column path in the file schema}
 For example: {{parquet.enable.dictionary#column.path.col_1}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.

  was:
After adding some new statistics and encodings into Parquet it is getting very 
hard to be smart and choose the best configs automatically. For example for 
which columns should we save column index and/or bloom-filters? Is it worth 
using dictionary for a column that we know will fall back to another encoding?

The idea of this feature is to allow the library user to fine-tune the 
configuration by setting it column-wise. To support this we extend the existing 
configuration keys by a suffix to identify the related column. (From now on we 
introduce new keys following the same syntax.)
 \{key of the configuration}{{#}}{column path in the file schema}
 For example: {{parquet.enable.dictionary#column.path.col_1}}

This jira covers the framework to support the column-wise configuration with 
the implementation of some existing configs where it makes sense (e.g. 
{{parquet.enable.dictionary}}). Implementing new configurations is not part of 
this effort.


> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1784) Column-wise configuration

2020-02-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030657#comment-17030657
 ] 

Gabor Szadovszky commented on PARQUET-1784:
---

Referencing columns by their index would be really hard to implement. In most of 
the code parts where the config is used, the schema is not available. This 
option has been removed from the description.

> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1784) Column-wise configuration

2020-02-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031377#comment-17031377
 ] 

Gabor Szadovszky commented on PARQUET-1784:
---

[~garawalid],

The idea is to use a "root" key for the configuration and add a specific 
{{#column.path}} suffix to the key to set the configuration for the related 
column only. So, to turn bloom filters on for all columns you would use
{code:java}
conf.set("parquet.bloom.filter", true); 
{code}
and to turn it off for a specific column:
{code:java}
conf.set("parquet.bloom.filter#column.path", false);
{code}

> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1787) Expected distinct numbers is not parsed correctly

2020-02-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031399#comment-17031399
 ] 

Gabor Szadovszky edited comment on PARQUET-1787 at 2/6/20 9:26 AM:
---

I'm working on a general concept of allowing configuration to be set for 
specific columns. See PARQUET-1784 for details.
What do you think of having the mentioned configuration as follows?
{code:java}
conf.set("parquet.bloom.filter.enabled", false); // Might not be required as 
this is the default
conf.set("parquet.bloom.filter.enabled#content", true); // Might not be 
necessary as by setting the expected ndv you explicitly sets this one
conf.set("parquet.bloom.filter.enabled#line", true); // Might not be necessary 
as by setting the expected ndv you explicitly sets this one
conf.set("parquet.bloom.filter.expected.ndv#content", 1000);
conf.set("parquet.bloom.filter.expected.ndv#line", 200);
{code}
This might require more writing but more clear and less error prone.


was (Author: gszadovszky):
I'm working on a general concept of allowing configuration to be set for 
specific columns. See PARQUET-1784 for details.
What do you think of having the mentioned configuration as follows?
{code:java}
conf.set("parquet.bloom.filter.enabled", false); // Might not be required as 
this is the default
conf.set("parquet.bloom.filter.enabled#content", true); // Might not be 
necessary as by setting the expected ndv you explicitly sets it
conf.set("parquet.bloom.filter.enabled#line", true); // Might not be necessary 
as by setting the expected ndv you explicitly sets it
conf.set("parquet.bloom.filter.expected.ndv#content", 1000);
conf.set("parquet.bloom.filter.expected.ndv#line", 200);
{code}
This might require more writing but more clear and less error prone.

> Expected distinct numbers is not parsed correctly
> -
>
> Key: PARQUET-1787
> URL: https://issues.apache.org/jira/browse/PARQUET-1787
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Walid Gara
>Priority: Critical
>  Labels: pull-request-available
>
> In the bloom filter feature, when I pass the expected distinct numbers as 
> below, I got null values instead of 1000 and 200.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> Configuration conf = new Configuration();
> conf.set("parquet.bloom.filter.column.names", "content,line"); 
> conf.set("parquet.bloom.filter.expected.ndv","1000,200");
> {code}
>  
>  The issue comes from reading the expected distinct number strings as system 
> properties via 
> [Long.getLong(expectedNDVs[i])|https://github.com/apache/parquet-mr/blob/a737141a571e3cb6cee2c252dc4406e26e6c1177/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L251].
>  
> It's possible to fix it by parsing the string with 
> Long.parseLong(expectedNDVs[i]).
>  
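
For reference, {{Long.getLong}} reads a system property named by its argument, 
while {{Long.parseLong}} parses the string itself; a minimal illustration:
{code:java}
// "1000" is interpreted as a system property *name*, which is normally undefined,
// so getLong returns null instead of the number 1000.
Long viaGetLong = Long.getLong("1000");       // null
// parseLong parses the string itself.
long viaParseLong = Long.parseLong("1000");   // 1000
{code}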



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1787) Expected distinct numbers is not parsed correctly

2020-02-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031399#comment-17031399
 ] 

Gabor Szadovszky commented on PARQUET-1787:
---

I'm working on a general concept of allowing configuration to be set for 
specific columns. See PARQUET-1784 for details.
What do you think of having the mentioned configuration as follows?
{code:java}
conf.set("parquet.bloom.filter.enabled", false); // Might not be required as 
this is the default
conf.set("parquet.bloom.filter.enabled#content", true); // Might not be 
necessary as by setting the expected ndv you explicitly sets it
conf.set("parquet.bloom.filter.enabled#line", true); // Might not be necessary 
as by setting the expected ndv you explicitly sets it
conf.set("parquet.bloom.filter.expected.ndv#content", 1000);
conf.set("parquet.bloom.filter.expected.ndv#line", 200);
{code}
This might require more writing but more clear and less error prone.

> Expected distinct numbers is not parsed correctly
> -
>
> Key: PARQUET-1787
> URL: https://issues.apache.org/jira/browse/PARQUET-1787
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Walid Gara
>Priority: Critical
>  Labels: pull-request-available
>
> In the bloom filter feature, when I pass the expected distinct numbers as 
> below, I got null values instead of 1000 and 200.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> Configuration conf = new Configuration();
> conf.set("parquet.bloom.filter.column.names", "content,line"); 
> conf.set("parquet.bloom.filter.expected.ndv","1000,200");
> {code}
>  
>  The issue comes from reading the expected distinct number strings as system 
> properties via 
> [Long.getLong(expectedNDVs[i])|https://github.com/apache/parquet-mr/blob/a737141a571e3cb6cee2c252dc4406e26e6c1177/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L251].
>  
> It's possible to fix it by parsing the string with 
> Long.parseLong(expectedNDVs[i]).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1784) Column-wise configuration

2020-02-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032196#comment-17032196
 ] 

Gabor Szadovszky commented on PARQUET-1784:
---

[~garawalid],

Thanks for the research and the examples.

If one would like to set some Parquet specific configuration, they need to 
consult the Parquet documentation to know which key is to be used and which 
values are allowed. Therefore, I don't think Parquet needs to follow the 
existing configurations of other components.

What I would like to implement here is a common way of setting the 
configuration for different columns. Let's check the following example. We would 
like to set the encoding of some specific columns while keeping the encoding of 
the other columns selected automatically. We might configure it the following 
way using lists.
{code:java}
conf.setStrings("parquet.encoding.columns", "float_col", "double_col");
conf.setStrings("parquet.encoding", "byte_stream_split", "byte_stream_split");
{code}
Or, we use the pattern described in this jira:
{code:java}
conf.set("parquet.encoding#float_col", "byte_stream_split");
conf.set("parquet.encoding#double_col", "byte_stream_split");
{code}
I think the latter is cleaner and less error prone. Moreover, 
{{org.apache.hadoop.conf.Configuration}} only supports lists of strings, 
while with the suffix approach you can use any value type supported by 
{{org.apache.hadoop.conf.Configuration}} in a clean way.

What do you think? If you have time, would you also like to check my PR?

> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034294#comment-17034294
 ] 

Gabor Szadovszky commented on PARQUET-1792:
---

If you are talking about one file at a time you might be right that it is 10x 
faster than doing it with a query engine. But the tool runs on one node 
while the query engine uses several at the same time, so I am not sure 
about the 10x performance.
Pruning the file makes sense to me to be implemented at the library level because 
you can do it in an efficient way (there is no need to unpack/decode the pages or 
the entire column chunks). Masking the values, on the other hand, requires 
reading the actual values and generating the hashes. You also need to regenerate 
the related statistics.
Therefore, I am not sure this masking feature is properly suited for parquet-mr.
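
As an illustration only (not the proposed tool API): masking by hashing needs 
the decoded value, whereas pruning can copy encoded column chunks untouched.
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MaskExample {
  // Hashing a single decoded value; doing this for a whole column forces the
  // reader to decompress and decode every page of that column chunk.
  static byte[] maskWithSha256(String value) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("SHA-256")
        .digest(value.getBytes(StandardCharsets.UTF_8));
  }
}
{code}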

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, redacted text, etc. The 
> unchanged columns should be moved as a whole, like the 'merge' and 'prune' 
> commands in parquet-tools do. 
>  
> Implementing this feature in the file format is 10x faster than doing it by 
> rewriting the table data in the query engine. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1794) Random data generation may cause flaky tests

2020-02-12 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1794:
-

 Summary: Random data generation may cause flaky tests
 Key: PARQUET-1794
 URL: https://issues.apache.org/jira/browse/PARQUET-1794
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Some code parts use {{BigInteger}} objects to generate {{FIXED_LEN_BYTE_ARRAY}} 
or {{INT96}} values. The problem is that {{BigInteger.toByteArray()}} creates 
the shortest {{byte}} array that can hold the value, so the array might be 
shorter than expected.
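
A minimal sketch of one possible fix for such test-data generators (illustrative 
only, not the actual patch): pad or trim the {{toByteArray()}} result to the 
expected fixed length.
{code:java}
import java.math.BigInteger;
import java.util.Random;

class FixedLenRandomBytes {
  // toByteArray() may be shorter than 'length' (small values) or one byte longer
  // (leading sign byte), so copy it right-aligned into a fixed-size array.
  static byte[] randomFixedLenBytes(Random random, int length) {
    byte[] raw = new BigInteger(length * 8, random).toByteArray();
    byte[] fixed = new byte[length];
    int copy = Math.min(raw.length, length);
    System.arraycopy(raw, raw.length - copy, fixed, length - copy, copy);
    return fixed;
  }
}
{code}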



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2020-02-12 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1622.
---
Fix Version/s: 1.12.0
   Resolution: Fixed

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". Such could be "byte stream splitting" which creates K streams of 
> length N where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1790) ParquetFileWriter missing Api for DataPageV2

2020-02-12 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1790.
---
Resolution: Fixed

> ParquetFileWriter missing Api for  DataPageV2
> -
>
> Key: PARQUET-1790
> URL: https://issues.apache.org/jira/browse/PARQUET-1790
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Brian Mwambazi
>Assignee: Brian Mwambazi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The _ParquetFileWriter_ class currently does not  have an API for writing a 
> DataPageV2 page.  
> A similar API is already defined in _ColumnChunkPageWriteStore_ and 
> inspiration can/should be derived from there in implementing this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1790) ParquetFileWriter missing Api for DataPageV2

2020-02-12 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1790:
-

Assignee: Brian Mwambazi

> ParquetFileWriter missing Api for  DataPageV2
> -
>
> Key: PARQUET-1790
> URL: https://issues.apache.org/jira/browse/PARQUET-1790
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Brian Mwambazi
>Assignee: Brian Mwambazi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The _ParquetFileWriter_ class currently does not  have an API for writing a 
> DataPageV2 page.  
> A similar API is already defined in _ColumnChunkPageWriteStore_ and 
> inspiration can/should be derived from there in implementing this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1796) Bump Apache Avro to 1.9.2

2020-02-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1796.
---
Resolution: Fixed

> Bump Apache Avro to 1.9.2
> -
>
> Key: PARQUET-1796
> URL: https://issues.apache.org/jira/browse/PARQUET-1796
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-avro
>Affects Versions: 1.11.0
>Reporter: Ryan Skraba
>Assignee: Ryan Skraba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-02-17 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038172#comment-17038172
 ] 

Gabor Szadovszky commented on PARQUET-1801:
---

Currently, column indexes are the only special data that belongs neither to the 
row groups/pages nor to the footer. But bloom filters are also on the way, and 
they will be similar to the column indexes. Also, more than one feature requires 
copying existing column chunks and column indexes untouched while rewriting some 
others.
I think it would be a good idea to implement these requirements in a way that 
does not belong to parquet-tools or parquet-cli, so they can be used for 
both the 'prune' and the 'mask' features (and maybe a properly implemented 
merge feature as well). 

> Add column index support for 'prune' command in Parquet-tools/cli
> -
>
> Key: PARQUET-1801
> URL: https://issues.apache.org/jira/browse/PARQUET-1801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1794) Random data generation may cause flaky tests

2020-02-17 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1794.
---
Resolution: Fixed

> Random data generation may cause flaky tests
> 
>
> Key: PARQUET-1794
> URL: https://issues.apache.org/jira/browse/PARQUET-1794
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> Some code parts use {{BigInteger}} objects to generate 
> {{FIXED_LEN_BYTE_ARRAY}} or {{INT96}} values. The problem is that 
> {{BigInteger.toByteArray()}} creates the shortest {{byte}} array that 
> can hold the value, so the array might be shorter than expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1796) Bump Apache Avro to 1.9.2

2020-02-19 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1796:
--
Fix Version/s: 1.11.1

> Bump Apache Avro to 1.9.2
> -
>
> Key: PARQUET-1796
> URL: https://issues.apache.org/jira/browse/PARQUET-1796
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-avro
>Affects Versions: 1.11.0
>Reporter: Ryan Skraba
>Assignee: Ryan Skraba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1774) Release parquet 1.11.1

2020-02-19 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039845#comment-17039845
 ] 

Gabor Szadovszky commented on PARQUET-1774:
---

Waiting for Spark to confirm that 1.11.1-SNAPSHOT works properly.

> Release parquet 1.11.1
> --
>
> Key: PARQUET-1774
> URL: https://issues.apache.org/jira/browse/PARQUET-1774
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.1
>
>
> Some issues were discovered during the migration to the parquet-mr release 
> 1.11.0 in Spark. These issues are to be fixed and released in the minor 
> release 1.11.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1802) CompressionCodec class not found if the codec class is not in the same defining classloader as the CodecFactory class

2020-02-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1802:
-

Assignee: Terence Yim

> CompressionCodec class not found if the codec class is not in the same 
> defining classloader as the CodecFactory class
> -
>
> Key: PARQUET-1802
> URL: https://issues.apache.org/jira/browse/PARQUET-1802
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Terence Yim
>Assignee: Terence Yim
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1802) CompressionCodec class not found if the codec class is not in the same defining classloader as the CodecFactory class

2020-02-24 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1802.
---
Resolution: Fixed

> CompressionCodec class not found if the codec class is not in the same 
> defining classloader as the CodecFactory class
> -
>
> Key: PARQUET-1802
> URL: https://issues.apache.org/jira/browse/PARQUET-1802
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Terence Yim
>Assignee: Terence Yim
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1381) Add merge blocks command to parquet-tools

2020-02-24 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043505#comment-17043505
 ] 

Gabor Szadovszky commented on PARQUET-1381:
---

I don't think anyone is working on it. Feel free to open a PR and I'm happy to 
review.
However, when I had to revert the previous implementation I was thinking about a 
correct solution. It turned out to be quite complicated to make it optimal in 
performance yet correct from a statistics point of view. Currently, I am not sure 
it is worth the potential effort to implement this feature while you can get 
similar functionality by rewriting a whole table of parquet files in a 
distributed way using a query engine (e.g. Hive/Impala/Spark). Also, you can do 
it as a background process using the Hive compaction feature.

> Add merge blocks command to parquet-tools
> -
>
> Key: PARQUET-1381
> URL: https://issues.apache.org/jira/browse/PARQUET-1381
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Ekaterina Galieva
>Assignee: Ekaterina Galieva
>Priority: Major
>  Labels: pull-request-available
>
> The current implementation of the merge command in parquet-tools doesn't merge row 
> groups, it just places one after the other. Add an API and a command option to be 
> able to merge small blocks into larger ones up to a specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate pushdown 
> optimizations.
> Blocks are not divided to fit the upper bound perfectly. 
> This is an intentional performance optimization. 
> This gives an opportunity to form new blocks by copying the full content of 
> smaller blocks by column, not by row.
> h6. Examples:
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge }}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256 }}
> {code:java}
> [163 | 168 | 120]
> {code}
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6] 
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126] 
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}
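
For illustration only (not the parquet-tools implementation): the block sizes in 
the examples above follow from a simple greedy grouping that never splits a block.
{code:java}
import java.util.ArrayList;
import java.util.List;

class GreedyBlockMerge {
  // Groups block sizes in order, closing a group before it would exceed the limit.
  // E.g. [128, 35, 128, 40, 120] with limit 256 -> [163, 168, 120].
  static List<Integer> mergedSizes(int[] blockSizes, int limit) {
    List<Integer> result = new ArrayList<>();
    int current = 0;
    for (int size : blockSizes) {
      if (current > 0 && current + size > limit) {
        result.add(current);
        current = 0;
      }
      current += size;
    }
    if (current > 0) {
      result.add(current);
    }
    return result;
  }
}
{code}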



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1791) Add 'prune' command to parquet-tools

2020-02-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1791.
---
Resolution: Fixed

> Add 'prune' command to parquet-tools 
> -
>
> Key: PARQUET-1791
> URL: https://issues.apache.org/jira/browse/PARQUET-1791
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> During data retention, there is a need to remove unused or personal columns. 
> We are adding a 'prune' command to parquet-tools to remove columns and retain 
> all other columns unchanged. It should work like the 'merge' command in 
> parquet-tools, for example moving the column chunks as a whole.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1784) Column-wise configuration

2020-02-26 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1784.
---
Resolution: Fixed

> Column-wise configuration
> -
>
> Key: PARQUET-1784
> URL: https://issues.apache.org/jira/browse/PARQUET-1784
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configurations is not part of 
> this effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2020-02-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045530#comment-17045530
 ] 

Gabor Szadovszky commented on PARQUET-41:
-

[~junjie], the feature branch for parquet-mr has been merged to master. Please 
resolve the sub-tasks and this jira accordingly.

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1805) Refactor the configuration for bloom filters

2020-02-26 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1805:
-

 Summary: Refactor the configuration for bloom filters
 Key: PARQUET-1805
 URL: https://issues.apache.org/jira/browse/PARQUET-1805
 Project: Parquet
  Issue Type: Improvement
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1803) Could not find FilleInputSplit in ParquetInputSplit

2020-02-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1803:
--
Affects Version/s: (was: format-2.7.0)
   1.11.0

> Could not find FilleInputSplit in ParquetInputSplit 
> 
>
> Key: PARQUET-1803
> URL: https://issues.apache.org/jira/browse/PARQUET-1803
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Shankar Koirala
>Assignee: Shankar Koirala
>Priority: Minor
>  Labels: documentation, pull-request-available
>
> The @deprecated note says "will be removed in 2.0.0. use FileInputSplit instead". 
> This is confusing because there is no FileInputSplit; it should be 
> {{FileSplit}}, provided by Hadoop: 
> [https://hadoop.apache.org/docs/r2.7.3/api/index.html?org/apache/hadoop/mapred/FileSplit.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1803) Could not find FilleInputSplit in ParquetInputSplit

2020-02-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1803:
-

Assignee: Shankar Koirala

> Could not find FilleInputSplit in ParquetInputSplit 
> 
>
> Key: PARQUET-1803
> URL: https://issues.apache.org/jira/browse/PARQUET-1803
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: format-2.7.0
>Reporter: Shankar Koirala
>Assignee: Shankar Koirala
>Priority: Minor
>  Labels: documentation, pull-request-available
>
> The @deprecated note says "will be removed in 2.0.0. use FileInputSplit instead". 
> This is confusing because there is no FileInputSplit; it should be 
> {{FileSplit}}, provided by Hadoop: 
> [https://hadoop.apache.org/docs/r2.7.3/api/index.html?org/apache/hadoop/mapred/FileSplit.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1803) Could not find FilleInputSplit in ParquetInputSplit

2020-02-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1803.
---
Resolution: Fixed

> Could not find FilleInputSplit in ParquetInputSplit 
> 
>
> Key: PARQUET-1803
> URL: https://issues.apache.org/jira/browse/PARQUET-1803
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Shankar Koirala
>Assignee: Shankar Koirala
>Priority: Minor
>  Labels: documentation, pull-request-available
>
> The @deprecated note says "will be removed in 2.0.0. use FileInputSplit instead". 
> This is confusing because there is no FileInputSplit; it should be 
> {{FileSplit}}, provided by Hadoop: 
> [https://hadoop.apache.org/docs/r2.7.3/api/index.html?org/apache/hadoop/mapred/FileSplit.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

2020-03-03 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050013#comment-17050013
 ] 

Gabor Szadovszky commented on PARQUET-1808:
---

[~tiddman],
Thanks for filing this issue.

Please note that the {{org.apache.parquet.example}} package is not for production 
use. You should use other bindings (e.g. parquet-avro, parquet-thrift, 
parquet-protobuf) or write your own. In addition, {{toString()}} is usually for 
debugging purposes and not for serialization. 
That does not mean we shall not write proper {{toString()}} methods. Would you 
like to put up a PR with the fix you've described? I would be happy to review.

> SimpleGroup.toString() uses String += and so has poor performance
> -
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Randy Tidd
>Priority: Minor
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a 
> known performance problem in Java: the cost grows quadratically with the 
> number of strings that are added.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file 
> was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>         repeated group list { 
>         optional binary element (STRING); 
>       }
>     }{code}
>  
> and had over 31,000 values. Reading this single column took over 8 minutes 
> due to time spent in the `toString()` method.  Using a different 
> implementation that uses `StringBuffer` like this:
> {code:java}
>  StringBuffer result = new StringBuffer();
>  int i = 0;
>  for (Type field : schema.getFields()) {
>String name = field.getName();
>List values = data[i];
>++i;
>if (values != null) {
>  if (values.size() > 0) {
>for (Object value : values) {
>  result.append(indent);
>  result.append(name);
>  if (value == null) { 
>result.append(": NULL\n");
>  } else if (value instanceof Group){ 
>result.append("\n"); 
>result.append(betterToString((SimpleGroup)value, indent+" "));
>  } else { 
>result.append(": "); 
>result.append(value.toString()); 
>result.append("\n"); 
>  }
>}
>  }
>}
>  }
>  return result.toString();{code}
> reduced that time to less than 500 milliseconds. 
> The existing implementation is really poor and exhibits an infamous Java 
> string performance issue and should be fixed.
> This was a significant problem for us but we were able to work around it so I 
> am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1809) Add new APIs for nested predicate pushdown

2020-03-04 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051023#comment-17051023
 ] 

Gabor Szadovszky commented on PARQUET-1809:
---

I am afraid it is not only the filter API that is affected by this problem. 
There are many places in the code that use the _dot string_ to address nested 
columns.

If we allowed dots in column names it would break backward compatibility: old 
readers would not be able to read new files.

Also, we address column names in configs (e.g. switching bloom filters on/off 
for specific columns), so we have to deal with the string to column path 
conversion. If dots are no longer separator characters, what else shall we use 
that we won't allow in column names? Or should we implement an escaping 
mechanism for such characters?
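
For context, a minimal sketch of how the current filter API addresses a nested 
column via the dot string; the array-based variant is only the proposed 
direction, not an existing method.
{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

import org.apache.parquet.filter2.predicate.FilterPredicate;

// Today: "outer.inner" is split on '.', so a field literally named "outer.inner"
// cannot be distinguished from the nested field "inner" inside "outer".
FilterPredicate current = eq(intColumn("outer.inner"), 7);

// Proposed direction (hypothetical signature): pass the path parts explicitly,
// e.g. intColumn("outer", "inner"), so dots inside a single field name need no escaping.
{code}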

>  Add new APIs for nested predicate pushdown
> ---
>
> Key: PARQUET-1809
> URL: https://issues.apache.org/jira/browse/PARQUET-1809
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: DB Tsai
>Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* is 
> using *dot* to split the column name into multi-parts of nested fields. The 
> drawback is that this causes issues when the field name contains *dot*.
> The new APIs that will be added will take an array of strings directly for 
> the parts of nested fields, so there is no confusion from using *dot* as a 
> separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

2020-03-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051933#comment-17051933
 ] 

Gabor Szadovszky commented on PARQUET-1808:
---

[~tiddman],

I agree that the current project structure of parquet-mr is not optimal. It is 
not always clear which classes are public and which are only for internal use. 
Unfortunately, we cannot restructure the whole project without breaking 
changes, so this effort is left for the next major release.

I guess you not only use classes from the {{parquet-column}} jar but from 
{{parquet-hadoop}} as well. As you are using {{Group}} as the data structure, 
you should use {{ExampleInputFormat}} and {{ExampleOutputFormat}}. Both the 
names and the class comments state that these classes are example 
implementations. {{ExampleParquetWriter}} even states that "_THIS IS AN EXAMPLE 
ONLY AND NOT INTENDED FOR USE._"

parquet-mr does not have its own fully supported data structure. Instead, it 
has bindings for different serialization engines (e.g. thrift, protobuf) or 
data structures (e.g. avro). Many of our clients use the Avro binding, so they 
can use the data structures Avro provides. You may find some examples of using 
{{parquet-avro}} in the [unit 
tests|https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/test/java/org/apache/parquet/avro].
 You may also check the sub-modules for the other bindings. I would suggest 
checking the related unit tests if you cannot find proper documentation.
Another option would be to implement your own data structure (if you don't have 
one already) and implement the binding for it. You may check the implementation 
of the binding for the {{Group}} data structure in 
[parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example].
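
For reference, here is a minimal sketch of writing records through the Avro 
binding mentioned above (the schema and output path are made up for 
illustration only):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

Schema schema = SchemaBuilder.record("Customer").fields()
    .requiredLong("id")
    .requiredString("name")
    .endRecord();

try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new Path("/tmp/customers.parquet"))
    .withSchema(schema)
    .build()) {
  GenericRecord record = new GenericData.Record(schema);
  record.put("id", 1L);
  record.put("name", "example");
  writer.write(record);
}
{code}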

> SimpleGroup.toString() uses String += and so has poor performance
> -
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Randy Tidd
>Priority: Minor
>  Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a 
> known performance problem in Java; the performance degrades quadratically as 
> more strings are added.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file 
> was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>         repeated group list { 
>         optional binary element (STRING); 
>       }
>     }{code}
>  
> and had over 31,000 values. Reading this single column took over 8 minutes 
> due to time spent in the `toString()` method.  Using a different 
> implementation that uses `StringBuffer` like this:
> {code:java}
>  StringBuffer result = new StringBuffer();
>  int i = 0;
>  for (Type field : schema.getFields()) {
>String name = field.getName();
>List values = data[i];
>++i;
>if (values != null) {
>  if (values.size() > 0) {
>for (Object value : values) {
>  result.append(indent);
>  result.append(name);
>  if (value == null) { 
>result.append(": NULL\n");
>  } else if (value instanceof Group){ 
>result.append("\n"); 
>result.append(betterToString((SimpleGroup)value, indent+" "));
>  } else { 
>result.append(": "); 
>result.append(value.toString()); 
>result.append("\n"); 
>  }
>}
>  }
>}
>  }
>  return result.toString();{code}
> reduced that time to less than 500 milliseconds. 
> The existing implementation is really poor and exhibits an infamous Java 
> string performance issue and should be fixed.
> This was a significant problem for us but we were able to work around it so I 
> am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

2020-03-05 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1808:
-

Assignee: Shankar Koirala

> SimpleGroup.toString() uses String += and so has poor performance
> -
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Randy Tidd
>Assignee: Shankar Koirala
>Priority: Minor
>  Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a 
> known performance problem in Java; the performance degrades quadratically as 
> more strings are added.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file 
> was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>         repeated group list { 
>         optional binary element (STRING); 
>       }
>     }{code}
>  
> and had over 31,000 values. Reading this single column took over 8 minutes 
> due to time spent in the `toString()` method.  Using a different 
> implementation that uses `StringBuffer` like this:
> {code:java}
>  StringBuffer result = new StringBuffer();
>  int i = 0;
>  for (Type field : schema.getFields()) {
>String name = field.getName();
>List values = data[i];
>++i;
>if (values != null) {
>  if (values.size() > 0) {
>for (Object value : values) {
>  result.append(indent);
>  result.append(name);
>  if (value == null) { 
>result.append(": NULL\n");
>  } else if (value instanceof Group){ 
>result.append("\n"); 
>result.append(betterToString((SimpleGroup)value, indent+" "));
>  } else { 
>result.append(": "); 
>result.append(value.toString()); 
>result.append("\n"); 
>  }
>}
>  }
>}
>  }
>  return result.toString();{code}
> reduced that time to less than 500 milliseconds. 
> The existing implementation is really poor and exhibits an infamous Java 
> string performance issue and should be fixed.
> This was a significant problem for us but we were able to work around it so I 
> am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1809) Add new APIs for nested predicate pushdown

2020-03-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051959#comment-17051959
 ] 

Gabor Szadovszky commented on PARQUET-1809:
---

It would be nice to use string arrays (or, maybe more properly, 
[ColumnPath|https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java]
 objects) instead of the _dot strings_ in all code parts, but it seems to be a 
huge effort. And, in the case of the mentioned configuration keys, it is not 
possible.
The problem with allowing '.' characters in column names is that collisions may 
occur with schemas like the following one:
{code}
message Document {
  required group foo {
    required int64 bar;
  }
  required int64 foo.bar;
}
{code}
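
To make the collision concrete, here is a short sketch (the predicate itself is 
only illustrative) of how the same dot string can address either column of the 
schema above, while {{ColumnPath}} keeps them distinct:
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.metadata.ColumnPath;

// With the dot-string based API there is no way to tell whether this predicate
// targets the nested column foo.bar or the top-level field named "foo.bar".
FilterPredicate p = FilterApi.eq(FilterApi.longColumn("foo.bar"), 7L);

// ColumnPath keeps the path parts separate, so the two columns stay distinct:
ColumnPath nested = ColumnPath.get("foo", "bar"); // the nested column
ColumnPath flat   = ColumnPath.get("foo.bar");    // the top-level column
{code}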

>  Add new APIs for nested predicate pushdown
> ---
>
> Key: PARQUET-1809
> URL: https://issues.apache.org/jira/browse/PARQUET-1809
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: DB Tsai
>Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* is 
> using *dot* to split the column name into multi-parts of nested fields. The 
> drawback is that this causes issues when the field name contains *dot*.
> The new APIs that will be added will take an array of strings directly for 
> the parts of nested fields, so there is no confusion from using *dot* as a 
> separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1811) Update download links

2020-03-05 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1811:
-

 Summary: Update download links
 Key: PARQUET-1811
 URL: https://issues.apache.org/jira/browse/PARQUET-1811
 Project: Parquet
  Issue Type: Task
  Components: parquet-site
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Based on the following mail sent to the private list we shall update the 
download links on our site.
{quote}
Hello, Apache PMCs,

In order to better provide our millions of users with downloads, the
Apache Infrastructure Team has been restructuring the way downloads work
for our main distribution channels in the past few weeks. For users,
this will largely go unnoticed, and for projects likely the same, but we
did want to reach out to projects and inform them of the changes we've
made:

As of March 2020, we are deprecating www.apache.org/dist/ in favor of
https://downloads.apache.org/ for backup downloads as well as signature
and checksum verification. The primary driver has been splitting up web
site visits and downloads to gain better control and offer a better
service for both downloads and web site visits.

As stated, this does not impact end-users, and should have a minimal
impact on projects, as our download selectors as well as visits to
www.apache.org/dist/ have been adjusted to make use of
downloads.apache.org instead. We do however ask that projects, in their
own time-frame, change references on their own web sites from
www.apache.org/dist/ to downloads.apache.org wherever such references
may exist, to complete the switch in full. We will NOT be turning off
www.apache.org/dist/ in the near future, but would greatly appreciate if
projects could help us transition away from the old URLs in their
documentation and on their download pages.

The standard way of uploading releases[1] will STILL apply, however
there may be a short delay (<= 15 minutes) between releasing and
releases showing up on downloads.apache.org for technical reasons.

If you have any questions about this change, please do not hesitate
to reach out to us at us...@infra.apache.org.

With regards,
Daniel on behalf of ASF Infrastructure.

[1] https://www.apache.org/legal/release-policy.html#upload-ci
{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1811) Update download links

2020-03-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1811.
---
Resolution: Fixed

> Update download links
> -
>
> Key: PARQUET-1811
> URL: https://issues.apache.org/jira/browse/PARQUET-1811
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-site
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Based on the following mail sent to the private list we shall update the 
> download links on our site.
> {quote}
> Hello, Apache PMCs,
> In order to better provide our millions of users with downloads, the
> Apache Infrastructure Team has been restructuring the way downloads work
> for our main distribution channels in the past few weeks. For users,
> this will largely go unnoticed, and for projects likely the same, but we
> did want to reach out to projects and inform them of the changes we've
> made:
> As of March 2020, we are deprecating www.apache.org/dist/ in favor of
> https://downloads.apache.org/ for backup downloads as well as signature
> and checksum verification. The primary driver has been splitting up web
> site visits and downloads to gain better control and offer a better
> service for both downloads and web site visits.
> As stated, this does not impact end-users, and should have a minimal
> impact on projects, as our download selectors as well as visits to
> www.apache.org/dist/ have been adjusted to make use of
> downloads.apache.org instead. We do however ask that projects, in their
> own time-frame, change references on their own web sites from
> www.apache.org/dist/ to downloads.apache.org wherever such references
> may exist, to complete the switch in full. We will NOT be turning off
> www.apache.org/dist/ in the near future, but would greatly appreciate if
> projects could help us transition away from the old URLs in their
> documentation and on their download pages.
> The standard way of uploading releases[1] will STILL apply, however
> there may be a short delay (<= 15 minutes) between releasing and
> releases showing up on downloads.apache.org for technical reasons.
> If you have any questions about this change, please do not hesitate
> to reach out to us at us...@infra.apache.org.
> With regards,
> Daniel on behalf of ASF Infrastructure.
> [1] https://www.apache.org/legal/release-policy.html#upload-ci
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1815) Add union API to BloomFilter interface

2020-03-18 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061545#comment-17061545
 ] 

Gabor Szadovszky commented on PARQUET-1815:
---

The currently implemented filters in parquet-mr (e.g. the dictionary filter, 
column indexes) are created for internal use. This means the user does not have 
to care about them: the user simply sets the filter and gets the required 
values without knowing which filter implementation is dropping the unneeded 
ones.
What is not clear to me in this jira is how the user would benefit from the 
union of the bloom filters.

> Add union API to BloomFilter interface
> --
>
> Key: PARQUET-1815
> URL: https://issues.apache.org/jira/browse/PARQUET-1815
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes, one may want to build a file-level bloom filter by unioning the 
> bloom filters of all row groups to save some memory. Adding a union API would 
> make this easy to do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2020-03-18 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061558#comment-17061558
 ] 

Gabor Szadovszky commented on PARQUET-41:
-

[~junma], the target release for this feature is {{1.12.0}}. The main content 
of {{1.12.0}} would be this feature and the column encryption (PARQUET-1178). 
Both are under development so I would not give any deadlines for the release.

> Add bloom filters to parquet statistics
> ---
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format, parquet-mr
>Reporter: Alex Levenson
>Assignee: Junjie Chen
>Priority: Major
>  Labels: filter2, pull-request-available
> Fix For: format-2.7.0
>
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1816) Add intersection API to BloomFilter interface

2020-03-18 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061581#comment-17061581
 ] 

Gabor Szadovszky commented on PARQUET-1816:
---

Please find my comment at PARQUET-1815.

> Add intersection API to BloomFilter interface
> -
>
> Key: PARQUET-1816
> URL: https://issues.apache.org/jira/browse/PARQUET-1816
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Walid Gara
>Priority: Minor
>  Labels: pull-request-available
>
> The intersection of Bloom filters is a useful operation when we manipulate 
> bloom filters directly.
> Note: The intersection of two bloom filters has a higher false-positive rate 
> than a bloom filter constructed from the intersection of the two sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1815) Add union API to BloomFilter interface

2020-03-18 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061641#comment-17061641
 ] 

Gabor Szadovszky commented on PARQUET-1815:
---

If one would like to use bloom filters outside the scope of parquet-mr (e.g. to 
union the bloom filters of several files for a partition of a table), then I 
think exposing the Parquet bloom filter interface is not a good idea. For 
example, Iceberg supports the file formats Avro, Parquet and ORC, and ORC also 
has its own bloom filter implementation. If we would like to support this 
example scenario in Iceberg, it would be better to use a common interface for 
bloom filters that is not part of the Parquet API.

I am not against implementing this functionality in parquet-mr (it is not a 
complex one anyway); I just have not seen a use case yet, and I think it is a 
bit early to implement such functionality without a driving case.

> Add union API to BloomFilter interface
> --
>
> Key: PARQUET-1815
> URL: https://issues.apache.org/jira/browse/PARQUET-1815
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes, one may want to build a file-level bloom filter by unioning the 
> bloom filters of all row groups to save some memory. Adding a union API would 
> make this easy to do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1826) Document hadoop configuration options

2020-03-25 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1826:
-

 Summary: Document hadoop configuration options
 Key: PARQUET-1826
 URL: https://issues.apache.org/jira/browse/PARQUET-1826
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gabor Szadovszky


The currently available hadoop configuration options are not documented 
properly. The only documentation we have is the javadoc comments and the 
implementation of 
[ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
We shall investigate all the possible options and their usage/default values 
and document them properly in a way that is easily accessible to our users.

I would suggest creating a `README.md` file in the sub-module 
[parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
 that would describe the purpose of the module and have a section listing the 
possible hadoop configuration options. (Later on we shall extend this document 
with other descriptions about the purpose and usage of our library in the 
hadoop ecosystem. These efforts shall be covered by other jiras.)

By keeping the description next to the source code it will be easy to extend it 
with the new features we implement, so it stays up-to-date for every release.
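
As an illustration of the kind of options such a document should cover, here 
are a few commonly used {{ParquetOutputFormat}} keys (the values below are only 
examples, not the defaults to be documented):
{code:java}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("parquet.block.size", 128 * 1024 * 1024); // target row group size in bytes
conf.setInt("parquet.page.size", 1024 * 1024);        // target page size in bytes
conf.set("parquet.compression", "SNAPPY");            // compression codec
conf.setBoolean("parquet.enable.dictionary", true);   // dictionary encoding on/off
{code}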



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1743) Add equals to BlockSplitBloomFilter

2020-03-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1743:
-

Assignee: Walid Gara

> Add equals to BlockSplitBloomFilter
> ---
>
> Key: PARQUET-1743
> URL: https://issues.apache.org/jira/browse/PARQUET-1743
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Walid Gara
>Assignee: Walid Gara
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The method equals can be used to compare Bloom filters in tests since we 
> can't access its bitset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1816) Add intersection API to BloomFilter interface

2020-03-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1816:
-

Assignee: Walid Gara

> Add intersection API to BloomFilter interface
> -
>
> Key: PARQUET-1816
> URL: https://issues.apache.org/jira/browse/PARQUET-1816
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Walid Gara
>Assignee: Walid Gara
>Priority: Minor
>  Labels: pull-request-available
>
> The intersection of Bloom filters is a useful operation when we manipulate 
> bloom filters directly.
> Note: The intersection of two bloom filters has a higher false-positive rate 
> than a bloom filter constructed from the intersection of the two sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1815) Add union API to BloomFilter interface

2020-03-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1815:
-

Assignee: Walid Gara

> Add union API to BloomFilter interface
> --
>
> Key: PARQUET-1815
> URL: https://issues.apache.org/jira/browse/PARQUET-1815
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Junjie Chen
>Assignee: Walid Gara
>Priority: Minor
>  Labels: pull-request-available
>
> Sometimes, one may want to build a file-level bloom filter by unioning the 
> bloom filters of all row groups to save some memory. Adding a union API would 
> make this easy to do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1787) Expected distinct numbers is not parsed correctly

2020-03-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1787:
-

Assignee: Walid Gara

> Expected distinct numbers is not parsed correctly
> -
>
> Key: PARQUET-1787
> URL: https://issues.apache.org/jira/browse/PARQUET-1787
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Walid Gara
>Assignee: Walid Gara
>Priority: Critical
>  Labels: pull-request-available
>
> In the bloom filter feature, when I pass the expected distinct numbers as 
> below, I got null values instead of 1000 and 200.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> Configuration conf = new Configuration();
> conf.set("parquet.bloom.filter.column.names", "content,line"); 
> conf.set("parquet.bloom.filter.expected.ndv","1000,200");
> {code}
>  
>  The issue comes from reading the expected distinct numbers as system 
> properties through 
> [Long.getLong(expectedNDVs[i])|https://github.com/apache/parquet-mr/blob/a737141a571e3cb6cee2c252dc4406e26e6c1177/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L251].
>  
> It's possible to fix it by parsing the string with 
> Long.parseLong(expectedNDVs[i]).
>  
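
For clarity, the difference between the two calls is that {{Long.getLong}} 
looks up a system property named by its argument, while {{Long.parseLong}} 
parses the string itself:
{code:java}
// Long.getLong("1000") looks for a system property called "1000";
// none exists, so it returns null - hence the null expected NDV values.
Long viaSystemProperty = Long.getLong("1000");  // null

// Long.parseLong("1000") parses the string as a number.
long parsed = Long.parseLong("1000");           // 1000
{code}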



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1826) Document hadoop configuration options

2020-03-25 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1826:
-

Assignee: Walid Gara

Based on our discussion in the Parquet sync I'm assigning this to you, 
[~garawalid]. Feel free to contact me or post your questions directly on this 
jira.

> Document hadoop configuration options
> -
>
> Key: PARQUET-1826
> URL: https://issues.apache.org/jira/browse/PARQUET-1826
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Walid Gara
>Priority: Major
>
> The currently available hadoop configuration options are not documented 
> properly. The only documentation we have is the javadoc comments and the 
> implementation of 
> [ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
> We shall investigate all the possible options and their usage/default values 
> and document them properly in a way that is easily accessible to our users.
> I would suggest creating a `README.md` file in the sub-module 
> [parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
>  that would describe the purpose of the module and have a section listing the 
> possible hadoop configuration options. (Later on we shall extend this 
> document with other descriptions about the purpose and usage of our library 
> in the hadoop ecosystem. These efforts shall be covered by other jiras.)
> By keeping the description next to the source code it will be easy to extend 
> it with the new features we implement, so it stays up-to-date for every 
> release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1828) Add a SSE2 path for the ByteStreamSplit encoder implementation

2020-03-26 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1828:
--
Component/s: parquet-cpp

> Add a SSE2 path for the ByteStreamSplit encoder implementation
> --
>
> Key: PARQUET-1828
> URL: https://issues.apache.org/jira/browse/PARQUET-1828
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The encode path for the byte stream split encoding can have better 
> performance if SSE2 intrinsics are used.
> The decode path already uses sse2 intrinsics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068431#comment-17068431
 ] 

Gabor Szadovszky commented on PARQUET-1830:
---

[~FelixKJose], the feature of having a vectorized API in parquet-mr has only 
been a topic in some of our discussions. No effort has been made to design or 
implement it yet.
It is unfortunate that both Spark and Hive implemented their own vectorization 
by using parquet-mr internal APIs (e.g. reading pages directly) instead of 
having something common in parquet-mr. To have such an API designed and 
implemented properly, we need design input from our users.

However, to support column indexes in Spark we might have some other approaches:
* As Spark already uses some internal APIs of parquet-mr, it could go one step 
further and implement the page skipping mechanism that parquet-mr already 
provides.
   pros: might be a quicker solution if the Spark community has the resources 
to implement it
   cons: duplicates code and increases the amount of Parquet-related code 
outside of parquet-mr
* Having a simpler (not vectorized) API in parquet-mr that puts an abstraction 
layer on top of pages, reading the triplets of value, definition level and 
repetition level from a row group (a purely hypothetical sketch of such an API 
follows below)
   pros: cleaner API in parquet-mr, possibly cleaner code in Spark, and it 
hides the page skipping mechanism introduced by column indexes
   cons: the lower-level API cannot be used anymore (e.g. Spark's own 
vectorized RLE decoder)

What do you think?
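
To make the second option more concrete, here is the purely hypothetical sketch 
referred to above - none of these types exist in parquet-mr today:
{code:java}
// Hypothetical sketch only - NOT an existing parquet-mr API.
// Iterates the (repetition level, definition level, value) triplets of one
// column within a row group, transparently skipping any pages dropped by the
// column index / offset index based filtering.
public interface TripletIterator<T> {
  boolean hasNext();        // false once the (filtered) row group is exhausted
  void advance();           // move to the next triplet
  int repetitionLevel();    // repetition level of the current triplet
  int definitionLevel();    // definition level of the current triplet
  T value();                // the value; only meaningful if the field is defined
}
{code}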

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable the 
> vectorized reader in Spark - which has other performance implications. 
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1805) Refactor the configuration for bloom filters

2020-03-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1805.
---
Resolution: Fixed

> Refactor the configuration for bloom filters
> 
>
> Key: PARQUET-1805
> URL: https://issues.apache.org/jira/browse/PARQUET-1805
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1817) Crypto Properties Factory

2020-03-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1817.
---
Resolution: Fixed

> Crypto Properties Factory
> -
>
> Key: PARQUET-1817
> URL: https://issues.apache.org/jira/browse/PARQUET-1817
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Gidon Gershinsky
>Assignee: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
>
> Basic common interface (abstract class) for loading of file encryption and 
> decryption properties - making them transparent to analytic frameworks, so 
> they can leverage Parquet modular encryption (PARQUET-1178) without any code 
> changes . This interface depends on passing of Hadoop configuration - already 
> done by frameworks that work with parquet-mr. The "write" part of the 
> interface can also utilize the name/path of the file being written, and its 
> WriteContext, that contains the schema with extensions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070830#comment-17070830
 ] 

Gabor Szadovszky commented on PARQUET-1830:
---

[~FelixKJose], agreed.
So this jira is to track the long-term effort of having a vectorized API in 
parquet-mr, so our clients don't have to use our internal API to get fast 
reading while still having our ppd filtering (including column indexes and 
bloom filters) executed automatically under the hood.

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable the 
> vectorized reader in Spark - which has other performance implications. 
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1827) UUID type currently not supported by parquet-mr

2020-03-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1827:
-

Assignee: Gabor Szadovszky

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Brad Smith
>Assignee: Gabor Szadovszky
>Priority: Major
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it 
> currently isn't supported in the parquet-mr library. Hopefully this feature 
> can be implemented at some point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070971#comment-17070971
 ] 

Gabor Szadovszky commented on PARQUET-1830:
---

[~FelixKJose], you said you would prefer option 1. That one would be a 
Spark-only change.

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable the 
> vectorized reader in Spark - which has other performance implications. 
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-03-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070993#comment-17070993
 ] 

Gabor Szadovszky commented on PARQUET-1830:
---

Agreed. That's what I wanted to say some comments ago. :)

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it 
> seems that Apache Spark doesn't support Column Index unless we disable the 
> vectorized reader in Spark - which has other performance implications. 
> As per [~zi], parquet-mr should implement a Vectorized API. Is it already 
> implemented, or is there a pull request for it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1826) Document hadoop configuration options

2020-04-01 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072587#comment-17072587
 ] 

Gabor Szadovszky commented on PARQUET-1826:
---

I was not able to find any proper documentation about the parquet writer 
version either. Let me summarize here what it means and how parquet-mr behaves 
with respect to it.

PARQUET_2_0 means that {{DataPageHeaderV2}} (instead of {{DataPageHeader}}) is 
used to write the data pages of the parquet file. The main difference is that 
_v2_ pages store the levels uncompressed (while _v1_ pages compress the levels 
together with the data). Also, the _v2_ page header does not contain any field 
for the encoding of the levels, but that does not really matter, as we always 
use (at least in parquet-mr) the [Run Length Encoding / Bit-Packing 
Hybrid|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3].
 See the header definitions with some more comments in 
[parquet.thrift|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift].
Although I did not find it documented anywhere, there are other differences 
between _v1_ and _v2_ in the parquet-mr implementation: the default encodings 
of the primitive types are different. See the differences between 
[DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java]
 and 
[DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java].
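
For reference, a minimal sketch of selecting the writer version through the 
Hadoop configuration (using the {{parquet.writer.version}} key read by 
{{ParquetOutputFormat}}; shown purely as an illustration):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

Configuration conf = new Configuration();
// Write v2 data pages (DataPageHeaderV2) and use the v2 default value encodings.
conf.set(ParquetOutputFormat.WRITER_VERSION, "PARQUET_2_0");
{code}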

> Document hadoop configuration options
> -
>
> Key: PARQUET-1826
> URL: https://issues.apache.org/jira/browse/PARQUET-1826
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Walid Gara
>Priority: Major
>
> The currently available hadoop configuration options are not documented 
> properly. The only documentation we have is the javadoc comments and the 
> implementation of 
> [ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
> We shall investigate all the possible options and their usage/default values 
> and document them properly in a way that is easily accessible to our users.
> I would suggest creating a `README.md` file in the sub-module 
> [parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
>  that would describe the purpose of the module and have a section listing the 
> possible hadoop configuration options. (Later on we shall extend this 
> document with other descriptions about the purpose and usage of our library 
> in the hadoop ecosystem. These efforts shall be covered by other jiras.)
> By keeping the description next to the source code it will be easy to extend 
> it with the new features we implement, so it stays up-to-date for every 
> release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1832) Travis fails with too long output

2020-04-01 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1832:
-

 Summary: Travis fails with too long output
 Key: PARQUET-1832
 URL: https://issues.apache.org/jira/browse/PARQUET-1832
 Project: Parquet
  Issue Type: Test
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Travis fails with the error message _"The job exceeded the maximum log length, 
and has been terminated"_. We have to cut down our useless output during 
build/test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1833) InternalParquetRecordWriter - Too much memory used

2020-04-02 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1833:
-

 Summary: InternalParquetRecordWriter - Too much memory used
 Key: PARQUET-1833
 URL: https://issues.apache.org/jira/browse/PARQUET-1833
 Project: Parquet
  Issue Type: Test
Reporter: Gabor Szadovszky


Our logs are full of entries starting with {{InternalParquetRecordWriter - Too 
much memory used: (...)}}. We shall investigate why this is logged and whether 
it is a logging issue or some regression in our memory consumption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1832) Travis fails with too long output

2020-04-15 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1832.
---
Resolution: Fixed

> Travis fails with too long output
> -
>
> Key: PARQUET-1832
> URL: https://issues.apache.org/jira/browse/PARQUET-1832
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> Travis fails with the error message _"The job exceeded the maximum log 
> length, and has been terminated"_. We have to cut down our useless output 
> during build/test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-04-15 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084075#comment-17084075
 ] 

Gabor Szadovszky commented on PARQUET-1739:
---

[~yumwang],

Have you succeeded in implementing the page skipping mechanism in Spark? Without 
that you may only see the overhead of the column indexes and not the benefit.
Meanwhile, even if page skipping is implemented, there might be a little 
performance degradation in case the data is not sorted at all (the min/max 
values are very similar for the different pages). In this case the 
column/offset index reading I/O is pure overhead: we cannot drop any pages 
based on the min/max values, so we read the same amount of data as we would 
without column indexes.

From the column index point of view we should not see too much difference 
between the runs if no ppd is used (no filter is set in the parquet API).

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 1.11.1
>
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-17 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-1844:
-

 Summary: Removed Hadoop transitive dependency on commons-lang
 Key: PARQUET-1844
 URL: https://issues.apache.org/jira/browse/PARQUET-1844
 Project: Parquet
  Issue Type: Task
Reporter: Gabor Szadovszky


Some of our code parts use commons-lang without declaring a direct dependency 
on it; it comes in as a transitive dependency from Hadoop. From Hadoop 3.3 on, 
Hadoop migrated from commons-lang to commons-lang3, which breaks the build if 
parquet-mr is built against it.

We shall either properly declare a direct dependency on commons-lang (or 
migrate to commons-lang3), or refactor the code to not use commons-lang at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1699) Could not resolve org.apache.yetus:audience-annotations:0.11.0

2020-04-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1699:
-

Assignee: Priyank Bagrecha

> Could not resolve org.apache.yetus:audience-annotations:0.11.0
> --
>
> Key: PARQUET-1699
> URL: https://issues.apache.org/jira/browse/PARQUET-1699
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Priyank Bagrecha
>Assignee: Priyank Bagrecha
>Priority: Major
>
> Trying to use parquet-protobuf, I get this error via parquet-common. I'm 
> using the latest on the master branch:
> {code:java}
> > Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   Required by:
>   project : > org.apache.parquet:parquet-common:1.11.0-SNAPSHOT
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> /Users/pbagrecha/.m2/repository/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 'jdk.tools:jdk.tools:jar'
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> https://repo1.maven.org/maven2/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 'jdk.tools:jdk.tools:jar'
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> https://jcenter.bintray.com/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 
> 'jdk.tools:jdk.tools:jar'{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1699) Could not resolve org.apache.yetus:audience-annotations:0.11.0

2020-04-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1699.
---
Resolution: Fixed

> Could not resolve org.apache.yetus:audience-annotations:0.11.0
> --
>
> Key: PARQUET-1699
> URL: https://issues.apache.org/jira/browse/PARQUET-1699
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Priyank Bagrecha
>Assignee: Priyank Bagrecha
>Priority: Major
>
> Trying to use parquet-protobuf, I get this error via parquet-common. I'm 
> using the latest on the master branch:
> {code:java}
> > Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   Required by:
>   project : > org.apache.parquet:parquet-common:1.11.0-SNAPSHOT
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> /Users/pbagrecha/.m2/repository/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 'jdk.tools:jdk.tools:jar'
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> https://repo1.maven.org/maven2/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 'jdk.tools:jdk.tools:jar'
>> Could not resolve org.apache.yetus:audience-annotations:0.11.0.
>   > Could not parse POM 
> https://jcenter.bintray.com/org/apache/yetus/audience-annotations/0.11.0/audience-annotations-0.11.0.pom
>  > Unable to resolve version for dependency 
> 'jdk.tools:jdk.tools:jar'{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1844:
-

Assignee: Gabor Szadovszky

> Removed Hadoop transitive dependency on commons-lang
> 
>
> Key: PARQUET-1844
> URL: https://issues.apache.org/jira/browse/PARQUET-1844
> Project: Parquet
>  Issue Type: Task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Some of our code parts use commons-lang without declaring a direct dependency 
> on it; it comes in as a transitive dependency from Hadoop. From Hadoop 3.3 
> on, Hadoop migrated from commons-lang to commons-lang3, which breaks the 
> build if parquet-mr is built against it.
> We shall either properly declare a direct dependency on commons-lang (or 
> migrate to commons-lang3), or refactor the code to not use commons-lang at 
> all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1844:
--
Affects Version/s: 1.11.0

> Removed Hadoop transitive dependency on commons-lang
> 
>
> Key: PARQUET-1844
> URL: https://issues.apache.org/jira/browse/PARQUET-1844
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.11.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Some of our code parts use commons-lang without declaring a direct dependency 
> on it; it comes in as a transitive dependency from Hadoop. From Hadoop 3.3 
> on, Hadoop migrated from commons-lang to commons-lang3, which breaks the 
> build if parquet-mr is built against it.
> We shall either properly declare a direct dependency on commons-lang (or 
> migrate to commons-lang3), or refactor the code to not use commons-lang at 
> all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1844) Removed Hadoop transitive dependency on commons-lang

2020-04-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1844.
---
Resolution: Fixed

> Removed Hadoop transitive dependency on commons-lang
> 
>
> Key: PARQUET-1844
> URL: https://issues.apache.org/jira/browse/PARQUET-1844
> Project: Parquet
>  Issue Type: Task
>Affects Versions: 1.11.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Some of our code parts use commons-lang without declaring a direct dependency 
> on it; it comes in as a transitive dependency from Hadoop. From Hadoop 3.3 
> on, Hadoop migrated from commons-lang to commons-lang3, which breaks the 
> build if parquet-mr is built against it.
> We shall either properly declare a direct dependency on commons-lang (or 
> migrate to commons-lang3), or refactor the code to not use commons-lang at 
> all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1826) Document hadoop configuration options

2020-04-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1826.
---
Resolution: Fixed

> Document hadoop configuration options
> -
>
> Key: PARQUET-1826
> URL: https://issues.apache.org/jira/browse/PARQUET-1826
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Walid Gara
>Priority: Major
>  Labels: pull-request-available
>
> The currently available hadoop configuration options are not documented 
> properly. The only documentation we have is the javadoc comments and the 
> implementation of 
> [ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
> We shall investigate all the possible options and their usage/default values 
> and document them properly in a way that is easily accessible to our users.
> I would suggest creating a `README.md` file in the sub-module 
> [parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
>  that would describe the purpose of the module and have a section listing the 
> possible hadoop configuration options. (Later on we shall extend this 
> document with other descriptions about the purpose and usage of our library 
> in the hadoop ecosystem. These efforts shall be covered by other jiras.)
> By keeping the description next to the source code it will be easy to extend 
> it with the new features we implement, so it stays up-to-date for every 
> release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1808) SimpleGroup.toString() uses String += and so has poor performance

2020-05-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1808.
---
Resolution: Fixed

> SimpleGroup.toString() uses String += and so has poor performance
> -
>
> Key: PARQUET-1808
> URL: https://issues.apache.org/jira/browse/PARQUET-1808
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Randy Tidd
>Assignee: Shankar Koirala
>Priority: Minor
>  Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation, which is a 
> known performance problem in Java; the performance degrades quadratically as 
> more strings are added.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file 
> was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>         repeated group list { 
>         optional binary element (STRING); 
>       }
>     }{code}
>  
> and had over 31,000 values. Reading this single column took over 8 minutes 
> due to time spent in the `toString()` method.  Using a different 
> implementation that uses `StringBuffer` like this:
> {code:java}
>  StringBuffer result = new StringBuffer();
>  int i = 0;
>  for (Type field : schema.getFields()) {
>String name = field.getName();
>List values = data[i];
>++i;
>if (values != null) {
>  if (values.size() > 0) {
>for (Object value : values) {
>  result.append(indent);
>  result.append(name);
>  if (value == null) { 
>result.append(": NULL\n");
>  } else if (value instanceof Group){ 
>result.append("\n"); 
>result.append(betterToString((SimpleGroup)value, indent+" "));
>  } else { 
>result.append(": "); 
>result.append(value.toString()); 
>result.append("\n"); 
>  }
>}
>  }
>}
>  }
>  return result.toString();{code}
> reduced that time to less than 500 milliseconds. 
> The existing implementation is really poor and exhibits an infamous Java 
> string performance issue and should be fixed.
> This was a significant problem for us but we were able to work around it so I 
> am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1862) A mistake of Parquet Format Thrift definition file's comment

2020-05-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1862:
-

Assignee: Liam Su

> A mistake of Parquet Format Thrift definition file's comment
> 
>
> Key: PARQUET-1862
> URL: https://issues.apache.org/jira/browse/PARQUET-1862
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Liam Su
>Assignee: Liam Su
>Priority: Minor
>
> A comment of *DataPageHeaderV2* in the src/main/thrift/parquet.thrift is 
> wrong.
>  
> {code:java}
>   /** optional statistics for this column chunk */
>   8: optional Statistics statistics;
> {code}
>  
>  should be
> {code:java}
>   /** optional statistics for the data in this page */
>   8: optional Statistics statistics;
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1862) A mistake of Parquet Format Thrift definition file's comment

2020-05-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1862.
---
Resolution: Fixed

> A mistake of Parquet Format Thrift definition file's comment
> 
>
> Key: PARQUET-1862
> URL: https://issues.apache.org/jira/browse/PARQUET-1862
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Liam Su
>Assignee: Liam Su
>Priority: Minor
>
> A comment of *DataPageHeaderV2* in the src/main/thrift/parquet.thrift is 
> wrong.
>  
> {code:java}
>   /** optional statistics for this column chunk */
>   8: optional Statistics statistics;
> {code}
>  
>  should be
> {code:java}
>   /** optional statistics for the data in this page */
>   8: optional Statistics statistics;
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1863) Remove use of add-test-source mojo in parquet-protobuf

2020-05-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1863.
---
Resolution: Fixed

> Remove use of add-test-source mojo in parquet-protobuf
> --
>
> Key: PARQUET-1863
> URL: https://issues.apache.org/jira/browse/PARQUET-1863
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
>Priority: Minor
>
> {{parquet-protobuf}} uses the {{build-helper-maven-plugin:add-test-source}} Maven 
> mojo to add protobuf test classes to the test sources path, but 
> {{protoc-jar-maven-plugin}} also adds these classes to the main sources path. 
> This is unnecessary ({{protoc-jar-maven-plugin}} could be configured to add them 
> to the test sources path instead) and actually causes confusion with some IDEs 
> (e.g. Eclipse).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1863) Remove use of add-test-source mojo in parquet-protobuf

2020-05-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1863:
-

Assignee: Laurent Goujon

> Remove use of add-test-source mojo in parquet-protobuf
> --
>
> Key: PARQUET-1863
> URL: https://issues.apache.org/jira/browse/PARQUET-1863
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
>Priority: Minor
>
> {{parquet-protobuf}} uses the {{build-helper-maven-plugin:add-test-source}} Maven 
> mojo to add protobuf test classes to the test sources path, but 
> {{protoc-jar-maven-plugin}} also adds these classes to the main sources path. 
> This is unnecessary ({{protoc-jar-maven-plugin}} could be configured to add them 
> to the test sources path instead) and actually causes confusion with some IDEs 
> (e.g. Eclipse).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE

2020-05-19 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1729#comment-1729
 ] 

Gabor Szadovszky commented on PARQUET-1842:
---

[~fadisayed], there is no deadline for 1.12.0 yet. We are waiting for the 
encryption feature to be complete.

> Update Jackson Databind version to address CVE
> --
>
> Key: PARQUET-1842
> URL: https://issues.apache.org/jira/browse/PARQUET-1842
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
> Environment: Any
>Reporter: Patrick OFriel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The current version of jackson-databind in parquet-mr has several CVEs 
> associated with it: [https://nvd.nist.gov/vuln/detail/CVE-2020-10673], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10672], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10969], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-1], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-3], (and a few more). We 
> should update to jackson-databind 2.9.10.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1864) How to generate a file with UUID as a Logical type

2020-05-20 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112041#comment-17112041
 ] 

Gabor Szadovszky commented on PARQUET-1864:
---

UUID is not yet implemented in parquet-mr; see PARQUET-1827 for details. For 
documentation and usage hints once PARQUET-1827 is ready, see 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid].
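
For illustration, a minimal sketch of what declaring a UUID column could look like 
once PARQUET-1827 is ready. This assumes the schema builder API ({{Types}}, 
{{LogicalTypeAnnotation.uuidType()}}) keeps its current shape; the class and field 
names below are made up for the example:
{code:java}
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class UuidSchemaExample {
  public static void main(String[] args) {
    // Per LogicalTypes.md, UUID annotates a 16-byte FIXED_LEN_BYTE_ARRAY.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY).length(16)
            .as(LogicalTypeAnnotation.uuidType()).named("id")
        .named("uuid_example");
    System.out.println(schema);
  }
}
{code}
The printed schema should read roughly like 
{{required fixed_len_byte_array(16) id (UUID);}} inside the message.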

Do you have anything else in mind that requires this issue to be kept open?

> How to generate a file with UUID as a Logical type
> --
>
> Key: PARQUET-1864
> URL: https://issues.apache.org/jira/browse/PARQUET-1864
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Khasim Shaik
>Priority: Minor
>  Labels: newbie
> Fix For: 1.11.1
>
>
> I want to generate a file with UUID as a logical type. 
> Please provide a schema or any hint on how to use UUID as a logical type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1868) Parquet reader options toggle for bloom filter toggles dictionary filtering

2020-06-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1868:
-

Assignee: Ryan Rupp

> Parquet reader options toggle for bloom filter toggles dictionary filtering
> ---
>
> Key: PARQUET-1868
> URL: https://issues.apache.org/jira/browse/PARQUET-1868
> Project: Parquet
>  Issue Type: Bug
>Reporter: Ryan Rupp
>Assignee: Ryan Rupp
>Priority: Trivial
>  Labels: pull-request-available
>
> Looks like the variable names just got swapped, so this toggles dictionary 
> filtering instead of bloom filter usage. Note that this is against current master 
> and not a released version.
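> For illustration only, a hypothetical sketch of the kind of swap described above 
> (names are made up and not the actual reader-options code):
> {code:java}
> // Hypothetical builder: the bug pattern is assigning the argument to the wrong field.
> class ReaderOptions {
>   private boolean useDictionaryFilter = true;
>   private boolean useBloomFilter = true;
>
>   ReaderOptions withBloomFilter(boolean enabled) {
>     this.useDictionaryFilter = enabled; // BUG: toggles dictionary filtering instead
>     return this;
>   }
>
>   ReaderOptions withBloomFilterCorrect(boolean enabled) {
>     this.useBloomFilter = enabled;      // fix: toggle the intended flag
>     return this;
>   }
> }
> {code}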



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1868) Parquet reader options toggle for bloom filter toggles dictionary filtering

2020-06-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1868.
---
Resolution: Fixed

> Parquet reader options toggle for bloom filter toggles dictionary filtering
> ---
>
> Key: PARQUET-1868
> URL: https://issues.apache.org/jira/browse/PARQUET-1868
> Project: Parquet
>  Issue Type: Bug
>Reporter: Ryan Rupp
>Assignee: Ryan Rupp
>Priority: Trivial
>  Labels: pull-request-available
>
> Looks like the variable names just got swapped, so this toggles dictionary 
> filtering instead of bloom filter usage. Note that this is against current master 
> and not a released version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE

2020-06-02 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123800#comment-17123800
 ] 

Gabor Szadovszky commented on PARQUET-1842:
---

[~pofriel], unfortunately I do not have too much time to push through a 
release right now. It may be worth a mail to the dev list asking whether a 
committer has some time to do it. Also, we have a couple of direct and transitive 
dependencies. Is jackson the only one that requires a maintenance release?

> Update Jackson Databind version to address CVE
> --
>
> Key: PARQUET-1842
> URL: https://issues.apache.org/jira/browse/PARQUET-1842
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
> Environment: Any
>Reporter: Patrick OFriel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The current version of jackson-databind in parquet-mr has several CVEs 
> associated with it: [https://nvd.nist.gov/vuln/detail/CVE-2020-10673], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10672], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10969], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-1], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-3], (and a few more). We 
> should update to jackson-databind 2.9.10.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE

2020-06-03 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124675#comment-17124675
 ] 

Gabor Szadovszky commented on PARQUET-1842:
---

[~pofriel], we are working intensively on integrating the encryption feature. I 
hope we can initiate a release within 30 days. How long it will take to have a 
successful vote on it is another question.

> Update Jackson Databind version to address CVE
> --
>
> Key: PARQUET-1842
> URL: https://issues.apache.org/jira/browse/PARQUET-1842
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.11.0
> Environment: Any
>Reporter: Patrick OFriel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The current version of jackson-databind in parquet-mr has several CVEs 
> associated with it: [https://nvd.nist.gov/vuln/detail/CVE-2020-10673], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10672], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10969], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-1], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-3], (and a few more). We 
> should update to jackson-databind 2.9.10.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


<    1   2   3   4   5   6   7   8   9   >