[GitHub] [parquet-mr] parthchandra commented on a diff in pull request #999: [DRAFT] PR to show Vectored IO integration, compilation fails now.

2022-10-10 Thread GitBox


parthchandra commented on code in PR #999:
URL: https://github.com/apache/parquet-mr/pull/999#discussion_r991511389


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1811,6 +1845,44 @@ public void readAll(SeekableInputStream f, ChunkListBuilder builder) throws IOEx
   }
 }
 
+public void readAllVectored(SeekableInputStream f, ChunkListBuilder builder)
+  throws IOException, ExecutionException, InterruptedException {
+  LOG.warn("Reading through the vectored API.[readAllVectored]");
+  List result = new ArrayList(chunks.size());
+
+  int fullAllocations = (int) (length / options.getMaxAllocationSize());
+  int lastAllocationSize = (int) (length % options.getMaxAllocationSize());
+
+  int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
+  List<FileRange> fileRanges = new ArrayList<>(numAllocations);
+
+  long currentOffset = offset;
+  for (int i = 0; i < fullAllocations; i += 1) {
+    //buffers.add(options.getAllocator().allocate(options.getMaxAllocationSize()));
+    fileRanges.add(FileRange.createFileRange(currentOffset, options.getMaxAllocationSize()));
+    currentOffset = currentOffset + options.getMaxAllocationSize();
+  }
+
+  if (lastAllocationSize > 0) {
+    //buffers.add(options.getAllocator().allocate(lastAllocationSize));
+    fileRanges.add(FileRange.createFileRange(currentOffset, lastAllocationSize));
+  }
+  LOG.warn("Doing vectored IO for ranges {}", fileRanges);
+  f.readVectored(fileRanges, ByteBuffer::allocate);

Review Comment:
   Use the allocator (`options.getAllocator`)? Keep in mind the allocated 
buffer might be a direct byte buffer.
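   A minimal sketch of that suggestion (not part of the PR), assuming parquet-mr's `ByteBufferAllocator` with its `allocate(int)` and `isDirect()` methods:

   ```java
   // Sketch: route the vectored-read allocations through the configured allocator
   // instead of ByteBuffer::allocate.
   ByteBufferAllocator allocator = options.getAllocator();
   f.readVectored(fileRanges, allocator::allocate);
   // If allocator.isDirect() is true, the buffers handed back are direct
   // ByteBuffers, so downstream code must not assume ByteBuffer#array() is available.
   ```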



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1093,10 +1099,38 @@ private ColumnChunkPageReadStore internalReadFilteredRowGroup(BlockMetaData bloc
 }
   }
 }
-    // actually read all the chunks
+    // Vectored IO up.
+
+    List<FileRange> ranges = new ArrayList<>();
     for (ConsecutivePartList consecutiveChunks : allParts) {
-      consecutiveChunks.readAll(f, builder);
+      ranges.add(FileRange.createFileRange(consecutiveChunks.offset, (int) consecutiveChunks.length));

Review Comment:
   I would do this the way you were planning to do it initially (or so it 
appears). Move this into a readAllVectored method.
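   For illustration, the call site could then collapse to a sketch like this (assuming the `readAllVectored` method from the first hunk lives on `ConsecutivePartList`):

   ```java
   // Sketch only: per-part vectored read instead of building the ranges inline here.
   for (ConsecutivePartList consecutiveChunks : allParts) {
     consecutiveChunks.readAllVectored(f, builder);
   }
   ```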



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##
@@ -1093,10 +1099,38 @@ private ColumnChunkPageReadStore internalReadFilteredRowGroup(BlockMetaData bloc
 }
   }
 }
-    // actually read all the chunks
+    // Vectored IO up.
+
+    List<FileRange> ranges = new ArrayList<>();
     for (ConsecutivePartList consecutiveChunks : allParts) {
-      consecutiveChunks.readAll(f, builder);
+      ranges.add(FileRange.createFileRange(consecutiveChunks.offset, (int) consecutiveChunks.length));

Review Comment:
   And make it configurable to choose between vectored I/O and non-vectored I/O 
(see `HadoopReadOptions` and `ParquetReadOptions`).
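   A hedged sketch of such a switch; `isVectoredIOEnabled()` and the property name are hypothetical stand-ins for whatever option would be added to `ParquetReadOptions`/`HadoopReadOptions`:

   ```java
   // Hypothetical option check; the accessor does not exist in ParquetReadOptions today.
   // readAllVectored's checked ExecutionException/InterruptedException would also
   // need to be wrapped into an IOException in this method.
   for (ConsecutivePartList consecutiveChunks : allParts) {
     if (options.isVectoredIOEnabled()) {   // e.g. a "parquet.hadoop.vectored.io.enabled" property
       consecutiveChunks.readAllVectored(f, builder);
     } else {
       consecutiveChunks.readAll(f, builder);
     }
   }
   ```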



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-10-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615228#comment-17615228
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-

matthieun commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1273621299

   Hi, I am fine with whatever solution. If you choose #995 that works, please 
just close this one!




> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well-known JSON types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around JSON schema conversion. 
> However, writing the above message traps the parquet writer in an infinite loop due 
> to the "general type" support in protobuf. The current implementation keeps 
> referencing the 6 possible types defined in protobuf (null, bool, number, string, 
> struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowError
>   at java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418)
>   at java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410)
>   at java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>   at java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>   at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>   at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
> {code}
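A self-contained sketch (not from the JIRA) of why a naive descriptor walk never terminates on the well-known types: google.protobuf.Struct contains Value, and Value refers back to Struct and ListValue, so recursion without a depth cap overflows the stack.

{code:java}
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Struct;

public class StructCycleDemo {
  // Recursively visit message-typed fields, the same traversal shape a schema
  // converter performs; Struct -> (fields entry) -> Value -> Struct never ends.
  static void walk(Descriptor d) {
    for (FieldDescriptor f : d.getFields()) {
      if (f.getJavaType() == FieldDescriptor.JavaType.MESSAGE) {
        walk(f.getMessageType());
      }
    }
  }

  public static void main(String[] args) {
    walk(Struct.getDescriptor()); // throws StackOverflowError
  }
}
{code}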



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] matthieun commented on pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-10-10 Thread GitBox


matthieun commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1273621299

   Hi, I am fine with whatever solution. If you choose #995 that works, please 
just close this one!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (PARQUET-2179) Add a test for skipping repeated fields

2022-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2179:

Labels: pull-request-available  (was: )

> Add a test for skipping repeated fields
> ---
>
> Key: PARQUET-2179
> URL: https://issues.apache.org/jira/browse/PARQUET-2179
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The existing test only tests non-repeated fields. Adding a test for repeated 
> fields to make it clear that it is skipping values and not records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2193) Encrypting only one field in nested field prevents reading of other fields in nested field without keys

2022-10-10 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614917#comment-17614917
 ] 

Gidon Gershinsky commented on PARQUET-2193:
---

Welcome.

From the sound of it, this might require each file to be processed by one 
thread only (instead of reading a single file by multiple threads), which 
should be OK in typical use cases where one thread/executor reads multiple 
files anyway. But I'll have a deeper look at this.

> Encrypting only one field in nested field prevents reading of other fields in 
> nested field without keys
> ---
>
> Key: PARQUET-2193
> URL: https://issues.apache.org/jira/browse/PARQUET-2193
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vignesh Nageswaran
>Priority: Major
>
> Hi Team,
> While exploring parquet encryption, I found that if a field in a nested 
> column is encrypted, and I read the parquet directory from another 
> application that does not have the encryption keys to decrypt it, I cannot 
> read the remaining fields of that nested column without keys. 
> Example: 
> {code:java}
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, partitionCol: Int, nestedCol :nestedItem)
> {code}
> In the case class `SquareItem`, the `nestedCol` field is the nested field, and I 
> want to encrypt the field `ic` within it. 
>  
> I also want the footer to be non-encrypted, so that the encrypted parquet file 
> can still be used by legacy applications. 
>  
> Encryption is successful; however, when I query the parquet file using Spark 
> 3.3.0 without any parquet encryption configuration set up, I cannot read the 
> non-encrypted field `sic` of `nestedCol`. I was expecting that only the `ic` 
> field of `nestedCol` would not be queryable.
>  
>  
> Reproducer (Spark 3.3.0, using spark-shell): 
> Downloaded the file 
> [parquet-hadoop-1.12.0-tests.jar|https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.12.0/parquet-hadoop-1.12.0-tests.jar]
>  and added it to the spark jars folder.
> Code to create the encrypted data:
>  
> {code:java}
> sc.hadoopConfiguration.set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
> sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
> sc.hadoopConfiguration.set("parquet.encryption.key.list", "key1a: BAECAwQFBgcICQoLDA0ODw==, key2a: BAECAAECAAECAAECAAECAA==, keyz: BAECAAECAAECAAECAAECAA==")
> sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> val partitionCol = 1
> case class nestedItem(ic: Int = 0, sic : Double, pc: Int = 0)
> case class SquareItem(int_column: Int, square_int_column : Double, partitionCol: Int, nestedCol :nestedItem)
> val dataRange = (1 to 100).toList
> val squares = sc.parallelize(dataRange.map(i => new SquareItem(i, scala.math.pow(i,2), partitionCol, nestedItem(i,i))))
> squares.toDS().show()
> squares.toDS().write.partitionBy("partitionCol").mode("overwrite").option("parquet.encryption.column.keys", "key1a:square_int_column,nestedCol.ic;").option("parquet.encryption.plaintext.footer", true).option("parquet.encryption.footer.key", "keyz").parquet(encryptedParquetPath)
> {code}
> Code to read the data, trying to access the non-encrypted nested field, from a 
> new spark-shell:
>  
> {code:java}
> val encryptedParquetPath = "/tmp/par_enc_footer_non_encrypted"
> spark.sqlContext.read.parquet(encryptedParquetPath).createOrReplaceTempView("test")
> spark.sql("select nestedCol.sic from test").show(){code}
> Since nestedCol.sic is not encrypted, I was expecting results, but
> I get the error below:
>  
> {code:java}
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> [square_int_column]. Null File Decryptor
>   at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>   at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodings(ColumnChunkMetaData.java:348)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.checkDeltaByteArrayProblem(ParquetRecordReader.java:191)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:177)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:375)
>   at 
> 

[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-10-10 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614907#comment-17614907
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~emkornfield],

There are a couple of docs in the parquet-format repo. The related ones are the one 
[about logical types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
and the main one that contains the description of the 
[primitive types|https://github.com/apache/parquet-format/blob/master/README.md#types]. 
Unfortunately, the latter does not contain anything about sorting order.
So, I think, we need to do the following:
* Define the sorting order for the primitive types, or reference the logical 
types description for it. (In most cases it would be a reference, since the 
ordering depends on the related logical type, e.g. signed/unsigned sorting of 
integral types.)
* After defining the sorting order of the primitive floating point numbers 
based on what we've discussed above, reference it from the new half-precision 
FP logical type.

(Another unfortunate thing is that we have some specification-like docs at the 
[parquet site|https://parquet.apache.org] as well. I think we should propagate 
the parquet-format docs to there automatically or simply link them from the 
site. But it is clearly a different topic.)

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
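For illustration, a short Java comparison of the two orderings described above, using only standard library behavior:

{code:java}
public class FloatOrderDemo {
  public static void main(String[] args) {
    // IEEE 754 comparison: NaN is unordered, and -0.0 equals +0.0.
    System.out.println(Double.NaN < 1.0);   // false
    System.out.println(Double.NaN > 1.0);   // false
    System.out.println(-0.0 == 0.0);        // true

    // Java's total order (Double.compare): -0.0 sorts before +0.0, NaN sorts last.
    System.out.println(Double.compare(-0.0, 0.0));                            // negative
    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY)); // positive
  }
}
{code}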



--
This message was sent by Atlassian Jira
(v8.20.10#820010)