[jira] [Updated] (PARQUET-1416) [C++] Deprecate parquet/api/* in favor of simpler public API "parquet/api.h"

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1416:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Deprecate parquet/api/* in favor of simpler public API "parquet/api.h"
> ----------------------------------------------------------------------------
>
> Key: PARQUET-1416
> URL: https://issues.apache.org/jira/browse/PARQUET-1416
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> The public API for this project is simple enough that I don't think we need 
> anything more complex than a single public API header file.
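
As a rough sketch of the proposal, the umbrella header could simply re-export the existing public headers (the four include paths below are the current parquet/api/ layout; everything else is an assumption):

{code:cpp}
// parquet/api.h -- minimal sketch of the proposed single public header.
#pragma once

#include "parquet/api/io.h"      // I/O abstractions
#include "parquet/api/reader.h"  // ParquetFileReader and friends
#include "parquet/api/schema.h"  // schema types
#include "parquet/api/writer.h"  // ParquetFileWriter and friends
{code}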



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-10.0.0
   (was: cpp-9.0.0)

> [C++] Statistics::num_values() is misleading 
> --------------------------------------------
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-10.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.  
> This is misleading, as everywhere else in parquet num_values() indicates all 
> values (null and not-null, i.e. the number of levels).  
> We should likely remove this field, rename it, or at the very least update 
> the documentation.
> CC [~zeroshade]
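
A minimal sketch of the discrepancy, using the current C++ API names and assuming a flat (non-repeated) column chunk with statistics present:

{code:cpp}
#include <cassert>
#include <memory>
#include "parquet/metadata.h"
#include "parquet/statistics.h"

// ColumnChunkMetaData::num_values() counts all levels (nulls included),
// while Statistics::num_values() counts only the encoded, non-null values.
void Illustrate(const parquet::ColumnChunkMetaData& md) {
  std::shared_ptr<parquet::Statistics> stats = md.statistics();
  int64_t levels = md.num_values();       // null and not-null
  int64_t encoded = stats->num_values();  // non-null only
  int64_t nulls = stats->null_count();
  assert(levels == encoded + nulls);      // reconciles for flat columns
}
{code}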



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1859:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Require error message when using ParquetException::EofException
> ---------------------------------------------------------------------
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> "Unexpected end of stream" (the defaults) gives no clue where the failure 
> occurred
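
A minimal sketch of the proposed change, assuming the current signature in parquet/exception.h defaults the message to empty:

{code:cpp}
// Sketch only: drop the default argument so every call site must say
// where the stream ended.
class ParquetException : public std::exception {
 public:
  // Before: static void EofException(const std::string& msg = "");
  static void EofException(const std::string& msg);  // message now required
  // ...
};

// Call sites would then carry context, e.g.:
//   ParquetException::EofException("while reading the page header of column 0");
{code}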



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1814) [C++] TestInt96ParquetIO failure on Windows

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1814:

Fix Version/s: (was: cpp-9.0.0)

> [C++] TestInt96ParquetIO failure on Windows
> -------------------------------------------
>
> Key: PARQUET-1814
> URL: https://issues.apache.org/jira/browse/PARQUET-1814
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code}
> [ RUN  ] TestInt96ParquetIO.ReadIntoTimestamp
> C:/t/arrow/cpp/src/arrow/testing/gtest_util.cc(77): error: Failed
> @@ -0, +0 @@
> -1970-01-01 00:00:00.145738543
> +1970-01-02 11:35:00.145738543
> C:/t/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1034): error: 
> Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't generate new 
> fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestInt96ParquetIO.ReadIntoTimestamp (47 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1657) [C++] Change Bloom filter implementation to use xxhash

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1657:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Change Bloom filter implementation to use xxhash
> ------------------------------------------------------
>
> Key: PARQUET-1657
> URL: https://issues.apache.org/jira/browse/PARQUET-1657
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I also strongly recommend doing away with the virtual function calls if 
> possible. We have vendored xxhash in Apache Arrow, so we should also remove 
> the murmur3 code while we are at it.
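
A minimal sketch of the non-virtual path (XXH64 is the real xxHash entry point; the wrapper name, seed, and include path are assumptions):

{code:cpp}
#include <cstdint>
#include "arrow/vendored/xxhash.h"  // xxhash vendored in Apache Arrow

// Hash a value directly instead of dispatching through a virtual Hasher
// interface: the call can be inlined, and murmur3 goes away entirely.
inline uint64_t BloomHashInt64(int64_t value) {
  return XXH64(&value, sizeof(value), /*seed=*/0);
}
{code}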



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1653) [C++] Deprecated BIT_PACKED level decoding is probably incorrect

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1653:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Deprecated BIT_PACKED level decoding is probably incorrect
> ----------------------------------------------------------------
>
> Key: PARQUET-1653
> URL: https://issues.apache.org/jira/browse/PARQUET-1653
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> In working on PARQUET-1652, I noticed that our implementation of BIT_PACKED 
> almost certainly does not line up with apache/parquet-format. I'm going to 
> disable it in our tests until it can be validated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-10.0.0
   (was: cpp-9.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -----------------------------------------------------------------------------
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-10.0.0
>
>
> Follow up to ARROW-6411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1634) [C++] Factor out data/dictionary page writes to allow for page buffering

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1634:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Factor out data/dictionary page writes to allow for page buffering 
> -------------------------------------------------------------------------
>
> Key: PARQUET-1634
> URL: https://issues.apache.org/jira/browse/PARQUET-1634
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> Logic that eagerly writes out data pages is hard-coded into the column writer 
> implementation:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L565
> For higher-latency file systems like Amazon S3, it makes more sense to buffer 
> pages in memory and write them in larger batches (and preferably 
> asynchronously). We should refactor this logic so the behavior can be chosen 
> rather than left hard-coded.
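
A minimal sketch of the refactoring idea (every name below is an assumption, not existing API): route serialized pages through a sink interface so that eager write-through and in-memory buffering become interchangeable policies.

{code:cpp}
#include <memory>
#include <utility>
#include <vector>
#include "arrow/buffer.h"

// Hypothetical sink: the column writer hands finished pages here instead of
// serializing them to the output stream immediately.
class PageSink {
 public:
  virtual ~PageSink() = default;
  virtual void WritePage(std::shared_ptr<arrow::Buffer> page) = 0;
  virtual void Flush() = 0;  // drain anything buffered
};

// Buffering policy: collect pages and write them out in one large batch,
// which suits high-latency file systems such as Amazon S3.
class BufferedPageSink : public PageSink {
 public:
  void WritePage(std::shared_ptr<arrow::Buffer> page) override {
    pages_.push_back(std::move(page));
  }
  void Flush() override {
    // A real implementation would issue a single large (ideally async)
    // write of all buffered pages here.
    pages_.clear();
  }

 private:
  std::vector<std::shared_ptr<arrow::Buffer>> pages_;
};
{code}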



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1614) [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift deserialization hot path

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1614:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Reuse arrow::Buffer used as scratch space for decryption in Thrift 
> deserialization hot path
> -------------------------------------------------------------------------
>
> Key: PARQUET-1614
> URL: https://issues.apache.org/jira/browse/PARQUET-1614
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> If it is possible to reuse memory on the decrypt-deserialize hot path, that 
> will improve performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1515) [C++] Disable LZ4 codec

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1515:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Disable LZ4 codec
> -----------------------
>
> Key: PARQUET-1515
> URL: https://issues.apache.org/jira/browse/PARQUET-1515
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
>
> As discussed in https://issues.apache.org/jira/browse/PARQUET-1241, 
> parquet-cpp's LZ4 codec is not compatible with Hadoop and parquet-mr. We must 
> disable the codec until we resolve the compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1199) [C++] Support writing (and test reading) boolean values with RLE encoding

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1199:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Support writing (and test reading) boolean values with RLE encoding
> --------------------------------------------------------------------------
>
> Key: PARQUET-1199
> URL: https://issues.apache.org/jira/browse/PARQUET-1199
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> This is supported by the Parquet specification; we should ensure that we are 
> able to read such data.
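
For reference, a sketch of how a caller could opt in per column through the existing WriterProperties builder once RLE boolean writing is supported (the column name is made up):

{code:cpp}
#include <memory>
#include "parquet/properties.h"

std::shared_ptr<parquet::WriterProperties> MakeProps() {
  return parquet::WriterProperties::Builder()
      .encoding("is_active", parquet::Encoding::RLE)  // hypothetical column
      ->build();
}
{code}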



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1430) [C++] Add tests for C++ tools

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1430:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Add tests for C++ tools
> -----------------------------
>
> Key: PARQUET-1430
> URL: https://issues.apache.org/jira/browse/PARQUET-1430
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Priority: Major
>
> We currently do not have any tests for the tools.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2022-08-22 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1158:

Fix Version/s: (was: cpp-9.0.0)

> [C++] Basic RowGroup filtering
> ------------------------------
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply them in the 
> Arrow read methods.
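
A rough sketch of that translation (all names below are hypothetical, not existing API): mirror fastparquet's filter operators as a C++ enum plus a small per-column predicate evaluated against row-group min/max statistics.

{code:cpp}
#include <string>
#include <vector>

enum class FilterOp { kEq, kNe, kLt, kLe, kGt, kGe };

struct ColumnFilter {
  std::string column;  // column to test
  FilterOp op;         // comparison operator
  std::string value;   // literal compared against row-group min/max stats
};

// The Arrow read methods would then accept a std::vector<ColumnFilter> and
// skip any row group whose statistics cannot possibly satisfy every filter.
{code}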



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #985: PARQUET-2173. Fix parquet build against hadoop 3.3.3+

2022-08-22 Thread GitBox


steveloughran commented on code in PR #985:
URL: https://github.com/apache/parquet-mr/pull/985#discussion_r951589888


##########
pom.xml:
##########
@@ -160,7 +160,11 @@
 
           <exclusion>
             <groupId>org.slf4j</groupId>
-            <artifactId>slf4j-log4j12</artifactId>
+            <artifactId>*</artifactId>

Review Comment:
   it means that 
   1. classic log4j is excluded
   2. slf4j-reload4j is excluded
   3. when the hadoop move to log4j2 is finally shipped, its slf4j bindings 
will be excluded too.
   
   exclusions 1 and 2 could be done explicitly, but #3 is unclear until 
something ships. (hadoop trunk/3.4 is on log4j2, but the move isn't complete)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2022-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583039#comment-17583039
 ] 

ASF GitHub Bot commented on PARQUET-2173:
-----------------------------------------

steveloughran commented on code in PR #985:
URL: https://github.com/apache/parquet-mr/pull/985#discussion_r951589888


##########
pom.xml:
##########
@@ -160,7 +160,11 @@
 
           <exclusion>
             <groupId>org.slf4j</groupId>
-            <artifactId>slf4j-log4j12</artifactId>
+            <artifactId>*</artifactId>

Review Comment:
   it means that 
   1. classic log4j is excluded
   2. slf4j-reload4j is excluded
   3. when the hadoop move to log4j2 is finally shipped, its slf4j bindings 
will be excluded too.
   
   exclusions 1 and 2 could be done explicitly, but #3 is unclear until 
something ships. (hadoop trunk/3.4 is on log4j2, but the move isn't complete)





> Fix parquet build against hadoop 3.3.3+
> ---------------------------------------
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.2.17 
> for reload4j, and this creates maven dependency problems in parquet-cli
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli 
> ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> the hadoop common dependencies need to exclude this jar and any changed slf4j 
> ones.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] matthieun commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-22 Thread GitBox


matthieun commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r951644453


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {

Review Comment:
   The list is mostly used to keep the ordering, so that the dependency chain 
is printed in order in the warning message. I understand that if the schema 
definition is really deep with nested types it might be slower, but overall 
that list never grows bigger than the deepest nesting in the schema.
   If this is still a concern, I am happy to switch to HashSet at the expense 
of maybe dumbing down the log message (printing the nesting chain out of 
order would not be valuable anyway, I think).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583074#comment-17583074
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-----------------------------------------

matthieun commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r951644453


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {

Review Comment:
   The list is mostly used to keep the ordering, so that the dependency chain 
is printed in order in the warning message. I understand that if the schema 
definition is really deep with nested types it might be slower, but overall 
that list never grows bigger than the deepest nesting in the schema.
   If this is still a concern, I am happy to switch to HashSet at the expense 
of maybe dumbing down the log message (printing the nesting chain out of 
order would not be valuable anyway, I think).





> [parquet-protobuf] stack overflow when work with well known json type
> ---------------------------------------------------------------------
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current 
> implementation keeps referencing the 6 possible types defined in protobuf 
> (null, bool, number, string, struct, list) and enters an infinite loop when 
> referencing "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] matthieun commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-22 Thread GitBox


matthieun commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r951648555


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   It is possible to create circular dependencies; that is the problem. I am 
not sure in what case they would be useful, but since they can exist, parquet 
should not fail with `StackOverflowError` when it encounters them 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583076#comment-17583076
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-----------------------------------------

matthieun commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r951648555


##########
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##########
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   It is possible to create circular dependencies; that is the problem. I am 
not sure in what case they would be useful, but since they can exist, parquet 
should not fail with `StackOverflowError` when it encounters them 😄 





> [parquet-protobuf] stack overflow when work with well known json type
> ---------------------------------------------------------------------
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current 
> implementation keeps referencing the 6 possible types defined in protobuf 
> (null, bool, number, string, struct, list) and enters an infinite loop when 
> referencing "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)