[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-23 Thread GitBox


shangxinli commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r952764513


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {

Review Comment:
   Makes sense
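The ancestor-list guard in the diff above can be illustrated with a standalone sketch (the `Node` type and the reachability count here are hypothetical, not parquet-mr API): recursion into a self-referential structure stops as soon as a node's full name is already on the path from the root, which is what the `parentNames` check does.

```java
import java.util.ArrayList;
import java.util.List;

public class CycleBreaker {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Count nodes reachable from `node`, skipping any edge that would revisit
    // an ancestor already on the path from the root (the parentNames list).
    static int countReachable(Node node, List<String> parentNames) {
        if (parentNames.contains(node.name)) {
            return 0; // circular dependency: skip instead of recursing forever
        }
        List<String> newParentNames = new ArrayList<>(parentNames);
        newParentNames.add(node.name);
        int count = 1;
        for (Node child : node.children) {
            count += countReachable(child, newParentNames);
        }
        return count;
    }

    public static void main(String[] args) {
        Node struct = new Node("google.protobuf.Struct");
        Node value = new Node("google.protobuf.Value");
        struct.children.add(value);
        value.children.add(struct); // Struct -> Value -> Struct, as in PARQUET-1711
        System.out.println(countReachable(struct, new ArrayList<>())); // prints 2
    }
}
```

Without the `parentNames.contains` check, the recursion on the Struct/Value pair above would never terminate, which is the StackOverflowError reported in the JIRA.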



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: Fail to read back written large parquet file

2022-08-23 Thread Chao Sun
Jozef, feel free to open a Parquet JIRA to give the issue more
details. Ideally the writer should recover itself and produce the
correct result, but I don't have enough context yet, so I'm not sure
whether it's doable.

On Mon, Aug 8, 2022 at 1:53 AM Jozef Vilcek  wrote:
>
> Found it. Problem in my case was not with Parquet but my implementation of
> the `OutputFile` wrapper providing `PositionOutputStream`.
>
> Would it make sense to do changes to the writer to crash on negative
> offsets rather than continue and produce unreadable results.
>
> On Fri, Aug 5, 2022 at 8:42 PM Chao Sun  wrote:
>
> > Seems the file was corrupted during write. There's a similar issue
> > https://issues.apache.org/jira/browse/PARQUET-2164 we found recently.
> >
> > On Fri, Aug 5, 2022 at 3:40 AM Steve Loughran
> >  wrote:
> > >
> > > tha has to be an integer wraparound...something is using a signed int for
> > > position, so when it goes above 2GB it goes negative, and a seek(negative
> > > value) is rejected.
> > >
> > > fix: find the variable and make it a long
> > >
> > >
> > >
> > > On Thu, 4 Aug 2022 at 11:09, Jozef Vilcek  wrote:
> > >
> > > > I came across a case where a job writes out a data set in parquet format
> > > > and it cannot be read back, as it appears to be corrupted.
> > > >
> > > > Files fail to read back if their size goes over 2GB. If I set the job
> > > > to produce more, smaller files from exactly the same input, all is good.
> > > >
> > > > The job writes Avro messages to parquet via `parquet-avro` and
> > > > `parquet-mr`. It does happen with both v1.10.1 and v1.12.0.
> > > >
> > > > Read error is:
> > > >
> > > > Cannot seek to negative offset
> > > > java.io.EOFException: Cannot seek to negative offset
> > > > at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1454)
> > > > at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
> > > > at org.apache.parquet.hadoop.util.H2SeekableInputStream.seek(H2SeekableInputStream.java:60)
> > > > at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1157)
> > > > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> > > >
> > > > When digging a bit into the read, the code materializing
> > > > `ColumnChunkMetaData` [1] starts to see negative values for
> > > > `firstDataPage`. Printing some info from `reader.getRowGroups` yields:
> > > >
> > > >
> > > > startingPos=4, totalBytesSize=519551822, rowCount=2300100
> > > > startingPos=108156606, totalBytesSize=517597985, rowCount=2300100
> > > > ...
> > > > startingPos=1950017569, totalBytesSize=511705703, rowCount=2300100
> > > > startingPos=2058233752, totalBytesSize=521762439, rowCount=2300100
> > > > startingPos=-2128348908, totalBytesSize=508570588, rowCount=2300100
> > > > startingPos=-2020294298, totalBytesSize=518901187, rowCount=2300100
> > > > startingPos=-1911848035, totalBytesSize=512724804, rowCount=2300100
> > > > startingPos=-1803573306, totalBytesSize=510980877, rowCount=2300100
> > > > startingPos=-1695543557, totalBytesSize=525871692, rowCount=2300100
> > > > startingPos=-1587016600, totalBytesSize=519353830, rowCount=2300100
> > > > startingPos=-1478696427, totalBytesSize=451032173, rowCount=2090372
> > > >
> > > >
> > > >
> > > > Unfortunately, I was not able to reproduce it locally by taking the avro
> > > > schema, generating random inputs, and writing them out to a local file.
> > > > Every time, compressed or uncompressed, the 3GB file read back correctly.
> > > >
> > > > I am looking for help in finding a solution or hints for debugging
> > > > this, as I am out of clues to pinpoint and reproduce the problem.
> > > >
> > > > Thanks!
> > > >
> > > > [1]
> > > >
> > > >
> > https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L127
> > > >
> >
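Steve's wraparound diagnosis matches the reported offsets. A minimal sketch (the 108,384,636-byte delta is a hypothetical next-chunk size, chosen only because it reproduces the first negative startingPos in the report) shows how 32-bit position arithmetic wraps past 2GB while 64-bit arithmetic does not:

```java
public class WraparoundDemo {
    public static void main(String[] args) {
        long lastGoodOffset = 2_058_233_752L; // last positive startingPos in the report
        long nextChunkSize = 108_384_636L;    // hypothetical size of the next row group

        // 32-bit arithmetic wraps past Integer.MAX_VALUE (2_147_483_647)...
        int wrapped = (int) (lastGoodOffset + nextChunkSize);
        // ...while 64-bit arithmetic keeps the true file position.
        long correct = lastGoodOffset + nextChunkSize;

        System.out.println(wrapped); // prints -2128348908, the first negative startingPos reported
        System.out.println(correct); // prints 2166618388
    }
}
```

This is why the suggested fix is simply to find the variable tracking the position and make it a long.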


[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-23 Thread GitBox


shangxinli commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r952778132


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   In that case, we silently break the circle without throwing an exception. Is 
 that OK? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-08-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583688#comment-17583688
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-

shangxinli commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r952778132


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   In that case, we silently break the circle without throwing an exception. Is 
that OK? 





> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
> to work around json schema conversion. 
> However, writing the above message traps the parquet writer in an infinite loop due 
> to the "general type" support in protobuf. The current implementation keeps 
> referencing the 6 possible types defined in protobuf (null, bool, number, string, 
> struct, list) and enters an infinite loop when referencing "struct".
> {code:java}
> java.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044) at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)



[GitHub] [parquet-mr] shangxinli merged pull request #986: Prevent IntelliJ from making unsolicited whitespace changes

2022-08-23 Thread GitBox


shangxinli merged PR #986:
URL: https://github.com/apache/parquet-mr/pull/986


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] matthieun commented on a diff in pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-08-23 Thread GitBox


matthieun commented on code in PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#discussion_r952835674


##
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java:
##
@@ -79,12 +80,20 @@ public MessageType convert(Class<? extends Message> protobufClass) {
   }
 
   /* Iterates over list of fields. **/
-  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors) {
+  private <T> GroupBuilder<T> convertFields(GroupBuilder<T> groupBuilder, List<FieldDescriptor> fieldDescriptors, List<String> parentNames) {
     for (FieldDescriptor fieldDescriptor : fieldDescriptors) {
-      groupBuilder =
-          addField(fieldDescriptor, groupBuilder)
+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),

Review Comment:
   Well, another option would be to add a new configuration setting to allow 
the user to either have it fail with a good error message, or just silently 
break the circle like this. However, I am not familiar with how 
`parquet-protobuf` is configured. If I should go that route, I'd appreciate 
some examples!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Created] (PARQUET-2175) Skip method skips levels and not rows for repeated fields

2022-08-23 Thread fatemah (Jira)
fatemah created PARQUET-2175:


 Summary: Skip method skips levels and not rows for repeated fields
 Key: PARQUET-2175
 URL: https://issues.apache.org/jira/browse/PARQUET-2175
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: fatemah


The implementation of the TypedColumnReader::Skip method, with signature:

virtual int64_t Skip(int64_t num_levels_to_skip) = 0;

skips levels for both repeated and non-repeated fields. We want to be able to 
skip rows for repeated fields; skipping levels is not that useful.

For example, for the following rows:

message M { repeated int32 b = 1; }

rows: {}, {[10, 10]}, {[20, 20, 20]}

values = {10, 10, 20, 20, 20};
def_levels = {0, 1, 1, 1, 1, 1};
rep_levels = {0, 0, 1, 0, 1, 1};

We want skip(2) to skip the first two rows, so that the next value that we read 
is 20. However, it will skip the first two levels, and the next value that we 
read is 10.
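The row-aware behavior the report asks for can be sketched as follows (a hedged illustration, not the parquet-cpp API, and written in Java to match the rest of this digest): since a repetition level of 0 marks the start of a new row, skipping N rows means consuming level entries up to, but not including, the (N+1)-th rep_level == 0 boundary.

```java
public class SkipDemo {
    // Number of leading (def, rep) level entries to consume so that `numRows`
    // whole rows are skipped: every rep_level == 0 starts a new row.
    static int levelsToSkip(int[] repLevels, int numRows) {
        int rowStartsSeen = 0;
        for (int i = 0; i < repLevels.length; i++) {
            if (repLevels[i] == 0) {
                rowStartsSeen++;
                if (rowStartsSeen > numRows) {
                    return i; // index of the first level of the first kept row
                }
            }
        }
        return repLevels.length; // numRows or fewer rows remain: skip everything
    }

    public static void main(String[] args) {
        int[] repLevels = {0, 0, 1, 0, 1, 1}; // rows {}, [10, 10], [20, 20, 20]
        // Skipping 2 rows must consume 3 level entries (the empty row plus the
        // two 10s), so the next value read is 20.
        System.out.println(levelsToSkip(repLevels, 2)); // prints 3
    }
}
```

A level-based skip(2), by contrast, consumes only the first two entries and leaves the second 10 as the next value, which is exactly the discrepancy described above.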



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Interest in adding the float16 logical type to the Parquet spec

2022-08-23 Thread Anja
Hello!

Is there interest in having the float16 logical type standardised in the
Parquet spec? I am proposing a PR for Arrow that will write float16 to
Parquet as FixedSizeBinary: https://issues.apache.org/jira/browse/ARROW-17464
but for the sake of portability between data analysis tools, it would of
course be a lot better to have this type standardised in the format itself.

Previous requests for this have been here:
https://issues.apache.org/jira/browse/PARQUET-1647 and here:
https://issues.apache.org/jira/browse/PARQUET-758 .

With the development of neural networks, half-precision floating points are
becoming more popular:
https://en.wikipedia.org/wiki/Half-precision_floating-point_format ; I do
think that a demand exists for its support. I am new to the project, but am
happy to contribute development time if there is support for this feature,
and guidance.

Warm regards,

Anja