[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386360#comment-17386360 ] ASF GitHub Bot commented on PARQUET-968: ccpstephanie commented on pull request #411: URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-885759396 Although it's closed but I'm a bit confused... why I always get the old schema version? `parquet.proto.writeSpecsCompliant=false`, and directly using ParquetWriter. I'm using the latest version currently 1.12.0. I'd highly appreciate if someone could point out something stupid in my code! Or it's the same issue you are experiencing? My goal is to be able query data via Athena/Presto, or Hive Metastore, so need the the new parquet schema version. **Method 1:** // Doesn't work! Configuration conf = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers). ParquetWriter writer = ProtoParquetWriter.builder(file).withMessage(cls).withConf(conf).build(); for (MessageOrBuilder record : records) { writer.write(record); } writer.close(); System.err.println(writer.getFooter()); **Method 2:** // Doesn't work! Configuration conf = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(conf, false); // If set to true, the old schema style will be used (without wrappers). try (ParquetWriter writer = new ParquetWriter( file, new ProtoWriteSupport(AddressBook.class), CompressionCodecName.GZIP, 128 * 1024 * 1024,//PARQUET_BLOCK_SIZE, ParquetProperties.DEFAULT_PAGE_SIZE, ParquetProperties.DEFAULT_PAGE_SIZE, true, false, ParquetProperties.DEFAULT_WRITER_VERSION, conf)) { for (Object record : messages) { writer.write(record); } writer.close(); System.err.println(writer.getFooter()); **Parquet output Metadata:** ` _ParquetMetaData{FileMetaData{schema: message AddressBookProtos.AddressBook { repeated group people = 1 { optional binary name (STRING) = 1; optional int32 id = 2; optional binary email (STRING) = 3; repeated group phones = 4 { optional binary number (STRING) = 1; optional binary type (ENUM) = 2; } }} , metadata: {parquet.proto.descriptor=name: "AddressBook" field { name: "people" number: 1 label: LABEL_REPEATED type: TYPE_MESSAGE type_name: ".AddressBookProtos.Person"} , parquet.proto.writeSpecsCompliant=false, ...} ` **Protobuf Messasge:** ` syntax = "proto3"; package AddressBookProtos; option java_multiple_files = true; option java_package = "com.mycompany.app"; option java_outer_classname = "AddressBookProtos"; message Person { string name = 1; int32 id = 2; string email = 3; enum PhoneType { MOBILE = 0; HOME = 1; WORK = 2; } message PhoneNumber { string number = 1; PhoneType type = 2; } repeated PhoneNumber phones = 4; } message AddressBook { repeated Person people = 1; } ` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616413#comment-16616413 ] Scott Clasen commented on PARQUET-968: -- Hi- Is there a snapshot repo anywhere where this has been published? would love to try it out. Any release schedule for when this gets in a non snapshot yet? Thank you! SC > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460119#comment-16460119 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385781922 @BenoitHanotte Here's the PR for constructor with flag #473 Also, if a minor release can be done for this fix it will be greatly appreciated. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459834#comment-16459834 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385718339 I believe the 1.10.0 has just been released, so this will liekly land in the next "major" release, unfortunately I am not aware of any plan to have a new release in the near future. For the `ProtoParquetWriter` class, we have discussed it with @costimuraru and we will add a constructor with the flag in a future PR but I can't commit on a timeframe. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458886#comment-16458886 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385494498 @BenoitHanotte Is there a timeline of when this will be released? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458294#comment-16458294 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385307748 @BenoitHanotte That's exactly how I am using it right now but it kind of defeats the purpose of having ProtoParquetWriter class. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458109#comment-16458109 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385270695 @chawlakunal you can manually create your ParquetWriter by providing the ProtoWriteSupport as following: ``` Configuration conf = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(true, conf); // set the flag in the configuration new ParquetWriter(file, conf, new ProtoWriteSupport(protoClass)); ``` (or any variation of this as ParquetWriter has multiple constructors that accept a configuration in which we can set the flag) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456936#comment-16456936 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385068068 @BenoitHanotte @costimuraru @julienledem There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided. ``` public ProtoParquetWriter(Path file, Class protoMessage, CompressionCodecName compressionCodecName, int blockSize, int pageSize, boolean enableDictionary, boolean validating, boolean writeSpecsCompliant) throws IOException { super(file, new ProtoWriteSupport(protoMessage), compressionCodecName, blockSize, pageSize, pageSize, enableDictionary, validating, DEFAULT_WRITER_VERSION, getConfigWithWriteSpecsCompliant(writeSpecsCompliant)); } private static Configuration getConfigWithWriteSpecsCompliant(boolean writeSpecsCompliant) { Configuration config = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(config, writeSpecsCompliant); return config; } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456935#comment-16456935 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385068068 @BenoitHanotte @costimuraru There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided. ``` public ProtoParquetWriter(Path file, Class protoMessage, CompressionCodecName compressionCodecName, int blockSize, int pageSize, boolean enableDictionary, boolean validating, boolean writeSpecsCompliant) throws IOException { super(file, new ProtoWriteSupport(protoMessage), compressionCodecName, blockSize, pageSize, pageSize, enableDictionary, validating, DEFAULT_WRITER_VERSION, getConfigWithWriteSpecsCompliant(writeSpecsCompliant)); } private static Configuration getConfigWithWriteSpecsCompliant(boolean writeSpecsCompliant) { Configuration config = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(config, writeSpecsCompliant); return config; } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456933#comment-16456933 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-385068068 @BenoitHanotte @costimuraru There is no way to instantiate ProtoParquetWriter with parquet.proto.writeSpecsCompliant flag enabled. Am I missing something or is this intentional? It would have been great if a constructor to enable the flag was provided. ``` public QcProtoParquetWriter(Path file, Class protoMessage, CompressionCodecName compressionCodecName, int blockSize, int pageSize, boolean enableDictionary, boolean validating, boolean writeSpecsCompliant) throws IOException { super(file, new ProtoWriteSupport(protoMessage), compressionCodecName, blockSize, pageSize, pageSize, enableDictionary, validating, DEFAULT_WRITER_VERSION, getConfigWithWriteSpecsCompliant(writeSpecsCompliant)); } private static Configuration getConfigWithWriteSpecsCompliant(boolean writeSpecsCompliant) { Configuration config = new Configuration(); ProtoWriteSupport.setWriteSpecsCompliant(config, writeSpecsCompliant); return config; } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > Fix For: 1.11 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453972#comment-16453972 ] Julien Le Dem commented on PARQUET-968: --- merged in https://github.com/apache/parquet-mr/commit/f84938441be49c665595c936ac631c3e5f171bf9 > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Assignee: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453965#comment-16453965 ] ASF GitHub Bot commented on PARQUET-968: julienledem closed pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java b/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java index b5649a05b..d5f43e6b3 100644 --- a/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java +++ b/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java @@ -24,12 +24,14 @@ import com.twitter.elephantbird.util.Protobufs; import org.apache.parquet.column.Dictionary; import org.apache.parquet.io.InvalidRecordException; +import org.apache.parquet.io.ParquetDecodingException; import org.apache.parquet.io.api.Binary; import org.apache.parquet.io.api.Converter; import org.apache.parquet.io.api.GroupConverter; import org.apache.parquet.io.api.PrimitiveConverter; import org.apache.parquet.schema.GroupType; import org.apache.parquet.schema.IncompatibleSchemaModificationException; +import org.apache.parquet.schema.OriginalType; import org.apache.parquet.schema.Type; import java.util.HashMap; @@ -129,10 +131,15 @@ public void add(Object value) { }; } +if (OriginalType.LIST == parquetType.getOriginalType()) { + return new ListConverter(parentBuilder, fieldDescriptor, parquetType); +} +if (OriginalType.MAP == parquetType.getOriginalType()) { + return new MapConverter(parentBuilder, fieldDescriptor, parquetType); +} return newScalarConverter(parent, parentBuilder, fieldDescriptor, parquetType); } - private Converter newScalarConverter(ParentValueContainer pvc, Message.Builder parentBuilder, Descriptors.FieldDescriptor fieldDescriptor, Type parquetType) { JavaType javaType = fieldDescriptor.getJavaType(); @@ -345,4 +352,121 @@ public void addBinary(Binary binary) { } } + + /** + * This class unwraps the additional LIST wrapper and makes it possible to read the underlying data and then convert + * it to protobuf. + * + * Consider the following protobuf schema: + * message SimpleList { + * repeated int64 first_array = 1; + * } + * + * A LIST wrapper is created in parquet for the above mentioned protobuf schema: + * message SimpleList { + * optional group first_array (LIST) = 1 { + * repeated group list { + * optional int32 element; + * } + * } + * } + * + * The LIST wrappers are used by 3rd party tools, such as Hive, to read parquet arrays. The wrapper contains + * a repeated group named 'list', itself containing only one field called 'element' of the type of the repeated + * object (can be a primitive as in this example or a group in case of a repeated message in protobuf). + */ + final class ListConverter extends GroupConverter { +private final Converter converter; + +public ListConverter(Message.Builder parentBuilder, Descriptors.FieldDescriptor fieldDescriptor, Type parquetType) { + OriginalType originalType = parquetType.getOriginalType(); + if (originalType != OriginalType.LIST || parquetType.isPrimitive()) { +throw new ParquetDecodingException("Expected LIST wrapper. Found: " + originalType + " instead."); + } + + GroupType rootWrapperType = parquetType.asGroupType(); + if (!rootWrapperType.containsField("list") || rootWrapperType.getType("list").isPrimitive()) { +throw new ParquetDecodingException("Expected repeated 'list' group inside LIST wrapperr but got: " + rootWrapperType); + } + + GroupType listType = rootWrapperType.getType("list").asGroupType(); + if (!listType.containsField("element")) { +throw new ParquetDecodingException("Expected 'element' inside repeated list group but got: " + listType); + } + + Type elementType = listType.getType("element"); + converter = newMessageConverter(parentBuilder, fieldDescriptor, elementType); +} + +@Override +public Converter getConverter(int fieldIndex) { + if (fieldIndex > 0) { +throw new ParquetDecodingException("Unexpected multiple fields in the LIST wrapper"); + } + + return new GroupConverter() { +@Override +public Converter getConverter(int fieldIndex) { + return converter; +} + +@Override +public void start() { + +} + +
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451381#comment-16451381 ] ASF GitHub Bot commented on PARQUET-968: chawlakunal commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-384106975 When can this be expected to be merged to master and released? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450169#comment-16450169 ] ASF GitHub Bot commented on PARQUET-968: julienledem commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-383997195 This looks good. Thank you for this collaborative effort! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450040#comment-16450040 ] ASF GitHub Bot commented on PARQUET-968: lukasnalezenec commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-383969694 Hi, I already did. There is one typo in comment and it is little bit harder to read - i wanted to check flow once more. I think we can commit it as it is. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449697#comment-16449697 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-383903265 Hello @lukasnalezenec, have you had time to have a look? Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440773#comment-16440773 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381955530 @lukasnalezenec this is the schema written for a map with the flag set to true: ``` optional group nonEmptyMap (MAP) = 5 { repeated group key_value { required int32 key; optional int32 value; } } ``` There is a test suite at https://github.com/BenoitHanotte/parquet-968/blob/master/src/test/scala/org/bhnte/parquet968/SparkTest.scala to validate the support of empty maps and lists as well as to check that Spark is now able to correctly interpret maps as such with the flag set to true This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440765#comment-16440765 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381955530 @lukasnalezenec this is the schema written for a map with the flag set to true: ``` optional group nonEmptyMap (MAP) = 5 { repeated group key_value { required int32 key; optional int32 value; } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440764#comment-16440764 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381955530 @lukasnalezenec this is the schema written for a map with the flag set to true: ``` optional group emptyMap (MAP) = 5 { repeated group key_value { required int32 key; optional int32 value; } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440763#comment-16440763 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381955530 @lukasnalezenec this is the schema written with the specCompliantFlag set to true: ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440573#comment-16440573 ] ASF GitHub Bot commented on PARQUET-968: lukasnalezenec commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381890716 Hi, I will look to it and merge it ASAP. Last think i want to check is the map repetition - how it's done ? Is that correct for sure? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440534#comment-16440534 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381881399 @costimuraru Perfect, thanks for merging my PR into your branch! :) Then I believe we have an agreement, all the comments have been addressed and this PR has been successfully tested. @julienledem, if your are ok with it, we should be ready to merge. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437369#comment-16437369 ] ASF GitHub Bot commented on PARQUET-968: costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381155272 Would be great if we could merge this by Apr 29, when this PR will turn one year :D This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437294#comment-16437294 ] ASF GitHub Bot commented on PARQUET-968: costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-381134753 @BenoitHanotte sounds awesome! I successfully tested this final patch and it works great with AWS Athena (Presto). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426863#comment-16426863 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-378924606 @costimuraru @lukasnalezenec do we have an agreement? Can we merge the 2 PRs? These changes have been in prod here for some time, and I would like them to be merged so that we can start to work on other tickets that would otherwise require an expansive rebase after this one is merged. @virtualluke These changes are backward compatible for reading so you won't need to set any flag when reading. Once you have read the data, you will be able to write them using the "specs-compliant" schemas using the `parquet.proto.writeSpecsCompliant` flag This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426861#comment-16426861 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-378924606 @costimuraru do we have an agreement? Can we merge the 2 PRs? These changes have been in prod here for some time, and I would like them to be merged so that we can start to work on other tickets that would otherwise require an expansive rebase after this one is merged. @virtualluke These changes are backward compatible for reading so you won't need to set any flag when reading. Once you have read the data, you will be able to write them using the "specs-compliant" schemas using the `parquet.proto.writeSpecsCompliant` flag This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426797#comment-16426797 ] ASF GitHub Bot commented on PARQUET-968: virtualluke commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-378911550 When this is implemented there will be a flag where I could read parquet files which have the 2-level repetition style and convert them to the 3-level style so they would be compatible with the current state of parquet parsing libraries built with parquet-cpp (thinking pyarrow here) ? Does this look like it will be merged soon? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418262#comment-16418262 ] ASF GitHub Bot commented on PARQUET-968: rdblue commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-377076348 These changes sound good to me once there is consensus from everyone that understands the protobuf parts. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415757#comment-16415757 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r177454755 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -345,4 +351,121 @@ public void addBinary(Binary binary) { } } + + /** + * This class unwraps the additional LIST wrapper and makes it possible to read the underlying data and then convert + * it to protobuf. + * + * Consider the following protobuf schema: + * message SimpleList { + * repeated int64 first_array = 1; + * } + * + * A LIST wrapper is created in parquet for the above mentioned protobuf schema: + * message SimpleList { + * required group first_array (LIST) = 1 { + * repeated int32 element; + * } + * } + * + * The LIST wrappers are used by 3rd party tools, such as Hive, to read parquet arrays. The wrapper contains + * one only one field: either a primitive field (like in the example above, where we have an array of ints) or + * another group (array of messages). + */ + final class ListConverter extends GroupConverter { +private final Converter converter; +private final boolean listOfMessage; + +public ListConverter(Message.Builder parentBuilder, Descriptors.FieldDescriptor fieldDescriptor, Type parquetType) { + OriginalType originalType = parquetType.getOriginalType(); + if (originalType != OriginalType.LIST) { +throw new ParquetDecodingException("Expected LIST wrapper. Found: " + originalType + " instead."); + } + + listOfMessage = fieldDescriptor.getJavaType() == JavaType.MESSAGE; Review comment: done in https://github.com/costimuraru/parquet-mr/pull/2 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415758#comment-16415758 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r177454793 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -388,7 +388,7 @@ public ListConverter(Message.Builder parentBuilder, Descriptors.FieldDescriptor if (parquetType.asGroupType().containsField("list")) { parquetSchema = parquetType.asGroupType().getType("list"); if (parquetSchema.asGroupType().containsField("element")) { - parquetSchema.asGroupType().getType("element"); + parquetSchema = parquetSchema.asGroupType().getType("element"); Review comment: done in https://github.com/costimuraru/parquet-mr/pull/2 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415756#comment-16415756 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r177454394 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoWriteSupport.java ## @@ -247,21 +282,45 @@ final void writeRawValue(Object value) { @Override final void writeField(Object value) { recordConsumer.startField(fieldName, index); + recordConsumer.startGroup(); List list = (List) value; + recordConsumer.startField("list", 0); // This is the wrapper group for the array field for (Object listEntry: list) { +recordConsumer.startGroup(); + +recordConsumer.startField("element", 0); // This is the mandatory inner field + +if (!isPrimitive(listEntry)) { + recordConsumer.startGroup(); +} + fieldWriter.writeRawValue(listEntry); + +if (!isPrimitive(listEntry)) { + recordConsumer.endGroup(); +} + +recordConsumer.endField("element", 0); + +recordConsumer.endGroup(); } + recordConsumer.endField("list", 0); + recordConsumer.endGroup(); recordConsumer.endField(fieldName, index); } } + private boolean isPrimitive(Object listEntry) { +return !(listEntry instanceof Message); Review comment: done in https://github.com/costimuraru/parquet-mr/pull/2, I removed the isPrimitive method as wrapping with a group can be done in `MessageWriter.writeRawValue()`, removing the need to handle non-primitive types differently inside the ArrayWriter. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415751#comment-16415751 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r177453689 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -345,4 +351,121 @@ public void addBinary(Binary binary) { } } + + /** + * This class unwraps the additional LIST wrapper and makes it possible to read the underlying data and then convert + * it to protobuf. + * + * Consider the following protobuf schema: + * message SimpleList { + * repeated int64 first_array = 1; + * } + * + * A LIST wrapper is created in parquet for the above mentioned protobuf schema: + * message SimpleList { + * required group first_array (LIST) = 1 { + * repeated int32 element; + * } + * } + * + * The LIST wrappers are used by 3rd party tools, such as Hive, to read parquet arrays. The wrapper contains + * one only one field: either a primitive field (like in the example above, where we have an array of ints) or + * another group (array of messages). + */ + final class ListConverter extends GroupConverter { +private final Converter converter; +private final boolean listOfMessage; + +public ListConverter(Message.Builder parentBuilder, Descriptors.FieldDescriptor fieldDescriptor, Type parquetType) { + OriginalType originalType = parquetType.getOriginalType(); + if (originalType != OriginalType.LIST) { +throw new ParquetDecodingException("Expected LIST wrapper. Found: " + originalType + " instead."); + } + + listOfMessage = fieldDescriptor.getJavaType() == JavaType.MESSAGE; + + Type parquetSchema; + if (parquetType.asGroupType().containsField("list")) { Review comment: done in https://github.com/costimuraru/parquet-mr/pull/2 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415749#comment-16415749 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r177453607 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -129,10 +131,14 @@ public void add(Object value) { }; } -return newScalarConverter(parent, parentBuilder, fieldDescriptor, parquetType); +OriginalType originalType = parquetType.getOriginalType() == null ? OriginalType.UTF8 : parquetType.getOriginalType(); Review comment: I believe the reason for this is that if originalType is null, the swicth will throw an exception. In https://github.com/costimuraru/parquet-mr/pull/2 I replaced the switch with an if. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415744#comment-16415744 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-376552693 @costimuraru @lukasnalezenec @qinghui-xu I have updated my PR (https://github.com/costimuraru/parquet-mr/pull/2) on top of this one with a comment addressing the code review feedback. Could you please have a look? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407735#comment-16407735 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-374895431 @costimuraru I updated my PR in order to isolate the repetition-level change in a separate commit. However with lists as required we are likely creating incorrect files: the root wrapper is `required` but not written as protobuf's `pb.getAllFields()` only returns fields that are set (i.e. not an empty list), thus the field for the list root wrapper is never written despite being `required`. Here is how Spark would show the resulting dataset if lists are required: ``` +-+--+-++++ |intNotSet|intSet|emptyRepeated|nonEmptyRepeated|emptyMap| nonEmptyMap| +-+--+-++++ | null| 1| []| [1, 1]|null|[1 -> 1, 2 -> 2]| +-+--+-++++ ``` We can see that empty maps and primitives that are not set are null, however lists show up as empty This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407687#comment-16407687 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-374883095 @virtualluke it seems related: the issue seems to be that your file are missing the LIST wrapper around repeated fields (the 3-levels lists), and instead using the 2-levels repetition style. The 2-levels style is what parquet-protobuf currently does and this PR's aim is to allow it to use the 3-levels way, which is how the parquet specs define lists and maps when writting parquet files. The PR you linked to does the opposite: allowing to read files that have been written with the 2-levels style. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405740#comment-16405740 ] ASF GitHub Bot commented on PARQUET-968: virtualluke commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-374458355 Currently I am handed parquet files generated from parquet-mr and the current way LIST is handled makes the parquet files unusable in anything in the current python stack I have tried (arrow/parquet-cpp or fastparquet). This PR seems like it would address the difference? https://issues.apache.org/jira/browse/PARQUET-1122 . Or is it similar but not related? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404787#comment-16404787 ] ASF GitHub Bot commented on PARQUET-968: costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-374205305 Moving the discussion from the fork branch (https://github.com/costimuraru/parquet-mr/pull/2/files#r175112235), here. > costimuraru 3 days ago Owner > @BenoitHanotte, we had a discussion a while back around the LIST wrapper being REQUIRED: > https://github.com/apache/parquet-mr/pull/411#issuecomment-301194565 > > @BenoitHanotte > BenoitHanotte 3 days ago > I strongly believe the wrappers should be optional for the following reasons: > > it is consistent with how Maps are handled > it is consistent with the fact that parquet-protobuf does not write fields if they have the default values (protobuf considers them cleared). this is currently the case for primitives and complex messages, I would like collections to behave the same way. > every field is optional with protobuf 3 > I haven't seen any arguments for making the wrapper required in the first place, was there a specific reason I am not aware of? @julienledem, regarding your previous comment (https://github.com/apache/parquet-mr/pull/411#issuecomment-301194565), is there a specific reason for which the LIST wrapper needs to be `REQUIRED` and not `OPTIONAL`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404785#comment-16404785 ] ASF GitHub Bot commented on PARQUET-968: costimuraru commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-374205305 > costimuraru 3 days ago Owner > @BenoitHanotte, we had a discussion a while back around the LIST wrapper being REQUIRED: > https://github.com/apache/parquet-mr/pull/411#issuecomment-301194565 > > @BenoitHanotte > BenoitHanotte 3 days ago > I strongly believe the wrappers should be optional for the following reasons: > > it is consistent with how Maps are handled > it is consistent with the fact that parquet-protobuf does not write fields if they have the default values (protobuf considers them cleared). this is currently the case for primitives and complex messages, I would like collections to behave the same way. > every field is optional with protobuf 3 > I haven't seen any arguments for making the wrapper required in the first place, was there a specific reason I am not aware of? @julienledem, regarding your previous comment (https://github.com/apache/parquet-mr/pull/411#issuecomment-301194565), is there a specific reason for which the LIST wrapper needs to be `REQUIRED` and not `OPTIONAL`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401822#comment-16401822 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-373654480 @costimuraru I did a PR in your repo to add a flag that enable writing using the specs-compliant schemas: https://github.com/costimuraru/parquet-mr/pull/2 . The flag is off by default in order to keep backward compatibility. I believe this is the safest way going forward and will allow to merge this PR without breaking backward compatibility. The flag could be set to true in a future major release of parquet. You can find a test suite at https://github.com/BenoitHanotte/parquet-968 that highlights the differences in how the old schemas and the new ones (specs-compliant) are interpreted by Spark and validates the changes in this PR: - default behavior (with flag set to false, default behavior, backward compatible): ``` +-++++ |emptyRepeated|nonEmptyRepeated|emptyMap| nonEmptyMap| +-++++ | []| [1, 1]| []|[[1, 1], [2, 2]]| +-++++ ``` - specs-compliant schemas: ``` +-++++ |emptyRepeated|nonEmptyRepeated|emptyMap| nonEmptyMap| +-++++ | null| [1, 1]|null|[1 -> 1, 2 -> 2]| +-++++ ``` You can see that maps are interpreted differently (they are interpreted as lists of key-value tuples with the non-compliant schemas) as are the default values (empty lists and maps). @lukasnalezenec @qinghui-xu feel free to also have a look as we would like to have this PR merged if all the concerns are addressed Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401662#comment-16401662 ] ASF GitHub Bot commented on PARQUET-968: BenoitHanotte commented on issue #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#issuecomment-373654480 @costimuraru I did a PR in your repo to add a flag that enablse writing using the specs-compliant schemas: https://github.com/costimuraru/parquet-mr/pull/2 . The flag is off by default in order to keep backward compatibility. I believe this is the safest way going forward and will allow to merge this PR without breaking backward compatibility. the flag could be set to true in a future major release of parquet. You can find a test suite at https://github.com/BenoitHanotte/parquet-968 that highlights the differences in how the old schemas and the new ones (specs-compliant) are interpreted by Spark and validates the changes in this PR: - default behavior (with flag set to false, default behavior, backward compatible): ``` +-++++ |emptyRepeated|nonEmptyRepeated|emptyMap| nonEmptyMap| +-++++ | []| [1, 1]| []|[[1, 1], [2, 2]]| +-++++ ``` - specs-compliant schemas: ``` +-++++ |emptyRepeated|nonEmptyRepeated|emptyMap| nonEmptyMap| +-++++ | null| [1, 1]|null|[1 -> 1, 2 -> 2]| +-++++ ``` You can see that maps are interpreted differently (they are interpreted as lists of key-value tuples with the non-compliant schemas) as are the default values (empty lists and maps). @lukasnalezenec @qinghui-xu feel free to also have a look as we would like to have this PR merged if all the concerns are resolved Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397313#comment-16397313 ] ASF GitHub Bot commented on PARQUET-968: qinghui-xu commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r174216544 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -129,10 +131,14 @@ public void add(Object value) { }; } -return newScalarConverter(parent, parentBuilder, fieldDescriptor, parquetType); +OriginalType originalType = parquetType.getOriginalType() == null ? OriginalType.UTF8 : parquetType.getOriginalType(); Review comment: I guess for the data generated in previous version of parquet-protobuf, it is not having the "OriginalType" annotation for repeated fields, thus this conditional test to be backward compatible. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)