On Sun, Aug 17, 2014 at 9:12 PM, Gary Malouf <[email protected]> wrote:

> Hi Brock,
>
> Thank you for following up - I have some follow ups in-line:
>
>
> On Sun, Aug 17, 2014 at 11:44 PM, Brock Noland <[email protected]> wrote:
>
> > Hi,
> >
> > My comments inline
> >
> > On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]>
> > wrote:
> >
> > > My team currently uses Apache Spark over different types 'tables' of
> > > protobuf serialized to HDFS.  Today, the performance of our queries is
> > less
> > > than ideal and we are trying to figure out if using Parquet in specific
> > > places will help us.
> > >
> > > Questions:
> > >
> > > 1) Does a single protobuf message get broken up over a number of
> columns
> > as
> > > it seems to read?
> > >
> >
> > Storing a file with Protobuf messages written one directly after another
> > would be an example of a row-wise format. Meaning that all columns in a
> row
> > are grouped together. In a columnar format like Parquet, values of
> columns
> > are grouped together. This allows you to efficiently encode (compress)
> and
> > read columns.
> >
>
> So I interpret this to mean each field in a protobuf message's value is
> grouped along with that same value for other messages together.
>

Here is an example PB schema:

https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/test/resources/TestProtobuf.proto#L34

and the resulting Parquet schema:

https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/test/java/parquet/proto/ProtoSchemaConverterTest.java#L47


>
> >
> >
> > >
> > > 2) Our protobuf has mostly required fields - how does Parquet work with
> > > this when at query time we sometimes only need say 2 of our 15 fields?
> > >
> >
> > This is an ideal use of a columnar format such as Parquet since you won't
> > have to read the fields you don't care (the other 13) about off disk.
> >
>
> This only partially answers my question - protobuf messages have a concept
> of 'required fields' - wouldn't the message fail to initialize in-memory
> (for example, the JVM) if some of these are not set?
>

IIRC Protobuf throws this exception when calling build on the message
builder. I am guessing that is why ProtoRecordConverter allows you to
obtain the builder as opposed to the build message, by setting a flag:

https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/main/java/parquet/proto/ProtoRecordConverter.java#L78

Reply via email to