Hi Brock,

Thank you for following up - I have some follow-ups in-line:


On Sun, Aug 17, 2014 at 11:44 PM, Brock Noland <[email protected]> wrote:

> Hi,
>
> My comments inline
>
> On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]>
> wrote:
>
> > My team currently uses Apache Spark over different types of 'tables'
> > of protobuf messages serialized to HDFS.  Today, the performance of
> > our queries is less than ideal and we are trying to figure out if
> > using Parquet in specific places will help us.
> >
> > Questions:
> >
> > 1) Does a single protobuf message get broken up across a number of
> > columns, as it seems from what I've read?
> >
>
> Storing a file with Protobuf messages written one directly after another
> would be an example of a row-wise format, meaning that all of the columns
> in a row are grouped together. In a columnar format like Parquet, the
> values of each column are grouped together. This allows you to encode
> (compress) and read individual columns efficiently.
>

So I interpret this to mean that each field's value in a protobuf message
is grouped together with the values of that same field from other messages.
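
To check that mental model, here is a toy sketch in Scala (made-up field
names; this is just my reading of the idea, not Parquet's actual encoding):

    case class Event(userId: Long, eventType: String, tsMs: Long)

    val events = Seq(Event(1, "click", 1000), Event(2, "view", 1001))

    // Row-wise (protobuf-style): each record's fields are kept together.
    val rowWise = events.map(e => (e.userId, e.eventType, e.tsMs))

    // Columnar (Parquet-style): all values of one field are grouped,
    // then all values of the next field, and so on.
    val userIds    = events.map(_.userId)     // column 1
    val eventTypes = events.map(_.eventType)  // column 2
    val tsMs       = events.map(_.tsMs)       // column 3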

>
>
> >
> > 2) Our protobuf has mostly required fields - how does Parquet work with
> > this when, at query time, we sometimes only need, say, 2 of our 15
> > fields?
> >
>
> This is an ideal use of a columnar format such as Parquet, since you won't
> have to read the fields you don't care about (the other 13) off disk.
>

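The projection benefit makes sense. For example, I'd expect a read like
the following (hypothetical HDFS path and column names; sc is the
SparkContext provided by spark-shell) to pull only two of the fifteen
columns off disk:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val events = sqlContext.parquetFile("hdfs:///data/events")
    events.registerTempTable("events")

    // Only the user_id and ts_ms columns should be read from HDFS.
    val twoFields = sqlContext.sql("SELECT user_id, ts_ms FROM events")
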
This only partially answers my question, though - protobuf messages have a
concept of 'required fields' - wouldn't the message fail to initialize in
memory (for example, on the JVM) if some of those fields are not set?
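
For concreteness, given a hypothetical generated class Event (made-up
schema and field names), this is the failure mode I have in mind:

    // Hypothetical schema, compiled with protoc:
    //
    //   message Event {
    //     required int64 user_id = 1;
    //     required int64 ts_ms   = 2;
    //     // ... 13 more required fields ...
    //   }
    //
    // Setting only some of the required fields and calling build() throws
    // com.google.protobuf.UninitializedMessageException at runtime.
    val partial = Event.newBuilder()
      .setUserId(42L)
      .setTsMs(1408300000000L)
      .build()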
