Hi,

My comments are inline.
On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]> wrote:

> My team currently uses Apache Spark over different types of 'tables' of
> protobuf serialized to HDFS. Today, the performance of our queries is less
> than ideal and we are trying to figure out if using Parquet in specific
> places will help us.
>
> Questions:
>
> 1) Does a single protobuf message get broken up over a number of columns as
> it seems to read?

Storing a file with Protobuf messages written one directly after another is
an example of a row-wise format, meaning that all of the columns in a row are
stored together. In a columnar format like Parquet, the values of each column
are stored together instead, which lets you encode (compress) and read
individual columns efficiently. (There's a short write-path sketch below.)

> 2) Our protobuf has mostly required fields - how does Parquet work with
> this when at query time we sometimes only need say 2 of our 15 fields?

This is an ideal use case for a columnar format such as Parquet, since you
won't have to read the 13 fields you don't care about off disk; see the
Spark example at the end.
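To make the row-wise vs. columnar distinction concrete, here is a minimal
write-path sketch using the parquet-protobuf module's ProtoParquetWriter.
MyEvent (a generated Protobuf message class), the events collection, and the
output path are all hypothetical placeholders; the import path may differ
depending on which release of parquet-mr you're on.

    import org.apache.hadoop.fs.Path
    import parquet.proto.ProtoParquetWriter

    // MyEvent is a hypothetical generated Protobuf message class;
    // events is a hypothetical in-memory collection of such messages.
    val path = new Path("hdfs:///data/events/part-00000.parquet")
    val writer = new ProtoParquetWriter[MyEvent](path, classOf[MyEvent])
    try {
      // Each message is shredded into per-column chunks on write, so
      // values of the same field end up stored together on disk.
      events.foreach(m => writer.write(m))
    } finally {
      writer.close()
    }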

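For the read side of question 2, a hedged sketch of column projection from
Spark SQL (Spark 1.x). The column names user_id and site are made up, and
depending on your Spark version registerTempTable may be registerAsTable.
Because Parquet stores each column's values contiguously with its own
metadata, a query that references only two fields reads only those two
column chunks off disk.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext("local", "parquet-projection")
    val sqlContext = new SQLContext(sc)

    // Load the Parquet files; the schema is read from the file footers.
    val events = sqlContext.parquetFile("hdfs:///data/events")
    events.registerTempTable("events")

    // Only the two referenced columns are read off disk; the other
    // 13 column chunks are skipped entirely.
    val pairs = sqlContext.sql("SELECT user_id, site FROM events")
    pairs.collect().foreach(println)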