I agree with what Ryan said.  In terms of implementation effort, using
the existing object models is a great choice.

However, as you tune your application, you may find that the
transformation to the physical format is suboptimal.  This is always a
risk when working through an abstraction.  The example I've seen
previously is that people create a union at a higher level than
necessary.
For example, imagine

old: {
  first:string
  last:string
}

new: {
  first:string
  last:string
  twitter_handle:string
}

People are inclined to union (old, new).  Last I checked, the default Avro
behavior in this situation is to create five columns: old_first, old_last,
new_first, new_last, and new_twitter_handle (the names are actually nested,
as group0.x, group1.x or something similar).  Depending on what is being
done, this can be suboptimal: a logical query of "select table.first from
table" now has to read two columns, manage two possibly different encoding
schemes, etc.
This will be even more impactful as we implement things like indices in the
physical layer.
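
Concretely, here is one way that union might be expressed in Avro (a
sketch only; Parquet needs a record at the root, so the union sits in a
field here, and the record and field names are illustrative):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "person", "type": [
      {"type": "record", "name": "Old", "fields": [
        {"name": "first", "type": "string"},
        {"name": "last", "type": "string"}
      ]},
      {"type": "record", "name": "New", "fields": [
        {"name": "first", "type": "string"},
        {"name": "last", "type": "string"},
        {"name": "twitter_handle", "type": "string"}
      ]}
    ]}
  ]
}

Last I checked, parquet-avro maps a union of records to a group of
optional members, which is how you end up with a separate first/last
pair per branch instead of one shared pair.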

In short, if you are using an abstraction, be aware that the physical
layout may not be as optimal as it would have been if you had hand-tuned
the schema with your particular application in mind.  The flip side is
that you save time and aggravation in implementation.
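
For the example above, the hand-tuned alternative would collapse the
union into a single record and make only the added field optional
(again a sketch; names are illustrative):

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": ["null", "string"], "default": null}
  ]
}

With that layout, "select table.first from table" reads a single
column no matter which version of the schema produced the row.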

Make sense?


On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <[email protected]> wrote:

> On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
>
>> Thanks Jacques and Alex.
>> I have been successfully using the Avro model to write Parquet files and
>> found it quite logical, because Avro is quite rich.
>> Are there any functional or performance impacts of using Avro-model-based
>> Parquet files, specifically w.r.t. accessing the generated Parquet files
>> through other tools like Drill, SparkSQL, etc.?
>> Thanks & Regards
>> MK
>>
>
> Hi MK,
>
> If Avro is the data model you're interested in using in your application,
> then parquet-avro is a good choice.
>
> For an application, it is perfectly reasonable to use Avro objects. There
> are a few reasons for this:
> 1. You have existing code based on the Avro format and object model
> 2. You want to use Avro-generated classes (avro-specific)
> 3. You want to use your own Java classes via reflection (avro-reflect)
> 4. You want compatibility with both storage formats
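>
> For example, a minimal writer using an Avro-generated class (option 2)
> is only a few lines.  This is a sketch: the generated class User is
> hypothetical, and depending on your version the package is parquet.avro
> or org.apache.parquet.avro.
>
>   import org.apache.hadoop.fs.Path;
>   import parquet.avro.AvroParquetWriter;
>
>   // The schema comes from the Avro-generated class.
>   AvroParquetWriter<User> writer =
>       new AvroParquetWriter<User>(new Path("users.parquet"),
>                                   User.getClassSchema());
>   writer.write(user);  // user is an instance of the generated User class
>   writer.close();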
>
> Similarly, you could use parquet-thrift if you preferred using Thrift
> objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
>
> The only reason you would want to build your own object model is if you
> are doing a translation step later. For example, Hive can translate Avro
> objects to the form it expects, but instead we implemented a Hive object
> model to go directly from Parquet to Hive's representation. That's faster
> and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
> and others have their own data models.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
