Thanks Jacques & Ryan.
Could any of you please point me to (or provide) some code snippets for writing
to Parquet with my own object model (NOT using Avro etc.)?
I don't have any complex unions etc.; my data model is very simple, with a
bunch of primitives and a few Arrays and Maps (of Strings -> Numbers).
Looking through existing code in Drill or Hive requires getting into the
context of those technologies, which often takes much more time than what's
actually needed.
I would greatly appreciate a small Java code snippet for writing a simple
JSON record like this to Parquet (without Avro etc.):
{"Name": "Ram", "Age": 30, "Departments": ["Sales", "Marketing"],
"Ratings": {"Dec": 100, "Nov": 50, "Oct": 200}}
Thanks a lot!
MK
On Wed, Apr 8, 2015 at 9:03 PM, Jacques Nadeau <[email protected]> wrote:
> I agree with what Ryan said. In terms of implementation effort, using
> the existing object models is great.
>
> However, as you try to tune your application, you may find that the way your
> data is mapped to the physical format is suboptimal. This is always a possible
> risk when working through an abstraction. The example I've seen previously
> is that people create a union at a higher level than is necessary.
> For example, imagine
>
> old: {
> first:string
> last:string
> }
>
> new: {
> first:string
> last:string
> twitter_handle:string
> }
>
> People are inclined to union (old, new). Last I checked, the default Avro
> behavior in this situation would be to create five columns: old_first,
> old_last, new_first, new_last, and new_twitter_handle (names are actually
> nested as group0.x, group1.x or something similar). Depending on what is
> being done, this can
> be suboptimal as a logical query of "select table.first from table" now has
> to read two columns, manage two possibly different encoding schemes, etc.
> This will be even more impactful as we implement things like indices in the
> physical layer.
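>
> To make the contrast concrete (the group names below are just illustrative of
> how the nesting tends to come out), the union approach ends up with a physical
> layout roughly like:
>
> union(old, new): {
>   group0: { first:string, last:string }
>   group1: { first:string, last:string, twitter_handle:string }
> }
>
> whereas a hand-tuned schema for the same data could simply be:
>
> {
>   first:string
>   last:string
>   twitter_handle:string (optional)
> }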
>
> In short, if you are using an abstraction, be aware that the physical
> layout may not be as optimal as it would have been if you had hand-tuned
> the schema with your particular application in mind. The flip-side is you
> save time and aggravation in implementation.
>
> Make sense?
>
>
> On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <[email protected]> wrote:
>
> > On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
> >
> >> Thanks Jacques and Alex.
> >> I have been successfully using the Avro model to write to Parquet files
> >> and found it quite logical, because Avro is quite rich.
> >> Are there any functional or performance impacts of using Avro-model-based
> >> Parquet files, specifically w.r.t. accessing the generated Parquet files
> >> through other tools like Drill, SparkSQL, etc.?
> >> Thanks & Regards
> >> MK
> >>
> >
> > Hi MK,
> >
> > If Avro is the data model you're interested in using in your application,
> > then parquet-avro is a good choice.
> >
> > For an application, it is perfectly reasonable to use Avro objects. There
> > are a few reasons for this:
> > 1. You have existing code based on the Avro format and object model
> > 2. You want to use Avro-generated classes (avro-specific)
> > 3. You want to use your own Java classes via reflection (avro-reflect)
> > 4. You want compatibility with both storage formats
> >
> > Similarly, you could use parquet-thrift if you preferred using Thrift
> > objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
> >
> > The only reason you would want to build your own object model is if you
> > are doing a translation step later. For example, Hive can translate Avro
> > objects to the form it expects, but instead we implemented a Hive object
> > model to go directly from Parquet to Hive's representation. That's faster
> > and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
> > and others have their own data models.
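> >
> > To give a rough idea of what "your own object model" looks like on the write
> > side, here is a simplified, untested sketch: a WriteSupport that walks your
> > own record class and drives Parquet's RecordConsumer directly ("Person" with
> > getName()/getAge() is a made-up POJO for illustration):
> >
> > import java.util.HashMap;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.parquet.hadoop.api.WriteSupport;
> > import org.apache.parquet.io.api.Binary;
> > import org.apache.parquet.io.api.RecordConsumer;
> > import org.apache.parquet.schema.MessageType;
> > import org.apache.parquet.schema.MessageTypeParser;
> >
> > public class PersonWriteSupport extends WriteSupport<Person> {
> >   private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
> >       "message Person { required binary Name (UTF8); required int32 Age; }");
> >   private RecordConsumer consumer;
> >
> >   @Override
> >   public WriteContext init(Configuration configuration) {
> >     // Tell Parquet which file schema the records will be written against.
> >     return new WriteContext(SCHEMA, new HashMap<String, String>());
> >   }
> >
> >   @Override
> >   public void prepareForWrite(RecordConsumer recordConsumer) {
> >     this.consumer = recordConsumer;
> >   }
> >
> >   @Override
> >   public void write(Person person) {
> >     // Translate one application object straight into Parquet write calls,
> >     // with no intermediate Avro/Thrift representation in between.
> >     consumer.startMessage();
> >     consumer.startField("Name", 0);
> >     consumer.addBinary(Binary.fromString(person.getName()));
> >     consumer.endField("Name", 0);
> >     consumer.startField("Age", 1);
> >     consumer.addInteger(person.getAge());
> >     consumer.endField("Age", 1);
> >     consumer.endMessage();
> >   }
> > }
> >
> > The read side is analogous: a ReadSupport plus converters that materialize
> > Parquet records directly into your own representation, which is what the
> > Drill, SparkSQL, and Hive object models do.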
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>