MK,
Here's a link to the Avro reader. It sounds like you're familiar with
Avro, so that might be the easiest to read.
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroIndexedRecordConverter.java
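The heart of that file is a tree of converters: a GroupConverter for each
record or group, and a PrimitiveConverter for each leaf column. Stripped
of the Avro specifics, the skeleton looks roughly like this (just a
sketch; MyRecord and setName are placeholder names for your own model):

import parquet.io.api.Binary;
import parquet.io.api.Converter;
import parquet.io.api.GroupConverter;
import parquet.io.api.PrimitiveConverter;
import parquet.schema.MessageType;

class MyRecordConverter extends GroupConverter {
  private final Converter[] converters;
  private MyRecord current;

  MyRecordConverter(MessageType schema) {
    converters = new Converter[schema.getFieldCount()];
    // one converter per field; leaf columns get PrimitiveConverters
    converters[0] = new PrimitiveConverter() {
      @Override
      public void addBinary(Binary value) {
        current.setName(value.toStringUsingUTF8());
      }
    };
    // ... remaining fields; nested groups get their own GroupConverters
  }

  @Override
  public Converter getConverter(int fieldIndex) {
    return converters[fieldIndex];
  }

  @Override
  public void start() {
    current = new MyRecord();  // a new record is starting
  }

  @Override
  public void end() {
    // record complete; getCurrentRecord() hands it to the materializer
  }

  MyRecord getCurrentRecord() {
    return current;
  }
}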
What is your use case? Are you building something with its own tailored
data layer on top of Parquet? We're always interested in hearing about
projects that need their own data model. We can at least help you out
with the tricky parts, like higher-level type representations.
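To give you a head start on the write side, here's a rough, untested
sketch of a custom WriteSupport for the record in your mail below. It is
only a sketch: the Person holder class, the field names, and the file name
are made up, and the parquet.* packages are the pre-Apache-rename ones
(newer releases use org.apache.parquet.*).

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

class Person {  // illustrative holder class, not a Parquet API
  String name;
  int age;
  List<String> departments;
  Map<String, Integer> ratings;
}

public class PersonWriteSupport extends WriteSupport<Person> {
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message Person {\n"
      + "  required binary Name (UTF8);\n"
      + "  required int32 Age;\n"
      + "  optional group Departments (LIST) {\n"
      + "    repeated binary array (UTF8);\n"
      + "  }\n"
      + "  optional group Ratings (MAP) {\n"
      + "    repeated group map (MAP_KEY_VALUE) {\n"
      + "      required binary key (UTF8);\n"
      + "      required int32 value;\n"
      + "    }\n"
      + "  }\n"
      + "}");

  private RecordConsumer consumer;

  @Override
  public WriteContext init(Configuration configuration) {
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer consumer) {
    this.consumer = consumer;
  }

  @Override
  public void write(Person p) {
    // null/empty handling omitted to keep the sketch short
    consumer.startMessage();
    consumer.startField("Name", 0);
    consumer.addBinary(Binary.fromString(p.name));
    consumer.endField("Name", 0);
    consumer.startField("Age", 1);
    consumer.addInteger(p.age);
    consumer.endField("Age", 1);
    // repeated field: one add per element inside a single start/endField
    consumer.startField("Departments", 2);
    consumer.startGroup();
    consumer.startField("array", 0);
    for (String d : p.departments) {
      consumer.addBinary(Binary.fromString(d));
    }
    consumer.endField("array", 0);
    consumer.endGroup();
    consumer.endField("Departments", 2);
    // map: a repeated group of (key, value) pairs
    consumer.startField("Ratings", 3);
    consumer.startGroup();
    consumer.startField("map", 0);
    for (Map.Entry<String, Integer> e : p.ratings.entrySet()) {
      consumer.startGroup();
      consumer.startField("key", 0);
      consumer.addBinary(Binary.fromString(e.getKey()));
      consumer.endField("key", 0);
      consumer.startField("value", 1);
      consumer.addInteger(e.getValue());
      consumer.endField("value", 1);
      consumer.endGroup();
    }
    consumer.endField("map", 0);
    consumer.endGroup();
    consumer.endField("Ratings", 3);
    consumer.endMessage();
  }
}

Writing is then just:

ParquetWriter<Person> writer =
    new ParquetWriter<Person>(new Path("people.parquet"), new PersonWriteSupport());
writer.write(person);
writer.close();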
rb
On 04/09/2015 09:47 AM, Karthikeyan Muthukumar wrote:
Thanks Jacques & Ryan.
Could either of you please point me to or provide some code snippets for
writing to Parquet with my own object model (NOT using Avro etc.)?
I don't have any complex unions etc.; my data model is very simple, with a
bunch of primitives and a few Arrays and Maps (of Strings -> Numbers).
Looking through the existing code in Drill or Hive requires getting into
the context of those technologies, which often takes much more time than
what's actually needed.
I would greatly appreciate a small Java code snippet for writing a simple
JSON record like this to Parquet (without Avro etc.):
{"Name": "Ram", "Age": 30, "Departments": ["Sales", "Marketing"],
"Ratings": {"Dec": 100, "Nov": 50, "Oct": 200}}
Thanks a lot!
MK
On Wed, Apr 8, 2015 at 9:03 PM, Jacques Nadeau <[email protected]> wrote:
I agree with what Ryan said. In terms of implementation effort, using
the existing object models is great.
However, as you try to tune your application, you may find that the
transformation to the physical format is suboptimal. This is always a
possible risk when working through an abstraction. The example I've seen
previously is that people might create a union at a higher level than is
necessary. For example, imagine
old: {
  first: string
  last: string
}

new: {
  first: string
  last: string
  twitter_handle: string
}
People are inclined to union (old, new). Last I checked, the default Avro
behavior in this situation would be to create five columns: old_first,
old_last, new_first, new_last, and new_twitter_handle (names are actually
nested as group0.x, group1.x or something similar). Depending on what is
being done, this can be suboptimal, as a logical query of "select
table.first from table" now has to read two physical columns, manage two
possibly different encoding schemes, etc.
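Concretely, the merged schema in the file ends up shaped something like
this (member names are illustrative and vary by version):

optional group person {
  optional group member0 {
    optional binary first (UTF8);
    optional binary last (UTF8);
  }
  optional group member1 {
    optional binary first (UTF8);
    optional binary last (UTF8);
    optional binary twitter_handle (UTF8);
  }
}

so "first" physically exists twice, once under each branch of the union.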
This will be even more impactful as we implement things like indices in the
physical layer.
In short, if you are using an abstraction, be aware that the physical
layout may not be as optimal as it would have been if you had hand-tuned
the schema with your particular application in mind. The flip-side is you
save time and aggravation in implementation.
Make sense?
On Wed, Apr 8, 2015 at 10:08 AM, Ryan Blue <[email protected]> wrote:
On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
Thanks Jacques and Alex.
I have been successfully using the Avro model to write to Parquet files
and found it quite logical, because Avro is quite rich.
Are there any functional or performance impacts of using Avro-model-based
Parquet files, specifically w.r.t. accessing the generated Parquet files
through other tools like Drill, SparkSQL, etc.?
Thanks & Regards
MK
Hi MK,
If Avro is the data model you're interested in using in your application,
then parquet-avro is a good choice.
For an application, it is perfectly reasonable to use Avro objects. There
are a few reasons for this:
1. You have existing code based on the Avro format and object model
2. You want to use Avro-generated classes (avro-specific)
3. You want to use your own Java classes via reflection (avro-reflect)
4. You want compatibility with both storage formats (Avro and Parquet)
Similarly, you could use parquet-thrift if you preferred using Thrift
objects or had existing Thrift code. (Or scrooge, or protobuf, etc.)
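For reference, the generic-record route is only a few lines. This is just
a sketch with a made-up schema, file name, and values:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Person\", \"fields\": ["
        + "{\"name\": \"Name\", \"type\": \"string\"},"
        + "{\"name\": \"Age\", \"type\": \"int\"}]}");

    GenericRecord person = new GenericData.Record(schema);
    person.put("Name", "Ram");
    person.put("Age", 30);

    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path("people.parquet"), schema);
    writer.write(person);
    writer.close();
  }
}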
The only reason you would want to build your own object model is to avoid
a translation step later. For example, Hive can translate Avro
objects to the form it expects, but instead we implemented a Hive object
model to go directly from Parquet to Hive's representation. That's faster
and doesn't require copying the data. This is why Drill, SparkSQL, Hive,
and others have their own data models.
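The hook for that kind of direct materialization is ReadSupport: you hand
Parquet a RecordMaterializer that assembles your own objects as the
columns are read. Roughly, as a sketch (MyRecord and MyRecordMaterializer
stand in for your own classes):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import parquet.hadoop.api.InitContext;
import parquet.hadoop.api.ReadSupport;
import parquet.io.api.RecordMaterializer;
import parquet.schema.MessageType;

public class MyReadSupport extends ReadSupport<MyRecord> {
  @Override
  public ReadContext init(InitContext context) {
    // request every column; a real implementation can project a subset
    return new ReadContext(context.getFileSchema());
  }

  @Override
  public RecordMaterializer<MyRecord> prepareForRead(
      Configuration conf,
      Map<String, String> keyValueMetaData,
      MessageType fileSchema,
      ReadContext readContext) {
    // the materializer owns the converter tree that builds MyRecord
    return new MyRecordMaterializer(readContext.getRequestedSchema());
  }
}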
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.