Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My responses are inline.

Guru Medasani
gdm...@gmail.com

> On Mar 10, 2016, at 12:38 PM, Michael Armbrust  wrote:
> 
> A few clarifications:
>  
> 1) High memory and CPU usage. This is because Parquet files can't be streamed 
> into as records arrive. I have seen a lot of OOMs in reasonably sized 
> MR/Spark containers that write out Parquet. When doing dynamic partitioning, 
> where many writers are open at once, we've seen customers have trouble 
> making it work. This has made for some very confused ETL developers.
> 
> In Spark 1.6.1 we avoid having more than 2 files open per task, so this 
> should be less of a problem even for dynamic partitioning.

Thanks for fixing this. It looks like the JIRA below is the one going into 
Spark 1.6.1 that fixes the memory issues during dynamic partitioning. I copied 
it here so the rest of the folks on the email thread can take a look.

SPARK-12546: Writing to partitioned parquet table can fail with OOM
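
For anyone following along, here is a rough sketch of the kind of 
dynamic-partitioned Parquet write being discussed (Scala, Spark 1.6-style 
API; the paths and partition columns are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("DynamicPartitionWrite"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: one row per event, with year and month columns.
    val events = sqlContext.read.json("/data/raw/events")

    // Each distinct (year, month) value becomes its own output directory, so a
    // task may have many partition writers in flight at once; per Michael,
    // 1.6.1 keeps at most 2 files open per task, which is what tames the OOMs.
    events.write
      .partitionBy("year", "month")
      .mode("append")
      .parquet("/warehouse/events")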


>  
> 2) Parquet lags well behind Avro in schema evolution semantics. You can only 
> add columns at the end. Deleting columns at the end is not recommended if you 
> plan to add any columns in the future. Reordering is not supported in the 
> current release.
> 
> This may be true for Impala, but Spark SQL does schema merging by name so you 
> can add / reorder columns with the constraint that you cannot reuse a name 
> with an incompatible type.

As I mentioned in my previous email, this is still something users need to be 
aware of, since the original poster said the following in the initial question:

I also have to consider any file on HDFS may be accessed from other tools like 
Hive, Impala, HAWQ.

Regarding reordering, I also mentioned in my previous email that it may be 
supported in Impala in the next release:

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under this scheme, you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is regarding Impala.)

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications:


> 1) High memory and CPU usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we've seen customers
> have trouble making it work. This has made for some very confused ETL
> developers.
>

In Spark 1.6.1 we avoid having more than 2 files open per task, so this
should be less of a problem even for dynamic partitioning.


> 2) Parquet lags well behind Avro in schema evolution semantics. You can
> only add columns at the end. Deleting columns at the end is not recommended
> if you plan to add any columns in the future. Reordering is not supported
> in the current release.
>

This may be true for Impala, but Spark SQL does schema merging by name so
you can add / reorder columns with the constraint that you cannot reuse a
name with an incompatible type.
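
To make that concrete, here is a minimal sketch of the by-name merging (the
paths and column names are hypothetical, and local mode is assumed):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(
      new SparkConf().setAppName("MergeSchemaDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Batch 1 has columns (id, name).
    sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
      .write.parquet("/tmp/t/batch=1")

    // Batch 2 reorders the columns and adds a new one: (name, id, score).
    sc.parallelize(Seq(("c", 3, 0.5), ("d", 4, 0.9))).toDF("name", "id", "score")
      .write.parquet("/tmp/t/batch=2")

    // mergeSchema reconciles the file footers by column name; rows from batch 1
    // come back with NULL for score. Reusing a name with an incompatible type
    // (say, id as a string in one batch) would fail here instead.
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
    merged.printSchema()
    merged.show()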


Re: AVRO vs Parquet

2016-03-09 Thread Guru Medasani
+1 Paul. Both have some pros and cons.

Hope this helps.

Avro:

Pros: 
1) Plays nicely with other tools, 3rd party or otherwise, and covers cases 
where you specifically need a data type Avro supports, like binary; gladly, 
that list is shrinking all the time (yay nested types in Impala).
2) Good for event data that changes over time. Can be used with the Kafka 
Schema Registry.
3) Schema evolution: supports adding, removing, renaming, and reordering columns.


Cons:
1) Cannot write Avro data from Impala.
2) Timestamp data type is not supported (Hive & Impala); we can still store 
timestamps as longs and convert them in the end-user tables (see the sketch 
just below).
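
As a rough, untested sketch of that workaround (Scala; the spark-avro package,
paths, and column names are assumptions on my part), keep epoch seconds as a
long in the Avro table and cast when building the end-user table; Spark SQL
treats a numeric cast to timestamp as seconds since the epoch:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    val sc = new SparkContext(new SparkConf().setAppName("AvroTimestampWorkaround"))
    val sqlContext = new SQLContext(sc)

    // Requires the spark-avro package (com.databricks:spark-avro in the 1.6 era).
    val raw = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/avro/events")                    // hypothetical path

    // A long column of epoch seconds becomes a proper timestamp in the
    // end-user (e.g. Parquet) table.
    raw.withColumn("event_time", col("event_ts_seconds").cast("timestamp"))
      .write.parquet("/warehouse/events_typed")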

Parquet:

Pros:
1) High performance during both reads and writes, for both wide and narrow 
tables (though mileage may vary based on the dataset's cardinality)
2) Great for analytics

Cons:
1) High memory and CPU usage. This is because Parquet files can't be streamed 
into as records arrive. I have seen a lot of OOMs in reasonably sized MR/Spark 
containers that write out Parquet. When doing dynamic partitioning, where many 
writers are open at once, we've seen customers have trouble making it work. 
This has made for some very confused ETL developers.
2) Parquet lags well behind Avro in schema evolution semantics. You can only 
add columns at the end. Deleting columns at the end is not recommended if you 
plan to add any columns in the future. Reordering is not supported in the 
current release.

Here are the schema evolution semantics and best practices for Parquet and 
Avro in more detail.

In Parquet you can add columns to the end of the table definition and they'll 
be populated with NULL if missing in the file (you can technically delete 
columns from the end too, but then the reader will try to read the old deleted 
column if you add a new column to the end afterwards).

Here is an example of how this works.

At t0 - we have a Parquet table with the columns c1, c2, c3, c4
 t1 - we delete the columns c3 and c4; the table now has the columns c1, c2
 t2 - we add a new column c5; the table now has the columns c1, c2, c5

So now when we query the table for columns c1, c2, and c5, the old files 
cause column c3 to be read where c5 is expected.

Impala currently matches the file schema to the table schema by column order, 
so at t2, if you select c5, Impala will try to read c3 from the old files, 
thinking it's c5 since it's the third column in the file. This will work if 
the types of c3 and c5 are the same (but will probably not give the right 
results) or fail if the types don't match. It will ignore the old c4. Files 
with the new schema (c1,c2,c5) will be read correctly.

You can also rename columns, since Impala only looks at the column ordering 
and not the names. So if you have columns c1 and c2 and reorder them to c2 
and c1, Impala will read c1 when you query c2, and vice versa.

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under this scheme, you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is regarding Impala.)

In Avro we follow the rules described here: 
http://avro.apache.org/docs/1.8.0/spec.html#Schema+Resolution 


This includes rearranging columns, renaming columns via aliasing, and certain 
data type differences (e.g. you can "promote" an int to a bigint in the table 
schema, but not vice versa). AFAIK you should always modify the Avro schema 
directly rather than just the table metadata; I think weird stuff happens if 
you only use the usual ALTER TABLE commands. The preferred method is to modify 
the JSON schema file on HDFS and then use ALTER TABLE to 
add/remove/reorder/rename the columns to match.

I think in some cases using ALTER TABLE without changing the schema could work 
by chance, but in general it seems dangerous to change the column metadata in 
the metastore and not have that be reflected in the Avro schema as well.
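
For illustration, here is a rough sketch of those resolution rules in action,
using the Avro Java API from Scala (the record and field names are made up):
the reader schema renames a field via an alias and adds a field with a
default, and Avro resolves the old data against it by name.

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    // Writer schema: what the data was originally written with.
    val writerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[
        |  {"name":"id","type":"int"},
        |  {"name":"ts","type":"long"}
        |]}""".stripMargin)

    // Reader schema: renames ts to event_ts via an alias and adds source with
    // a default, per the schema-resolution rules linked above.
    val readerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[
        |  {"name":"id","type":"int"},
        |  {"name":"event_ts","type":"long","aliases":["ts"]},
        |  {"name":"source","type":"string","default":"unknown"}
        |]}""".stripMargin)

    // Serialize one record with the writer schema.
    val rec = new GenericData.Record(writerSchema)
    rec.put("id", 1)
    rec.put("ts", 1457625600000L)
    val out = new ByteArrayOutputStream()
    val enc = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(rec, enc)
    enc.flush()

    // Deserialize with (writer, reader): Avro matches fields by name, applies
    // the alias, and fills the missing field from its default.
    val dec = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val resolved =
      new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, dec)
    println(resolved) // {"id": 1, "event_ts": 1457625600000, "source": "unknown"}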

Guru Medasani
gdm...@gmail.com



> On Mar 4, 2016, at 7:36 AM, Paul Leclercq  wrote:
> 
> 
> 
> Nice articles about Parquet with Avro: 
> https://dzone.com/articles/understanding-how-parquet 
> 
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ 
> 
> Nice video from the good folks at Cloudera on the differences between 
> "Avrow" and Parquet
> https://www.youtube.com/watch?v=AY1dEfyFeHc 
> 
> 
> 2016-03-04 7:12 GMT+01:00 Koert Kuipers:
> well can you use orc without bringing in the kitchen sink of dependencies 
> also known as hive?
> 
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote:
> How about ORC? I have experimented briefly with Parquet and ORC, and I liked 
> the fact that ORC has its schema within the file, which makes it handy to 
> work with any other tools.

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice articles about Parquet *with* Avro:

   - https://dzone.com/articles/understanding-how-parquet
   - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Nice video from the good folks at Cloudera on the *differences* between
"Avrow" and Parquet

   - https://www.youtube.com/watch?v=AY1dEfyFeHc


2016-03-04 7:12 GMT+01:00 Koert Kuipers:

> well can you use orc without bringing in the kitchen sink of dependencies
> also known as hive?
>
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim  wrote:
>
>> How about ORC? I have experimented briefly with Parquet and ORC, and I
>> liked the fact that ORC has its schema within the file, which makes it
>> handy to work with any other tools.
>>
>> Jong Wook
>>
>> On 3 March 2016 at 23:29, Don Drake  wrote:
>>
>>> My tests show Parquet has better performance than Avro in just about
>>> every test.  It really shines when you are querying a subset of columns in
>>> a wide table.
>>>
>>> -Don
>>>
>>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann 
>>> wrote:
>>>
 Which format is the best format for SparkSQL adhoc queries and general
 data storage?

 There are lots of specialized cases, but generally accessing some but
 not all the available columns with a reasonable subset of the data.

 I am leaning towards Parquet as it has great support in Spark.

 I also have to consider any file on HDFS may be accessed from other
 tools like Hive, Impala, HAWQ.

 Suggestions?
 —
 airis.DATA
 Timothy Spann, Senior Solutions Architect
 C: 609-250-5894
 http://airisdata.com/
 http://meetup.com/nj-datascience



>>>
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake 
>>> 800-733-2143
>>>
>>
>>
>


-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/


Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
well can you use orc without bringing in the kitchen sink of dependencies
also known as hive?

On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim  wrote:

> How about ORC? I have experimented briefly with Parquet and ORC, and I
> liked the fact that ORC has its schema within the file, which makes it
> handy to work with any other tools.
>
> Jong Wook
>
> On 3 March 2016 at 23:29, Don Drake  wrote:
>
>> My tests show Parquet has better performance than Avro in just about
>> every test.  It really shines when you are querying a subset of columns in
>> a wide table.
>>
>> -Don
>>
>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann 
>> wrote:
>>
>>> Which format is the best format for SparkSQL adhoc queries and general
>>> data storage?
>>>
>>> There are lots of specialized cases, but generally accessing some but
>>> not all the available columns with a reasonable subset of the data.
>>>
>>> I am leaning towards Parquet as it has great support in Spark.
>>>
>>> I also have to consider any file on HDFS may be accessed from other
>>> tools like Hive, Impala, HAWQ.
>>>
>>> Suggestions?
>>> —
>>> airis.DATA
>>> Timothy Spann, Senior Solutions Architect
>>> C: 609-250-5894
>>> http://airisdata.com/
>>> http://meetup.com/nj-datascience
>>>
>>>
>>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake 
>> 800-733-2143
>>
>
>


Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I
liked the fact that ORC has its schema within the file, which makes it
handy to work with any other tools.

Jong Wook

On 3 March 2016 at 23:29, Don Drake  wrote:

> My tests show Parquet has better performance than Avro in just about every
> test.  It really shines when you are querying a subset of columns in a wide
> table.
>
> -Don
>
> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann 
> wrote:
>
>> Which format is the best format for SparkSQL adhoc queries and general
>> data storage?
>>
>> There are lots of specialized cases, but generally accessing some but not
>> all the available columns with a reasonable subset of the data.
>>
>> I am leaning towards Parquet as it has great support in Spark.
>>
>> I also have to consider any file on HDFS may be accessed from other tools
>> like Hive, Impala, HAWQ.
>>
>> Suggestions?
>> —
>> airis.DATA
>> Timothy Spann, Senior Solutions Architect
>> C: 609-250-5894
>> http://airisdata.com/
>> http://meetup.com/nj-datascience
>>
>>
>>
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake 
> 800-733-2143
>


Re: AVRO vs Parquet

2016-03-03 Thread Don Drake
My tests show Parquet has better performance than Avro in just about every
test.  It really shines when you are querying a subset of columns in a wide
table.

-Don

On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann 
wrote:

> Which format is the best format for SparkSQL adhoc queries and general
> data storage?
>
> There are lots of specialized cases, but generally accessing some but not
> all the available columns with a reasonable subset of the data.
>
> I am leaning towards Parquet as it has great support in Spark.
>
> I also have to consider any file on HDFS may be accessed from other tools
> like Hive, Impala, HAWQ.
>
> Suggestions?
> —
> airis.DATA
> Timothy Spann, Senior Solutions Architect
> C: 609-250-5894
> http://airisdata.com/
> http://meetup.com/nj-datascience
>
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143