Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My response is inline. Guru Medasani gdm...@gmail.com > On Mar 10, 2016, at 12:38 PM, Michael Armbrust wrote: > > A few clarifications: > > 1) High memory and cpu usage. This is because Parquet files can't be streamed > into as

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once,

Re: AVRO vs Parquet

2016-03-09 Thread Guru Medasani
+1 Paul. Both have some pros and cons. Hope this helps. Avro: Pros: 1) Plays nice with other tools, 3rd party or otherwise, or you specifically need some data type in AVRO like binary, but gladly that list is shrinking all the time (yay nested types in Impala). 2) Good for event data that

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice articles about Parquet *with* Avro: - https://dzone.com/articles/understanding-how-parquet - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ Nice video from the good folks at Cloudera on the *differences* between "Avrow" and Parquet -

Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
Well, can you use ORC without bringing in the kitchen sink of dependencies, also known as Hive? On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote: > How about ORC? I have experimented briefly with Parquet and ORC, and I > liked the fact that ORC has its schema within the

Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I liked the fact that ORC has its schema within the file, which makes it handy to work with any other tools. Jong Wook On 3 March 2016 at 23:29, Don Drake wrote: > My tests show Parquet has better

Re: AVRO vs Parquet

2016-03-03 Thread Don Drake
My tests show Parquet has better performance than Avro in just about every test. It really shines when you are querying a subset of columns in a wide table. -Don On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann wrote: > Which format is the best format for SparkSQL adhoc
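
The point about querying a subset of columns in a wide table can be illustrated with a toy model. This is only a sketch and not Don's test setup: the table shape, the column names (`c0`, `c7`, `c42`) and the helper functions are all hypothetical, and it counts cells touched rather than measuring real Avro or Parquet I/O, which also involve encoding, compression and metadata.

```python
from typing import Dict, List

# Toy model only: counts how many cells each layout has to touch.
# All names and sizes here are illustrative assumptions.

def row_layout_cells_read(rows: List[Dict[str, int]]) -> int:
    """Row-oriented storage (Avro-like): each record is decoded in full,
    so every cell of every row is touched regardless of the columns wanted."""
    return sum(len(row) for row in rows)

def column_layout_cells_read(columns: Dict[str, List[int]], wanted: List[str]) -> int:
    """Column-oriented storage (Parquet-like): only the requested columns
    are read; the remaining columns are skipped entirely."""
    return sum(len(columns[name]) for name in wanted)

def to_columns(rows: List[Dict[str, int]]) -> Dict[str, List[int]]:
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

NUM_ROWS, NUM_COLS = 1000, 50  # a "wide" table, sizes chosen arbitrarily
rows = [{f"c{j}": i * j for j in range(NUM_COLS)} for i in range(NUM_ROWS)]
columns = to_columns(rows)

wanted = ["c0", "c7", "c42"]  # the query needs 3 of the 50 columns
row_cells = row_layout_cells_read(rows)
col_cells = column_layout_cells_read(columns, wanted)

print(f"row layout touches {row_cells} cells")
print(f"column layout touches {col_cells} cells")
```

In this model the work for the columnar layout scales with the number of columns requested, while the row layout always pays for the full width of the table, which is consistent with the gap growing on wide tables.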

AVRO vs Parquet

2016-03-02 Thread Timothy Spann
Which format is the best format for SparkSQL ad hoc queries and general data storage? There are lots of specialized cases, but generally accessing some but not all the available columns with a reasonable subset of the data. I am leaning towards Parquet as it has great support in Spark. I also