Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My response is inline. Guru Medasani gdm...@gmail.com > On Mar 10, 2016, at 12:38 PM, Michael Armbrust wrote: > > A few clarifications: > > 1) High memory and cpu usage. This is because Parquet files can't be streamed > into as records arrive. I have seen

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once, we’ve
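
A rough way to see why many open writers hurt: each Parquet writer buffers a row group in memory before it can flush, so the buffered data scales with the number of writers open at once. The sketch below is back-of-the-envelope arithmetic only. The 128 MB row-group size and the writer count are assumptions for illustration, not numbers from this thread, and real usage depends on the writer configuration and encodings.

```python
def estimated_writer_buffer_mb(open_writers, row_group_mb=128):
    """Rough estimate of memory buffered by concurrently open Parquet writers.

    row_group_mb is an ASSUMED row-group size (hypothetical default for this
    sketch); check the actual setting of the writer you use.
    """
    return open_writers * row_group_mb


# Hypothetical case: dynamic partitioning keeps 50 partition writers open.
print(estimated_writer_buffer_mb(50), "MB buffered in the worst case")
```

Under these assumptions a single task could need several gigabytes of buffer space, which is in line with the OOMs described above for reasonably sized containers.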

Re: AVRO vs Parquet

2016-03-09 Thread Guru Medasani
+1 Paul. Both have some pros and cons. Hope this helps. Avro: Pros: 1) Plays nice with other tools, 3rd party or otherwise, or you specifically need some data type in AVRO like binary, but gladly that list is shrinking all the time (yay nested types in Impala). 2) Good for event data that cha

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
Nice articles about Parquet *with* Avro: - https://dzone.com/articles/understanding-how-parquet - http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ Nice video from the good folks at Cloudera on the *differences* between "Avrow" and Parquet - https://www.youtube.com/watch?v=AY1

Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
well can you use orc without bringing in the kitchen sink of dependencies also known as hive? On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote: > How about ORC? I have experimented briefly with Parquet and ORC, and I > liked the fact that ORC has its schema within the file, which makes it >

Re: AVRO vs Parquet

2016-03-03 Thread Xinh Huynh
Parquet is faster for ad hoc queries because of its columnar storage. (It only reads the columns needed for a query.) It's more than twice as fast (often a lot more) as Avro in these slides from SVDS: http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-mor
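
To make the "only reads the columns needed" point concrete, here is a toy sketch in plain Python. It does not use the real Avro or Parquet formats: JSON blobs stand in for serialized data, and the table, column names, and helper functions are all made up for illustration. It only shows the general idea that a column layout lets a single-column query touch far fewer bytes than a row layout.

```python
import json

# Hypothetical toy table: 3 rows, 4 columns (names are invented).
rows = [
    {"id": 1, "name": "a", "city": "x", "score": 10},
    {"id": 2, "name": "b", "city": "y", "score": 20},
    {"id": 3, "name": "c", "city": "z", "score": 30},
]

# Row-oriented layout (Avro-like in spirit): one serialized blob per record.
row_store = [json.dumps(r).encode() for r in rows]

# Column-oriented layout (Parquet-like in spirit): one serialized blob per column.
col_store = {c: json.dumps([r[c] for r in rows]).encode() for c in rows[0]}


def sum_scores_row(store):
    """Sum 'score' from the row layout; every full record must be decoded."""
    bytes_read = sum(len(blob) for blob in store)
    total = sum(json.loads(blob)["score"] for blob in store)
    return total, bytes_read


def sum_scores_col(store):
    """Sum 'score' from the column layout; only the 'score' blob is decoded."""
    blob = store["score"]
    return sum(json.loads(blob)), len(blob)


row_total, row_bytes = sum_scores_row(row_store)
col_total, col_bytes = sum_scores_col(col_store)
print(row_total, col_total)          # same answer from both layouts
print(row_bytes, ">", col_bytes)     # the column layout reads fewer bytes
```

The gap grows with the number of columns, which matches the observation elsewhere in this thread that Parquet shines when querying a subset of columns in a wide table.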

Re: AVRO vs Parquet

2016-03-03 Thread Jong Wook Kim
How about ORC? I have experimented briefly with Parquet and ORC, and I liked the fact that ORC has its schema within the file, which makes it handy to work with any other tools. Jong Wook On 3 March 2016 at 23:29, Don Drake wrote: > My tests show Parquet has better performance than Avro in just

Re: AVRO vs Parquet

2016-03-03 Thread Don Drake
My tests show Parquet has better performance than Avro in just about every test. It really shines when you are querying a subset of columns in a wide table. -Don On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann wrote: > Which format is the best format for SparkSQL adhoc queries and general > data