Thanks Michael for clarifying this. My response is inline.
Guru Medasani
gdm...@gmail.com
> On Mar 10, 2016, at 12:38 PM, Michael Armbrust wrote:
>
> A few clarifications:
>
> 1) High memory and cpu usage. This is because Parquet files can't be streamed
> into as records arrive. I have seen
A few clarifications:
> 1) High memory and cpu usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we’ve
+1 Paul. Both have some pros and cons.
Hope this helps.
Avro:
Pros:
1) Plays nice with other tools, 3rd party or otherwise, or you specifically
need some data type in Avro like binary, but thankfully that list is shrinking
all the time (yay nested types in Impala).
2) Good for event data that cha
Nice articles about Parquet *with* Avro:
- https://dzone.com/articles/understanding-how-parquet
- http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
Nice video from the good folks at Cloudera on the *differences* between
Avro and Parquet
- https://www.youtube.com/watch?v=AY1
Well, can you use ORC without bringing in the kitchen sink of dependencies,
also known as Hive?
On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote:
> How about ORC? I have experimented briefly with Parquet and ORC, and I
> liked the fact that ORC has its schema within the file, which makes it
>
Parquet is faster for ad hoc queries because of its columnar storage. (It
only reads the columns needed for a query.) It's more than twice as fast
(often a lot more) as Avro in these slides from SVDS:
http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-mor
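
To make the "only reads the columns needed" point concrete, here is a toy
sketch in plain Python. It is not the real Parquet or Avro on-disk format, and
the helper names (make_rows, to_columns, and so on) are made up for
illustration. It just counts how many cells each layout has to touch to sum
one column of a wide table.

```python
# Toy model only: this is NOT the real Parquet or Avro file format.
# All names here are hypothetical and exist only for illustration.
from typing import Dict, List, Tuple

Row = Dict[str, int]


def make_rows(n_rows: int, n_cols: int) -> List[Row]:
    """Build a wide table where every cell holds a small integer."""
    return [{f"c{j}": i + j for j in range(n_cols)} for i in range(n_rows)]


def to_columns(rows: List[Row]) -> Dict[str, List[int]]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, List[int]] = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_row_oriented(rows: List[Row], wanted: str) -> Tuple[int, int]:
    """Row layout: the whole record is read to reach a single field."""
    total = 0
    cells_read = 0
    for row in rows:
        cells_read += len(row)  # every cell of the record is touched
        total += row[wanted]
    return total, cells_read


def sum_column_oriented(columns: Dict[str, List[int]], wanted: str) -> Tuple[int, int]:
    """Column layout: only the requested column is touched."""
    values = columns[wanted]
    return sum(values), len(values)


rows = make_rows(1000, 50)
columns = to_columns(rows)

row_total, row_cells = sum_row_oriented(rows, "c3")
col_total, col_cells = sum_column_oriented(columns, "c3")

print(row_total == col_total, row_cells, col_cells)  # True 50000 1000
```

Both layouts give the same answer, but in this toy model the row layout
touches 50x more cells for a 50-column table. Real-world speedups depend on
compression, encoding, and I/O, so treat the ratio as an illustration rather
than a benchmark.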
How about ORC? I have experimented briefly with Parquet and ORC, and I
liked the fact that ORC has its schema within the file, which makes it
handy to work with any other tools.
Jong Wook
On 3 March 2016 at 23:29, Don Drake wrote:
> My tests show Parquet has better performance than Avro in just
My tests show Parquet has better performance than Avro in just about every
test. It really shines when you are querying a subset of columns in a wide
table.
-Don
On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann wrote:
> Which format is the best format for Spark SQL ad hoc queries and general
> data