Re: ORC v/s Parquet for Spark 2.0

Jörn Franke Tue, 26 Jul 2016 02:10:29 -0700

I think both are very similar, but with slightly different goals. While they 
work transparently for each Hadoop application, you need to enable specific 
support in the application for predicate pushdown. 
In the end you have to check which application you are using and run some tests 
(with predicate pushdown configured correctly). Keep in mind that both 
formats work best if the data is sorted on the filter columns (which is your 
responsibility) and if their optimizations are configured correctly (min/max 
index, bloom filter, compression, etc.).
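
To illustrate why sorting on the filter column matters, here is a minimal 
sketch in plain Python. It does not use ORC, Parquet or Spark; it only mimics 
the idea of a min/max index kept per chunk (a stripe in ORC, a row group in 
Parquet). The names build_chunks and chunks_to_read, the chunk size and the 
sensor_id data are all made up for the example.

```python
import random


def build_chunks(values, chunk_size):
    """Split values into fixed-size chunks and keep (min, max) for each chunk,
    roughly what a min/max index stores per stripe or row group."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append((min(chunk), max(chunk)))
    return chunks


def chunks_to_read(chunks, low, high):
    """Count chunks whose [min, max] range overlaps the filter low <= v <= high.
    Every other chunk can be skipped without reading its rows."""
    return sum(1 for cmin, cmax in chunks if cmax >= low and cmin <= high)


random.seed(42)
# Hypothetical data: 100,000 readings from 1,000 sensors, in arrival order.
sensor_ids = [random.randrange(1000) for _ in range(100_000)]

unsorted_chunks = build_chunks(sensor_ids, 1_000)
sorted_chunks = build_chunks(sorted(sensor_ids), 1_000)

# Filter: 100 <= sensor_id <= 109
unsorted_hits = chunks_to_read(unsorted_chunks, 100, 109)
sorted_hits = chunks_to_read(sorted_chunks, 100, 109)
print("unsorted: read", unsorted_hits, "of", len(unsorted_chunks), "chunks")
print("sorted:   read", sorted_hits, "of", len(sorted_chunks), "chunks")
```

With unsorted data nearly every chunk spans the whole value range, so the 
index cannot rule anything out. After sorting, only the few chunks that 
actually contain the requested ids have to be read. A real reader only gets 
this benefit if predicate pushdown is enabled in the application.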


If you need to ingest sensor data, you may want to store it first in HBase and 
then batch process it into large files in ORC or Parquet format.
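
As a rough sketch of the batching step only (plain Python, no HBase or Spark 
calls), the idea is to collect many small readings and write them out in a few 
large units instead of one tiny file per reading. The function batch_readings, 
the record layout and the batch size are assumptions made for illustration; in 
practice each batch would be written as one ORC or Parquet file.

```python
def batch_readings(readings, batch_size):
    """Yield lists of up to batch_size readings.
    Each yielded list stands in for one large output file."""
    batch = []
    for reading in readings:
        batch.append(reading)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


# Hypothetical sensor readings accumulated in the staging store.
readings = [{"sensor_id": i % 10, "value": float(i)} for i in range(2_500)]

files = list(batch_readings(readings, 1_000))
print([len(f) for f in files])  # prints [1000, 1000, 500]
```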

> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
> 
> Just wondering about the advantages and disadvantages of converting data 
> into ORC or Parquet. 
> 
> In the documentation of Spark there are numerous examples of Parquet format. 
> 
> Any strong reasons to choose Parquet over the ORC file format?
> 
> Also: the current data compression is bzip2.
> 
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
> This seems biased.
