Interesting opinion, thank you

Still, according to the website, Parquet is basically inspired by Dremel (Google) [1], and 
parts of ORC were enhanced while deployed at Facebook and Yahoo [2].

Other than this presentation [3], do you guys know of any other benchmarks?

[1] https://parquet.apache.org/documentation/latest/
[2] https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
> 
> when Parquet came out, it was developed by a community of companies and was 
> designed as a library to be supported by multiple big data projects. nice
> 
> ORC, on the other hand, initially only supported Hive. it wasn't even designed 
> as a library that could be reused. even today it brings in the kitchen sink of 
> transitive dependencies. yikes
> 
> 
> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> I think both are very similar but have slightly different goals. While they 
> work transparently with every Hadoop application, you need to enable specific 
> support in each application for predicate push-down. 
> In the end you have to check which application you are using and run some 
> tests (with predicate push-down configured correctly). Keep in mind that 
> both formats work best if the data is sorted on the filter columns (which is 
> your responsibility) and if their optimizations are configured correctly 
> (min/max indexes, bloom filters, compression, etc.).
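> 
> To make that concrete, here is a minimal Spark (Scala) sketch of enabling 
> predicate push-down and writing data sorted on the filter column. The paths 
> and the "sensor_id" column are made up for illustration, and whether the ORC 
> writer options below are honored depends on your Spark and ORC versions:
> 
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder()
>   .appName("orc-parquet-pushdown")
>   // enable predicate push-down for both formats
>   .config("spark.sql.parquet.filterPushdown", "true")
>   .config("spark.sql.orc.filterPushdown", "true")
>   .getOrCreate()
> 
> // sort on the column you filter by, so min/max indexes can skip data
> spark.read.parquet("/data/events")
>   .sortWithinPartitions("sensor_id")
>   .write
>   .option("orc.bloom.filter.columns", "sensor_id") // bloom filter on the filter column
>   .option("compression", "zlib")
>   .orc("/data/events_orc")
> 
> // this filter can then be pushed down to the ORC reader
> spark.read.orc("/data/events_orc").filter("sensor_id = 42").show()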
> 
> If you need to ingest sensor data, you may want to store it first in HBase and 
> then batch process it into large files in ORC or Parquet format, as sketched below.
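> 
> For the batch step, a rough sketch (the HBase scan is elided since connector 
> APIs vary; the paths and columns are hypothetical):
> 
> // reusing the `spark` session from the sketch above;
> // assume `raw` was loaded from HBase (or a staging dump) beforehand
> val raw = spark.read.json("/staging/sensor-dump")
> 
> raw.sortWithinPartitions("sensor_id", "ts")
>   .coalesce(32)          // fewer, larger files
>   .write
>   .mode("append")
>   .partitionBy("date")   // assumes a "date" column exists
>   .parquet("/warehouse/sensor_parquet")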
> 
> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
> 
>> Just wondering about the advantages and disadvantages of converting data into 
>> ORC or Parquet. 
>> 
>> In the Spark documentation there are numerous examples using the Parquet format. 
>> 
>> Any strong reasons to choose Parquet over the ORC file format?
>> 
>> Also: the current data compression is bzip2.
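>> 
>> For context, the kind of conversion I have in mind, roughly (a Spark/Scala 
>> sketch; paths and options are placeholders, and I am assuming the bzip2 
>> files are CSV):
>> 
>> import org.apache.spark.sql.SparkSession
>> 
>> val spark = SparkSession.builder().appName("bzip2-to-parquet").getOrCreate()
>> 
>> // Spark decompresses .bz2 text input transparently
>> val df = spark.read
>>   .option("header", "true")
>>   .csv("/data/input/*.csv.bz2")
>> 
>> df.write
>>   .option("compression", "snappy")
>>   .parquet("/data/output_parquet")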
>> 
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> 
>> This seems biased.
