Re: ORC v/s Parquet for Spark 2.0

u...@moosheimer.com Wed, 27 Jul 2016 02:48:19 -0700

Hi Gourav,

Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory 
db with data storage, while Parquet is "only" a columnar storage format.


As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... that's 
more a wish :-).

Regards,
Uwe

Best regards
Kay-Uwe Moosheimer

> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> Gosh,
> 
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a 
> speed that is better than SPARK.
> 
> Has anyone heard of KUDU? It's better than Parquet. But I think that someone 
> might just start saying that KUDU has a difficult lineage as well. After all 
> dynastic rules dictate.
> 
> Personally I feel that if something stores my data compressed and lets me 
> access it faster, I do not care where it comes from or how difficult the 
> childbirth was :)
> 
> 
> Regards,
> Gourav
> 
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>> <sbpothin...@gmail.com> wrote:
>> Just a correction:
>> 
>> The ORC Java libraries from Hive have been forked into Apache ORC. 
>> Vectorization is the default. 
>> 
>> I do not know whether Spark is leveraging this new repo.
>> 
>> <dependency>
>>     <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
>> 
>> Sent from my iPhone
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> 
>>> parquet was inspired by dremel but written from the ground up as a library 
>>> with support for a variety of big data systems (hive, pig, impala, 
>>> cascading, etc.). it is also easy to add new support, since it's a proper 
>>> library.
>>> 
>>> orc has been enhanced while deployed at facebook in hive and at yahoo in 
>>> hive. just hive. it didn't really exist by itself. it was part of the big 
>>> java soup that is called hive, without an easy way to extract it. hive does 
>>> not expose proper java apis. it never cared for that.
>>> 
>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>>>> <ovidiu-cristian.ma...@inria.fr> wrote:
>>>> Interesting opinion, thank you
>>>> 
>>>> Still, according to the websites, Parquet is basically inspired by Dremel 
>>>> (Google) [1], and part of ORC has been enhanced while deployed at 
>>>> Facebook and Yahoo [2].
>>>> 
>>>> Other than this presentation [3], do you guys know of any other benchmarks?
>>>> 
>>>> [1] https://parquet.apache.org/documentation/latest/
>>>> [2] https://orc.apache.org/docs/
>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>> 
>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>> 
>>>>> when parquet came out it was developed by a community of companies, and 
>>>>> was designed as a library to be supported by multiple big data projects. 
>>>>> nice
>>>>> 
>>>>> orc on the other hand initially only supported hive. it wasn't even 
>>>>> designed as a library that can be re-used. even today it brings in the 
>>>>> kitchen sink of transitive dependencies. yikes
>>>>> 
>>>>> 
>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>> I think both are very similar, but with slightly different goals. While 
>>>>>> they work transparently for each Hadoop application, you need to enable 
>>>>>> specific support in the application for predicate pushdown. 
>>>>>> In the end you have to check which application you are using and do some 
>>>>>> tests (with the correct predicate pushdown configuration). Keep in mind 
>>>>>> that both formats work best if they are sorted on the filter columns 
>>>>>> (which is your responsibility) and if their optimizations are correctly 
>>>>>> configured (min/max index, bloom filter, compression, etc.).
>>>>>> 
>>>>>> If you need to ingest sensor data, you may want to store it first in 
>>>>>> HBase and then batch process it into large files in ORC or Parquet format.
>>>>>> 
>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Just wondering about the advantages and disadvantages of converting 
>>>>>>> data into ORC or Parquet. 
>>>>>>> 
>>>>>>> In the Spark documentation there are numerous examples using the 
>>>>>>> Parquet format. 
>>>>>>> 
>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>> 
>>>>>>> Also: the current data compression is bzip2.
>>>>>>> 
>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>  
>>>>>>> This seems biased.
>
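
The point made earlier in the thread, that both formats work best when the data is sorted on the filter columns and the min/max index is in use, can be illustrated with a small sketch in plain Python (not Spark). It assumes that each block of rows (a "row group" in Parquet, a "stripe" in ORC) stores min and max statistics for a column, and that a reader skips a block when the filter value lies outside that range. The names build_row_groups, scan_equals and sensor_ids are hypothetical and are not part of any ORC or Parquet API.

```python
import random


def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_equals(groups, target):
    """Return (matching rows, number of groups that had to be read)."""
    matches = []
    groups_read = 0
    for group in groups:
        # The statistics prove this group cannot contain the target: skip it.
        if target < group["min"] or target > group["max"]:
            continue
        groups_read += 1
        matches.extend(v for v in group["rows"] if v == target)
    return matches, groups_read


random.seed(0)
# Hypothetical filter column: 1000 distinct sensor ids, 100 rows each.
sensor_ids = [i % 1000 for i in range(100_000)]
random.shuffle(sensor_ids)

unsorted_groups = build_row_groups(sensor_ids, 10_000)
sorted_groups = build_row_groups(sorted(sensor_ids), 10_000)

hits_unsorted, read_unsorted = scan_equals(unsorted_groups, 42)
hits_sorted, read_sorted = scan_equals(sorted_groups, 42)

print("unsorted: rows found =", len(hits_unsorted), "groups read =", read_unsorted)
print("sorted:   rows found =", len(hits_sorted), "groups read =", read_sorted)
```

Both scans return the same rows, but with sorted data only one group overlaps the filter value, while with shuffled data nearly every group's min/max range covers it and has to be read. Whether a given engine actually applies this kind of skipping depends on its predicate pushdown configuration, so it still needs to be tested per application.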
