Since everyone here is discussing the ever-evolving topic of storage formats
and serdes: any opinions/thoughts/experience with Apache Arrow? It sounds
like a nice idea, but how ready is it?

On Wed, Jul 27, 2016 at 11:31 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Kudu has, from my impression, been designed to offer something between
> hbase and parquet for write-intensive loads - it is not faster for
> warehouse-type querying compared to parquet (rather slower, because that
> is not its use case). I assume this is still its strategy.
>
> For some scenarios it could make sense together with parquet and Orc.
> However, I am not sure what the advantage would be over using hbase +
> parquet and Orc.
>
> On 27 Jul 2016, at 11:47, "u...@moosheimer.com" <u...@moosheimer.com> wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an
> in-memory db with persistent data storage, while Parquet is "only" a
> columnar storage format.
>
> As I understand it, Kudu is a BI db meant to compete with Exasol or Hana
> (ok ... that's more a wish :-).
>
> Regards,
> Uwe
>
> With kind regards / best regards
> Kay-Uwe Moosheimer
>
> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ
> faster than SPARK does.
>
> Has anyone heard of KUDU? It's better than Parquet. But I think that
> someone might just start saying that KUDU has a difficult lineage as well.
> After all, dynastic rules dictate.
>
> Personally, I feel that if something stores my data compressed and lets me
> access it faster, I do not care where it comes from or how difficult the
> childbirth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just a correction:
>>
>> The ORC Java libraries from Hive have been forked into Apache ORC, with
>> vectorization enabled by default.
>>
>> Does anyone know if Spark is leveraging this new repo?
>>
>> <dependency>
>>     <groupId>org.apache.orc</groupId>
>>     <artifactId>orc</artifactId>
>>     <version>1.1.2</version>
>>     <type>pom</type>
>> </dependency>
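>>
>> (For reference, a minimal sketch of reading and writing ORC through
>> Spark's own API - this goes through whatever ORC implementation Spark
>> bundles, which may or may not be the new repo; the Spark 2.0
>> SparkSession API and all paths here are assumptions:)
>>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder().appName("orc-check").getOrCreate()
>>
>> // write a DataFrame out as ORC, then read it back
>> val df = spark.read.json("/data/events.json") // hypothetical input
>> df.write.format("orc").save("/data/events_orc")
>> val orcDf = spark.read.format("orc").load("/data/events_orc")
>> orcDf.printSchema()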
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since it's a proper
>> library.
>>
>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, according to the websites, Parquet is basically inspired by Dremel
>>> (Google) [1], and parts of ORC have been enhanced while deployed at
>>> Facebook and Yahoo [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>
>>>> I think both are very similar, but with slightly different goals. While
>>>> they work transparently with every Hadoop application, you need to enable
>>>> specific support in the application for predicate push down.
>>>> In the end you have to check which application you are using and do some
>>>> tests (with correct predicate push down configuration). Keep in mind that
>>>> both formats work best if the data is sorted on the filter columns (which
>>>> is your responsibility) and if their optimizations are correctly
>>>> configured (min/max indexes, bloom filters, compression etc.).
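>>>>
>>>> For illustration, in Spark the push down is toggled per format - a
>>>> sketch only, assuming the Spark 2.0 config keys (defaults vary by
>>>> version) and a made-up path and column:
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val spark = SparkSession.builder().getOrCreate()
>>>> import spark.implicits._
>>>>
>>>> // enable predicate push down for both formats
>>>> spark.conf.set("spark.sql.parquet.filterPushdown", "true")
>>>> spark.conf.set("spark.sql.orc.filterPushdown", "true")
>>>>
>>>> // a filter on a sorted column can then skip row groups / stripes
>>>> // via the min/max indexes instead of scanning all the data
>>>> val hot = spark.read.orc("/data/sensors").filter($"temperature" > 100.0)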
>>>>
>>>> If you need to ingest sensor data you may want to store it first in
>>>> hbase and then batch-process it into large files in ORC or Parquet format.
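>>>>
>>>> The batch step might look roughly like this (a sketch only - it assumes
>>>> the raw data was already exported from hbase to a staging directory,
>>>> since reading hbase directly needs a separate connector; all paths and
>>>> column names are made up):
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val spark = SparkSession.builder().getOrCreate()
>>>>
>>>> // compact many small raw files into a few large, sorted ORC files;
>>>> // sorting on the filter columns makes the min/max indexes effective
>>>> val raw = spark.read.json("/staging/sensor_events")
>>>> raw.sort("sensor_id", "event_time")
>>>>    .coalesce(16) // fewer, larger output files
>>>>    .write.format("orc").save("/warehouse/sensor_events_orc")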
>>>>
>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com>
>>>> wrote:
>>>>
>>>> Just wondering about the advantages and disadvantages of converting data
>>>> into ORC or Parquet.
>>>>
>>>> In the documentation of Spark there are numerous examples of the Parquet
>>>> format.
>>>>
>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>
>>>> Also: our current data compression is bzip2
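>>>>
>>>> The conversion I have in mind is something like this (a sketch,
>>>> assuming the bzip2 data is CSV-like text; paths, the header option and
>>>> the codec choice are made up):
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val spark = SparkSession.builder().getOrCreate()
>>>>
>>>> // Spark/Hadoop decompress .bz2 transparently on read
>>>> val raw = spark.read.option("header", "true").csv("/data/input/*.bz2")
>>>>
>>>> // write as Parquet, switching compression to snappy
>>>> raw.write.option("compression", "snappy").parquet("/data/parquet_out")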
>>>>
>>>>
>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>> This seems biased.
>>>>
>>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha
