Gosh, whether ORC came from this or that, it runs queries in Hive on Tez faster than Spark does.
Has anyone heard of Kudu? It's better than Parquet. But I suspect someone will just start saying that Kudu has a difficult lineage as well; after all, dynastic rules dictate. Personally, I feel that if something stores my data compressed and lets me access it faster, I do not care where it comes from or how difficult the childbirth was :)

Regards,
Gourav

On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:

> Just a correction:
>
> The ORC Java libraries from Hive were forked into Apache ORC, with vectorization on by default.
>
> I do not know if Spark is leveraging this new repo yet:
>
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
>
> Sent from my iPhone
>
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> Parquet was inspired by Dremel but written from the ground up as a library, with support for a variety of big data systems (Hive, Pig, Impala, Cascading, etc.). It is also easy to add new support, since it is a proper library.
>
> ORC has been enhanced while deployed at Facebook in Hive and at Yahoo in Hive. Just Hive. It didn't really exist by itself; it was part of the big Java soup that is called Hive, without an easy way to extract it. Hive does not expose proper Java APIs. It never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Interesting opinion, thank you.
>>
>> Still, per the websites, Parquet is basically inspired by Dremel (Google) [1], and part of ORC has been enhanced while deployed at Facebook and Yahoo [2].
>>
>> Other than this presentation [3], do you know of any other benchmarks?
>>
>> [1] https://parquet.apache.org/documentation/latest/
>> [2] https://orc.apache.org/docs/
>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> When Parquet came out it was developed by a community of companies and was designed as a library to be supported by multiple big data projects. Nice.
>>
>> ORC, on the other hand, initially only supported Hive. It wasn't even designed as a library that can be reused. Even today it brings in the kitchen sink of transitive dependencies. Yikes.
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>
>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate pushdown.
>>> In the end you have to check which application you are using and do some tests (with predicate pushdown configured correctly). Keep in mind that both formats work best if the data is sorted on the filter columns (which is your responsibility) and if their optimizations are configured correctly (min/max indexes, bloom filters, compression, etc.).
>>>
>>> If you need to ingest sensor data, you may want to store it in HBase first and then batch-process it into large ORC or Parquet files.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>
>>> Just wondering about the advantages and disadvantages of converting data into ORC or Parquet.
>>>
>>> In the Spark documentation there are numerous examples of the Parquet format.
>>>
>>> Any strong reasons to choose Parquet over ORC as the file format?
>>>
>>> Also: the current data compression is bzip2.
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems biased.
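
To make the advice above concrete for janardhan's case (bzip2-compressed input being converted to ORC or Parquet, plus Jörn's points about sorting on filter columns and enabling predicate pushdown), here is a minimal Spark 2.0-era sketch in Scala. The paths, the CSV-with-header layout, and the column names (sensor_id, ts) are hypothetical placeholders; spark.sql.orc.filterPushdown is Spark's switch for ORC filter pushdown (off by default at the time, while the Parquet equivalent, spark.sql.parquet.filterPushdown, defaults to on).

import org.apache.spark.sql.SparkSession

object ConvertToColumnar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bzip2-to-orc-parquet")
      // ORC predicate pushdown is off by default in Spark 2.0;
      // the Parquet equivalent (spark.sql.parquet.filterPushdown) is on.
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()

    // bzip2 is splittable, so the read parallelizes, but it is slow to
    // decompress and row-oriented: read it once and convert.
    val raw = spark.read
      .option("header", "true")
      .csv("hdfs:///data/events.csv.bz2") // hypothetical path and layout

    // Sort on the columns you will filter by, per the advice above:
    // tight min/max ranges per stripe/row group let readers skip I/O.
    val sorted = raw.sortWithinPartitions("sensor_id", "ts")

    sorted.write.orc("hdfs:///data/events_orc")
    sorted.write.parquet("hdfs:///data/events_parquet")

    // Filters on the sort columns can now be pushed down to the reader.
    spark.read.orc("hdfs:///data/events_orc")
      .where("sensor_id = 'S42'")
      .show()

    spark.stop()
  }
}

sortWithinPartitions avoids a full shuffle; a global orderBy would cost a shuffle but clusters values across output files as well, tightening the min/max ranges both formats use to skip stripes or row groups.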