Hi Gourav,

Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an
in-memory DB with data storage, while Parquet is "only" a columnar storage
format.
As I understand, Kudu is a BI DB meant to compete with Exasol or HANA
(OK ... that's more of a wish :-).

Regards,
Uwe

Best regards
Kay-Uwe Moosheimer

> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? It's better than Parquet. But I think that someone
> might just start saying that KUDA has a difficult lineage as well. After all,
> dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and lets me
> access it faster, I do not care where it comes from or how difficult the
> childbirth was :)
>
> Regards,
> Gourav
>
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni
>> <sbpothin...@gmail.com> wrote:
>> Just a correction:
>>
>> The ORC Java libraries from Hive have been forked into Apache ORC.
>> Vectorization is the default.
>>
>> I do not know whether Spark is leveraging this new repo.
>>
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>>
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> parquet was inspired by dremel but written from the ground up as a library
>>> with support for a variety of big data systems (hive, pig, impala,
>>> cascading, etc.). it is also easy to add new support, since it's a proper
>>> library.
>>>
>>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>>> hive. just hive. it didn't really exist by itself. it was part of the big
>>> java soup that is called hive, without an easy way to extract it. hive does
>>> not expose proper java apis. it never cared for that.
>>>
>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU
>>>> <ovidiu-cristian.ma...@inria.fr> wrote:
>>>> Interesting opinion, thank you.
>>>>
>>>> Still, according to the websites, Parquet was basically inspired by
>>>> Dremel (Google) [1], and part of ORC has been enhanced while deployed at
>>>> Facebook and Yahoo [2].
>>>>
>>>> Other than this presentation [3], do you guys know of any other benchmark?
>>>>
>>>> [1] https://parquet.apache.org/documentation/latest/
>>>> [2] https://orc.apache.org/docs/
>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>
>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>> when parquet came out it was developed by a community of companies, and
>>>>> was designed as a library to be supported by multiple big data projects.
>>>>> nice
>>>>>
>>>>> orc on the other hand initially only supported hive. it wasn't even
>>>>> designed as a library that can be re-used. even today it brings in the
>>>>> kitchen sink of transitive dependencies. yikes
>>>>>
>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>> I think both are very similar, but with slightly different goals. While
>>>>>> they work transparently for each Hadoop application, you need to enable
>>>>>> specific support in the application for predicate pushdown.
>>>>>> In the end you have to check which application you are using and do some
>>>>>> tests (with the correct predicate pushdown configuration). Keep in mind
>>>>>> that both formats work best if the data is sorted on the filter columns
>>>>>> (which is your responsibility) and if their optimizations are correctly
>>>>>> configured (min/max index, bloom filter, compression, etc.).
>>>>>>
>>>>>> If you need to ingest sensor data, you may want to store it first in
>>>>>> HBase and then batch process it into large files in ORC or Parquet
>>>>>> format.
>>>>>>
>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Just wondering about the advantages and disadvantages of converting
>>>>>>> data into ORC or Parquet.
>>>>>>>
>>>>>>> In the Spark documentation there are numerous examples of the Parquet
>>>>>>> format.
>>>>>>>
>>>>>>> Are there any strong reasons to choose Parquet over the ORC file format?
>>>>>>>
>>>>>>> Also: the current data compression is bzip2.
>>>>>>>
>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>
>>>>>>> This seems biased.
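
To illustrate Jörn's point in the thread that both formats work best when the data is sorted on the filter columns, here is a rough sketch in plain Python. It does not use ORC or Parquet at all; it only imitates the idea of a min/max index per block of rows. The helper names (`build_row_groups`, `groups_to_scan`), the block size of 1000 rows and the sample data are all made up for illustration.

```python
def build_row_groups(values, group_size):
    """Split values into fixed-size blocks and record min/max for each block."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def groups_to_scan(groups, low, high):
    """Count blocks whose [min, max] range overlaps the filter low <= x <= high."""
    return sum(1 for g in groups if g["max"] >= low and g["min"] <= high)


# Hypothetical data: the values 0..9999 in an interleaved arrival order,
# so every block of 1000 rows spans almost the whole value range.
readings = [(i % 1000) * 10 + i // 1000 for i in range(10_000)]

unsorted_groups = build_row_groups(readings, 1000)
sorted_groups = build_row_groups(sorted(readings), 1000)

# Filter: 100 <= value <= 200
unsorted_scanned = groups_to_scan(unsorted_groups, 100, 200)
sorted_scanned = groups_to_scan(sorted_groups, 100, 200)

print(unsorted_scanned, sorted_scanned)  # prints "10 1"
```

With the interleaved order every block has to be read, because each block's min/max range covers the filter. After sorting, only one block overlaps the filter and the other nine could be skipped. How much a real engine skips depends on the format, the writer settings and whether predicate pushdown is enabled in the application.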