Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Alexander Pivovarov
Anyone have a link to this discussion? Want to share it with my colleagues. On Thu, Jul 28, 201

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
As far as I know, Spark still lacks the ability to handle updates or deletes vis-à-vis ORC transactional tables. As you may know, in Hive an ORC transactional table can handle updates and deletes. Transactional support was added to Hive for ORC tables. There is no transactional support with Spark SQL on ORC

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Ofir Manor
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I personally think both are great at this point). But the original question was about Spark 2.0. Does anyone have insights into Parquet-specific optimizations / limitations vs. ORC-specific optimizations / limitations in

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
Like anything else, your mileage varies. ORC with vectorised query execution is the nearest one can get to a proper data warehouse like SAP IQ or Teradata with columnar indexes. To me that is cool. Parquet has been around
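
The vectorised execution mentioned above means the engine processes rows in batches (1024 by default in Hive's vectorized ORC reader) rather than one at a time, amortizing per-row interpretation overhead. A toy pure-Python sketch of the shape of the idea (not real engine code):

```python
# Toy sketch of vectorized (batch-at-a-time) execution vs. row-at-a-time.
# Hive's vectorized ORC reader uses ~1024-row batches to amortize per-row
# interpretation overhead; this only illustrates the control-flow shape.

BATCH_SIZE = 1024

def row_at_a_time_sum(rows):
    """One dispatch per row: the pattern vectorization avoids."""
    total = 0
    for r in rows:
        total += r
    return total

def vectorized_sum(rows):
    """Operate on whole batches at once instead of single rows."""
    total = 0
    for i in range(0, len(rows), BATCH_SIZE):
        total += sum(rows[i:i + BATCH_SIZE])  # one operation per batch
    return total

data = list(range(10_000))
print(row_at_a_time_sum(data), vectorized_sum(data))  # both 49995000
```

In a real engine the per-batch operation is a tight loop over primitive arrays, which is where the speedup comes from; the Python version only shows the batching structure.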

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Jörn Franke
I see it more as a process of innovation, and thus competition is good. Companies should not just follow these religious arguments but try for themselves what suits them. There is more than software when using software ;) > On 28 Jul 2016, at 01:44, Mich Talebzadeh wrote:

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Mich Talebzadeh
And frankly this is becoming some sort of religious argument now.

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Sudhir Babu Pothineni
It depends on what you are doing; here is a recent comparison of ORC and Parquet: https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet Although it is from the ORC authors, I thought it a fair comparison. We use ORC as the System of Record on our Cloudera HDFS cluster; our experience

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread janardhan shetty
Seems like the parquet format is better compared to orc when the dataset is log data without nested structures? Is this a fair understanding? On Jul 27, 2016 1:30 PM, "Jörn Franke" wrote: > Kudu has been from my impression be designed to offer somethings between > hbase and

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
Kudu has, from my impression, been designed to offer something between hbase and parquet for write-intensive loads - it is not faster for warehouse-type querying compared to parquet (merely slower, because that is not its use case). I assume this is still the strategy of it. For some

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
Because everyone here is discussing this ever-changing-for-better-reason topic of storage formats and serdes: any opinions/thoughts/experience with Apache Arrow? It sounds like a nice idea, but how ready is it? On Wed, Jul 27, 2016 at 11:31 PM, Jörn Franke wrote: > Kudu has

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread u...@moosheimer.com
Hi Gourav, Kudu (if you mean Apache Kuda, the Cloudera-originated project) is an in-memory db with data storage, while Parquet is "only" a columnar storage format. As I understand it, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-). Regards, Uwe

Re:Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread prosp4300
Thanks for this immediate correction :) On 2016-07-27 15:17:54, "Gourav Sengupta" wrote: Sorry, in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is mentioned in forums first with a wrong spelling. It's got a difficult beginning

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Sorry, in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is mentioned in forums first with a wrong spelling. It's got a difficult beginning where people were trying to figure out its name. Regards, Gourav Sengupta On Wed, Jul 27, 2016 at 8:15 AM, Gourav

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Gosh, whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK. Has anyone heard of KUDA? It's better than Parquet. But I think that someone might just start saying that KUDA has a difficult lineage as well. After all, dynastic rules dictate.

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
i don't think so, but that sounds like a good idea On Tue, Jul 26, 2016 at 6:19 PM, Sudhir Babu Pothineni < sbpothin...@gmail.com> wrote: > Just a correction: > > The ORC Java libraries from Hive are forked into Apache ORC. Vectorization > is on by default. > > Do not know if Spark is leveraging this new repo? > >

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since it's a proper library. orc has been enhanced while deployed at facebook in hive and at yahoo in

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you. Still, per the website, Parquet is basically inspired by Dremel (Google) [1] and part of ORC has been enhanced while deployed for Facebook, Yahoo [2]. Other than this presentation [3], do you guys know any other benchmark?

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects, which is nice. orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files? It's hard to understand your ‘apparently..’. Best, Ovidiu > On 26 Jul 2016, at 13:10, Gourav Sengupta wrote: > > If you have ever tried to use ORC via SPARK you will know that SPARK's >

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Gourav Sengupta
If you have ever tried to use ORC via SPARK you will know that SPARK's promise of accessing ORC files is a sham. SPARK cannot access partitioned ORC tables via HiveContext, SPARK cannot stripe through ORC faster, and what's more, if you are using SQL and have thought of using HIVE with ORC

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ofir Manor
One additional point specific to Spark 2.0 - for the alpha Structured Streaming API (only), the file sink only supports Parquet format (I'm sure that limitation will be lifted in a future release before Structured Streaming is GA): "File sink - Stores the output to a directory. As of Spark

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Jörn Franke
I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push-down. In the end you have to check which application you are using and do some tests (with
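
The predicate push-down both formats support works off per-block min/max statistics (row groups in Parquet, stripes in ORC): the reader skips any block whose statistics rule out the filter. A toy pure-Python sketch of that idea (not actual ORC/Parquet reader code; names and sizes are illustrative only):

```python
# Toy illustration of predicate push-down: each "row group" records
# min/max statistics, so a reader can skip groups that cannot possibly
# contain matching rows. Real formats store these stats in file metadata.

def make_row_groups(values, group_size):
    """Split values into groups and record min/max stats per group."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_with_pushdown(groups, lo, hi):
    """Return rows in [lo, hi], skipping groups ruled out by stats."""
    matched, skipped = [], 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            skipped += 1          # pruned without reading the rows
            continue
        matched.extend(v for v in g["rows"] if lo <= v <= hi)
    return matched, skipped

groups = make_row_groups(list(range(100)), group_size=10)
rows, skipped = scan_with_pushdown(groups, lo=42, hi=45)
print(rows, skipped)  # [42, 43, 44, 45] 9  -> 9 of 10 groups never read
```

This is why the push-down only pays off when data is at least roughly sorted or clustered on the filtered column: otherwise every group's min/max range overlaps the predicate and nothing gets skipped.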

Re: ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Thanks Timur for the explanation. What about if the data is log data which is delimited (csv or tsv), doesn't have too many nestings, and is in flat file formats? On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao wrote: > 1) The opinions on StackOverflow are correct, not biased.

ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Just wondering about the advantages and disadvantages of converting data into ORC or Parquet. In the documentation of Spark there are numerous examples of the Parquet format. Any strong reasons to choose Parquet over the ORC file format? Also: current data compression is bzip2
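
For context on the question: the advantage ORC and Parquet share over compressed row-oriented text (like bzip2 CSV) is columnar layout, so a query touching one column reads only that column's bytes instead of decompressing whole rows. A toy pure-Python sketch of the layout difference (conceptual only; real formats add encodings, compression, and statistics):

```python
# Toy sketch of row-oriented vs. column-oriented storage, the core
# difference between CSV-style files and ORC/Parquet. (Illustrative data.)

rows = [
    {"ts": 1, "level": "INFO",  "msg": "started"},
    {"ts": 2, "level": "ERROR", "msg": "boom"},
    {"ts": 3, "level": "INFO",  "msg": "ok"},
]

# Row-oriented layout (like one CSV line per record): whole rows on disk,
# so even a single-column query must read every field of every row.
row_store = [list(r.values()) for r in rows]

# Column-oriented layout: one contiguous array per column, so a query
# can read only the columns it needs.
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# "SELECT count(*) WHERE level = 'ERROR'" touches only the level column.
errors = sum(1 for lvl in col_store["level"] if lvl == "ERROR")
print(errors)  # 1
```

Grouping a column's values together also makes them compress far better than mixed row data, which is part of why both formats beat bzip2 text on scan-heavy workloads.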