ORC v/s Parquet for Spark 2.0 (Apache Spark User List, http://apache-spark-user-list.1001560.n3.nabble.com/)
Anyone have a link to this discussion? Want to share it with my colleagues.

On Thu, Jul 28, 2016 at 2:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> As far as I know, Spark still lacks the ability to handle updates or
> deletes on ORC transactional tables. As you may know, in Hive an ORC
> transactional table can handle updates and deletes; transactional support
> was added to Hive for ORC tables. There is no transactional support with
> Spark SQL on ORC tables yet. As for locking and concurrency (as used by
> Hive) with a Spark app running a Hive context, I am not convinced this
> actually works. Case in point: you can test it for yourself in Spark and
> see whether locks are applied in the Hive metastore. In my opinion,
> Spark's value comes as a query tool for faster query processing (DAG +
> in-memory capability).
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 28 July 2016 at 18:46, Ofir Manor <ofir.ma...@equalum.io> wrote:
>
>> BTW, this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
>> personally think both are great at this point). But the original question
>> was about Spark 2.0. Does anyone have insights about Parquet-specific
>> optimizations/limitations vs. ORC-specific optimizations/limitations in
>> pre-2.0 vs. 2.0?
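[A minimal way to check the transactional-ORC point above from Spark. This is a sketch only: the table name `txn_demo` and its schema are made up, and it assumes a Hive transactional ORC table already exists in the metastore.]

```scala
// Sketch only: txn_demo is a hypothetical Hive table created in Hive with
//   CREATE TABLE txn_demo (id INT, amount DOUBLE)
//   CLUSTERED BY (id) INTO 4 BUCKETS
//   STORED AS ORC TBLPROPERTIES ('transactional'='true');
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-acid-check")
  .enableHiveSupport()   // required to reach the Hive metastore
  .getOrCreate()

// Reading the table through Spark SQL works:
spark.sql("SELECT count(*) FROM txn_demo").show()

// ...but Hive-style DML on it does not. In Hive this statement succeeds;
// issued through Spark 2.0 it is expected to fail at parse/analysis time:
// spark.sql("UPDATE txn_demo SET amount = 0 WHERE id = 1")
```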
I've put one in the
>> beginning of the thread regarding Structured Streaming, but there was a
>> general claim that pre-2.0 Spark was missing many ORC optimizations, and
>> that some (all?) were added in 2.0. I saw that a lot of related tickets
>> closed in 2.0, but it would be great if someone close to the details
>> could explain.
>>
>> Ofir Manor
>> Co-Founder & CTO | Equalum
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Like anything else, your mileage varies.
>>>
>>> ORC with vectorised query execution
>>> <https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution>
>>> is the nearest one can get to a proper data warehouse like SAP IQ or
>>> Teradata with columnar indexes. To me that is cool. Parquet has been
>>> around and has its use cases as well.
>>>
>>> I guess there is no hard and fast rule on which one to use all the
>>> time. Use the one that provides the best fit for the conditions.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 28 July 2016 at 09:18, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> I see it more as a process of innovation, and thus competition is
>>>> good. Companies just should not follow these religious arguments but
>>>> try for themselves what suits them.
There is more than software when using software ;)
>>>>
>>>> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> And frankly, this is becoming some sort of religious argument now.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>>>>
>>>>> It depends on what you are doing. Here is a recent comparison of ORC
>>>>> and Parquet:
>>>>>
>>>>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>
>>>>> Although it is from the ORC authors, I thought it a fair comparison.
>>>>> We use ORC as the system of record on our Cloudera HDFS cluster, and
>>>>> our experience so far is good.
>>>>>
>>>>> Parquet is backed by Cloudera, which has more installations of
>>>>> Hadoop; ORC is by Hortonworks. So the battle of file formats
>>>>> continues...
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>
>>>>> Seems like the Parquet format is comparatively better than ORC when
>>>>> the dataset is log data without nested structures? Is this a fair
>>>>> understanding?
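[For readers who would rather reproduce this kind of comparison on their own data than rely on the slides: a minimal sketch. The paths and schema are made up; `enableHiveSupport()` is included because ORC support ships with the Hive module in Spark 2.0 builds.]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-vs-parquet")
  .enableHiveSupport()   // ORC data source lives in the Hive module in 2.0
  .getOrCreate()
import spark.implicits._

// A flat, log-like dataset (no nested structures), as in the question above.
val logs = (1 to 1000000)
  .map(i => (i, "2016-07-28T%02d:00:00".format(i % 24), s"msg_$i"))
  .toDF("id", "ts", "message")

logs.write.mode("overwrite").orc("/tmp/fmt_bench/orc")
logs.write.mode("overwrite").parquet("/tmp/fmt_bench/parquet")

// Compare on-disk sizes, then time a representative filter on each copy:
spark.read.orc("/tmp/fmt_bench/orc").filter($"id" === 42).count()
spark.read.parquet("/tmp/fmt_bench/parquet").filter($"id" === 42).count()
```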
>>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> Kudu has, from my impression, been designed to offer something
>>>>>> between HBase and Parquet for write-intensive loads. It is not
>>>>>> faster for warehouse-type querying compared to Parquet (possibly
>>>>>> slower, because that is not its use case). I assume this is still
>>>>>> its strategy.
>>>>>>
>>>>>> For some scenarios it could make sense together with Parquet and
>>>>>> ORC. However, I am not sure what the advantage is over using HBase
>>>>>> plus Parquet or ORC.
>>>>>>
>>>>>> On 27 Jul 2016, at 11:47, u...@moosheimer.com wrote:
>>>>>>
>>>>>> Hi Gourav,
>>>>>>
>>>>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is
>>>>>> an in-memory DB with data storage, while Parquet is "only" a
>>>>>> columnar storage format.
>>>>>>
>>>>>> As I understand it, Kudu is a BI DB to compete with Exasol or HANA
>>>>>> (ok ... that's more a wish :-).
>>>>>>
>>>>>> Best regards,
>>>>>> Kay-Uwe Moosheimer
>>>>>>
>>>>>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>
>>>>>> Gosh,
>>>>>>
>>>>>> whether ORC came from this or that, it runs queries in Hive with
>>>>>> Tez at a speed that is better than Spark.
>>>>>>
>>>>>> Has anyone heard of Kudu? It's better than Parquet. But I think
>>>>>> that someone might just start saying that Kudu has a difficult
>>>>>> lineage as well. After all, dynastic rules dictate.
>>>>>> Personally, I feel that if something stores my data compressed and
>>>>>> makes me access it faster, I do not care where it comes from or how
>>>>>> difficult the childbirth was :)
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>>>>>>
>>>>>>> Just a correction:
>>>>>>>
>>>>>>> The ORC Java libraries from Hive have been forked into Apache ORC,
>>>>>>> with vectorization on by default.
>>>>>>>
>>>>>>> I do not know if Spark is leveraging this new repo:
>>>>>>>
>>>>>>> <dependency>
>>>>>>>   <groupId>org.apache.orc</groupId>
>>>>>>>   <artifactId>orc</artifactId>
>>>>>>>   <version>1.1.2</version>
>>>>>>>   <type>pom</type>
>>>>>>> </dependency>
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>> Parquet was inspired by Dremel but written from the ground up as a
>>>>>>> library, with support for a variety of big data systems (Hive,
>>>>>>> Pig, Impala, Cascading, etc.). It is also easy to add new support,
>>>>>>> since it's a proper library.
>>>>>>>
>>>>>>> ORC has been enhanced while deployed at Facebook in Hive and at
>>>>>>> Yahoo in Hive. Just Hive. It didn't really exist by itself; it was
>>>>>>> part of the big Java soup that is called Hive, without an easy way
>>>>>>> to extract it. Hive does not expose proper Java APIs. It never
>>>>>>> cared for that.
>>>>>>>
>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>>>>>>>
>>>>>>>> Interesting opinion, thank you.
>>>>>>>>
>>>>>>>> Still, on the website Parquet is basically inspired by Dremel
>>>>>>>> (Google) [1], and part of ORC has been enhanced while deployed at
>>>>>>>> Facebook and Yahoo [2].
>>>>>>>>
>>>>>>>> Other than this presentation [3], do you guys know any other
>>>>>>>> benchmark?
>>>>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>>>>> [2] https://orc.apache.org/docs/
>>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>
>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>> When Parquet came out, it was developed by a community of
>>>>>>>> companies and was designed as a library to be supported by
>>>>>>>> multiple big data projects. Nice.
>>>>>>>>
>>>>>>>> ORC, on the other hand, initially only supported Hive. It wasn't
>>>>>>>> even designed as a library that can be re-used. Even today it
>>>>>>>> brings in the kitchen sink of transitive dependencies. Yikes.
>>>>>>>>
>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think both are very similar, but with slightly different
>>>>>>>>> goals. While they work transparently for each Hadoop
>>>>>>>>> application, you need to enable specific support in the
>>>>>>>>> application for predicate push-down.
>>>>>>>>>
>>>>>>>>> In the end you have to check which application you are using and
>>>>>>>>> do some tests (with correct predicate push-down configuration).
>>>>>>>>> Keep in mind that both formats work best if they are sorted on
>>>>>>>>> filter columns (which is your responsibility) and if their
>>>>>>>>> optimizations are correctly configured (min/max index, bloom
>>>>>>>>> filter, compression, etc.).
>>>>>>>>>
>>>>>>>>> If you need to ingest sensor data, you may want to store it
>>>>>>>>> first in HBase and then batch-process it into large files in ORC
>>>>>>>>> or Parquet format.
>>>>>>>>>
>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Just wondering about the advantages and disadvantages of
>>>>>>>>> converting data into ORC or Parquet.
>>>>>>>>>
>>>>>>>>> In the documentation of Spark there are numerous examples of the
>>>>>>>>> Parquet format.
>>>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>>>
>>>>>>>>> Also: current data compression is bzip2.
>>>>>>>>>
>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>> This seems biased.
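[Pulling the practical advice in this thread together: a hedged sketch of the relevant Spark 2.0 knobs. The conf keys below exist in Spark 2.0, but defaults may differ across versions; the paths and column names are made up, and the ORC bloom-filter write option is an assumption to verify against your ORC version.]

```scala
// Assumes a spark-shell session, i.e. `spark` (a SparkSession) is in scope.

// Predicate push-down toggles (Parquet's is on by default in Spark 2.0;
// ORC's has historically defaulted to off - check your release):
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Re-encode bzip2'd source data, sorted on the usual filter column so that
// min/max row-group/stripe statistics stay selective:
val raw = spark.read.json("/data/raw_logs_json_bz2")  // Spark reads .bz2 text/JSON directly

raw.sort("ts")
  .write
  .option("compression", "snappy")            // per-write Parquet codec option
  .parquet("/data/logs_parquet")

raw.sort("ts")
  .write
  .option("orc.bloom.filter.columns", "ts")   // assumption: ORC writer property
  .orc("/data/logs_orc")
```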