Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Alexander Pivovarov
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in Apache Spark
User List 
http://apache-spark-user-list.1001560.n3.nabble.com/

Anyone have a link to this discussion? Want to share it with my colleagues.

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
As far as I know, Spark still lacks the ability to handle updates or deletes
vis-à-vis ORC transactional tables. As you may know, in Hive an ORC
transactional table can handle updates and deletes; transactional support
was added to Hive for ORC tables. There is no transactional support with
Spark SQL on ORC tables yet. As for locking and concurrency (as used by
Hive) with a Spark app running a Hive context, I am not convinced this
actually works. Case in point: you can test it for yourself in Spark and see
whether locks are applied in the Hive metastore. In my opinion Spark's value
comes as a query tool for faster query processing (DAG + in-memory capability)
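
For illustration, a minimal sketch of the kind of test meant here (Spark 2.0
style; the database and table names are made up, and SHOW LOCKS has to be run
from the Hive side while the Spark query executes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-lock-check")
  .enableHiveSupport()                 // requires a configured Hive metastore
  .getOrCreate()

// "test.sales_orc" is a hypothetical ORC transactional table created in Hive
spark.sql("SELECT count(*) FROM test.sales_orc").show()

// While the query runs, from beeline / the Hive CLI run:
//   SHOW LOCKS test.sales_orc;
// If Spark honoured Hive's locking you would expect a shared read lock to appear.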

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Ofir Manor
BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
personally think both are great at this point).
But the original question was about Spark 2.0. Does anyone have insights
about Parquet-specific optimizations/limitations vs. ORC-specific
optimizations/limitations in pre-2.0 vs. 2.0? I've put one in the beginning
of the thread regarding Structured Streaming, but there was a general claim
that pre-2.0 Spark was missing many ORC optimizations, and that some (all?)
were added in 2.0.
I saw that a lot of related tickets were closed in 2.0, but it would be
great if someone close to the details could explain.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Mich Talebzadeh
Like anything else, your mileage varies.

ORC with vectorised query execution is the nearest one can get to a proper
data warehouse like SAP IQ or Teradata with columnar indexes. To me that is
cool. Parquet has been around and has its use case as well.

I guess there is no hard and fast rule about which one to use all the time.
Use the one that provides the best fit for the conditions.
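
As a minimal sketch of the sort of comparison meant here (paths and column
names are made up; note that ORC predicate pushdown had to be switched on
explicitly in Spark at the time, while Parquet's is on by default):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Enable predicate pushdown for the ORC reader
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val orcDf     = spark.read.orc("/data/trades_orc")        // hypothetical paths
val parquetDf = spark.read.parquet("/data/trades_parquet")

// The same filter; with pushdown both formats can skip whole stripes / row groups
orcDf.filter("trade_date = '2016-07-28'").groupBy("region").count().show()
parquetDf.filter("trade_date = '2016-07-28'").groupBy("region").count().show()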

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Jörn Franke
I see it more as a process of innovation, and competition is good.
Companies should not follow these religious arguments but try for themselves
what suits them. There is more to using software than the software itself ;)


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Mich Talebzadeh
And frankly this is becoming some sort of religious argument now



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Sudhir Babu Pothineni
It depends on what you are doing; here is a recent comparison of ORC and Parquet:

https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet

Although it is from the ORC authors, I thought it was a fair comparison. We use
ORC as the system of record on our Cloudera HDFS cluster, and our experience so
far is good.

Parquet is backed by Cloudera, which has more Hadoop installations; ORC is backed
by Hortonworks, so the battle of file formats continues...

Sent from my iPhone


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread janardhan shetty
It seems like the Parquet format is comparatively better than ORC when the
dataset is log data without nested structures? Is this a fair understanding?

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
Kudu has, from my impression, been designed to offer something between HBase
and Parquet for write-intensive loads - it is not faster for warehouse-type
querying compared to Parquet (merely slower, because that is not its use case).
I assume this is still its strategy.

For some scenarios it could make sense together with Parquet and ORC. However,
I am not sure what the advantage is over using HBase + Parquet and ORC.


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
Since everyone here is discussing this ever-changing (for good reasons) topic
of storage formats and serdes: any opinions/thoughts/experience with
Apache Arrow? It sounds like a nice idea, but how ready is it?


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread u...@moosheimer.com
Hi Gourav,

Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory
DB with data storage, while Parquet is "only" a columnar storage format.

As I understand it, Kudu is a BI DB to compete with Exasol or Hana (ok ...
that's more a wish :-).

Regards,
Uwe

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Sorry,

in my email above I was referring to KUDU, and there it goes: how can KUDU be
right if it is first mentioned in forums with a wrong spelling? It's had a
difficult beginning, with people trying to figure out its name.


Regards,
Gourav Sengupta

On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta 
wrote:

> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just a correction:
>>
>> The ORC Java libraries from Hive have been forked into Apache ORC, with
>> vectorization on by default.
>>
>> I do not know if Spark is leveraging this new repo yet:
>>
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
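
One rough way to check which ORC implementation a Spark job actually loads (a
minimal sketch from a spark-shell; the class below is the Hive-bundled ORC entry
point, so whether it resolves to a hive-exec jar or to a standalone
org.apache.orc jar answers the question above):

// Print the jar that provides the ORC reader class on the driver classpath.
// If it points at hive-exec, Spark is still using the Hive-bundled ORC code
// rather than the standalone org.apache.orc artifact.
val orcClass = Class.forName("org.apache.hadoop.hive.ql.io.orc.OrcFile")
println(orcClass.getProtectionDomain.getCodeSource.getLocation)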
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>>
 I think both are very similar, but with slightly different goals. While
 they work transparently with any Hadoop application, you need to enable
 specific support in the application for predicate push down.
 In the end you have to check which application you are using and do
 some tests (with the correct predicate push down configuration). Keep in mind
 that both formats work best if they are sorted on the filter columns (which is
 your responsibility) and if their optimizations are correctly configured
 (min/max indexes, bloom filters, compression, etc.).

 If you need to ingest sensor data you may want to store it first in
 hbase and then batch process it in large files in Orc or parquet format.

 On 26 Jul 2016, at 04:09, janardhan shetty 
 wrote:

 Just wondering about the advantages and disadvantages of converting data into
 ORC or Parquet.

 In the documentation of Spark there are numerous examples of the Parquet
 format.

 Any strong reasons to choose Parquet over the ORC file format?

 Also: current data compression is bzip2

 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
 This seems biased.


>>>
>>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Gosh,

whether ORC came from this or that, it runs queries in HIVE with TEZ at a
speed that is better than SPARK's.

Has anyone heard of KUDA? It's better than Parquet. But I think that someone
might just start saying that KUDA has a difficult lineage as well. After all,
dynastic rules dictate.

Personally I feel that if something stores my data compressed and lets me
access it faster, I do not care where it comes from or how difficult the
childbirth was :)


Regards,
Gourav

On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
sbpothin...@gmail.com> wrote:

> Just a correction:
>
> The ORC Java libraries from Hive have been forked into Apache ORC, with
> vectorization enabled by default.
>
> Does anyone know if Spark is leveraging this new repo yet?
>
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since its a proper
> library.
>
> orc has been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently with any Hadoop application, you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with the correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on the filter columns (which is your
>>> responsibility) and if their optimizations are correctly configured (min/max
>>> indexes, bloom filters, compression, etc.).
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty 
>>> wrote:
>>>
>>> Just wondering about the advantages and disadvantages of converting data
>>> into ORC or Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of the Parquet
>>> format.
>>>
>>> Any strong reasons to choose Parquet over the ORC file format?
>>>
>>> Also: current data compression is bzip2
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems biased.
>>>
>>>
>>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
i don't think so, but that sounds like a good idea

On Tue, Jul 26, 2016 at 6:19 PM, Sudhir Babu Pothineni <
sbpothin...@gmail.com> wrote:

> Just a correction:
>
> The ORC Java libraries from Hive have been forked into Apache ORC, with
> vectorization enabled by default.
>
> Does anyone know if Spark is leveraging this new repo yet?
>
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since its a proper
> library.
>
> orc has been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently with any Hadoop application, you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with the correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on the filter columns (which is your
>>> responsibility) and if their optimizations are correctly configured (min/max
>>> indexes, bloom filters, compression, etc.).
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty 
>>> wrote:
>>>
>>> Just wondering about the advantages and disadvantages of converting data
>>> into ORC or Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of the Parquet
>>> format.
>>>
>>> Any strong reasons to choose Parquet over the ORC file format?
>>>
>>> Also: current data compression is bzip2
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems biased.
>>>
>>>
>>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
parquet was inspired by dremel but written from the ground up as a library
with support for a variety of big data systems (hive, pig, impala,
cascading, etc.). it is also easy to add new support, since its a proper
library.

orc has been enhanced while deployed at facebook in hive and at yahoo in
hive. just hive. it didn't really exist by itself. it was part of the big
java soup that is called hive, without an easy way to extract it. hive does
not expose proper java apis. it never cared for that.

On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Interesting opinion, thank you
>
> Still, on the website parquet is basically inspired by Dremel (Google) [1]
> and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
>
> Other than this presentation [3], do you guys know any other benchmark?
>
> [1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3]
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>
> when parquet came out it was developed by a community of companies, and
> was designed as a library to be supported by multiple big data projects.
> nice
>
> orc on the other hand initially only supported hive. it wasn't even
> designed as a library that can be re-used. even today it brings in the
> kitchen sink of transitive dependencies. yikes
>
> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>
>> I think both are very similar, but with slightly different goals. While
>> they work transparently with any Hadoop application, you need to enable
>> specific support in the application for predicate push down.
>> In the end you have to check which application you are using and do some
>> tests (with the correct predicate push down configuration). Keep in mind that
>> both formats work best if they are sorted on the filter columns (which is your
>> responsibility) and if their optimizations are correctly configured (min/max
>> indexes, bloom filters, compression, etc.).
>>
>> If you need to ingest sensor data you may want to store it first in hbase
>> and then batch process it in large files in Orc or parquet format.
>>
>> On 26 Jul 2016, at 04:09, janardhan shetty 
>> wrote:
>>
>> Just wondering about the advantages and disadvantages of converting data
>> into ORC or Parquet.
>>
>> In the documentation of Spark there are numerous examples of the Parquet
>> format.
>>
>> Any strong reasons to choose Parquet over the ORC file format?
>>
>> Also: current data compression is bzip2
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
>>
>>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you

Still, on the website parquet is basically inspired by Dremel (Google) [1] and 
part of orc has been enhanced while deployed for Facebook, Yahoo [2].

Other than this presentation [3], do you guys know any other benchmark?

[1]https://parquet.apache.org/documentation/latest/ 

[2]https://orc.apache.org/docs/ 
[3] 
http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet 


> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
> 
> when parquet came out it was developed by a community of companies, and was 
> designed as a library to be supported by multiple big data projects. nice
> 
> orc on the other hand initially only supported hive. it wasn't even designed 
> as a library that can be re-used. even today it brings in the kitchen sink of 
> transitive dependencies. yikes
> 
> 
> On Jul 26, 2016 5:09 AM, "Jörn Franke"  > wrote:
> I think both are very similar, but with slightly different goals. While they 
> work transparently with any Hadoop application, you need to enable specific 
> support in the application for predicate push down. 
> In the end you have to check which application you are using and do some 
> tests (with the correct predicate push down configuration). Keep in mind that 
> both formats work best if they are sorted on the filter columns (which is your 
> responsibility) and if their optimizations are correctly configured (min/max 
> indexes, bloom filters, compression, etc.). 
> 
> If you need to ingest sensor data you may want to store it first in hbase and 
> then batch process it in large files in Orc or parquet format.
> 
> On 26 Jul 2016, at 04:09, janardhan shetty  > wrote:
> 
>> Just wondering about the advantages and disadvantages of converting data 
>> into ORC or Parquet. 
>> 
>> In the documentation of Spark there are numerous examples of the Parquet format. 
>> 
>> Any strong reasons to choose Parquet over the ORC file format?
>> 
>> Also: current data compression is bzip2
>> 
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.



Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
when parquet came out it was developed by a community of companies, and was
designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even
designed as a library that can be re-used. even today it brings in the
kitchen sink of transitive dependencies. yikes

On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:

> I think both are very similar, but with slightly different goals. While
> they work transparently with any Hadoop application, you need to enable
> specific support in the application for predicate push down.
> In the end you have to check which application you are using and do some
> tests (with the correct predicate push down configuration). Keep in mind that
> both formats work best if they are sorted on the filter columns (which is your
> responsibility) and if their optimizations are correctly configured (min/max
> indexes, bloom filters, compression, etc.).
>
> If you need to ingest sensor data you may want to store it first in hbase
> and then batch process it in large files in Orc or parquet format.
>
> On 26 Jul 2016, at 04:09, janardhan shetty  wrote:
>
> Just wondering about the advantages and disadvantages of converting data
> into ORC or Parquet.
>
> In the documentation of Spark there are numerous examples of the Parquet
> format.
>
> Any strong reasons to choose Parquet over the ORC file format?
>
> Also: current data compression is bzip2
>
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
> This seems biased.
>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files?
It’s hard to understand your ‘apparently...’.

Best,
Ovidiu
> On 26 Jul 2016, at 13:10, Gourav Sengupta  wrote:
> 
> If you have ever tried to use ORC via SPARK you will know that SPARK's 
> promise of accessing ORC files is a sham. SPARK cannot access partitioned 
> ORC tables via HiveContext, SPARK cannot stripe through ORC any faster, and 
> what is more, if you are using SQL and have thought of using HIVE with ORC 
> on TEZ, then it runs way better, faster and leaner than SPARK. 
> 
> I can process a few billion records, close to a terabyte, in a cluster 
> with around 100 GB of RAM and 40 cores in a few hours, and find it a challenge 
> to do the same with SPARK. 
> 
> But apparently, everything is resolved in SPARK 2.0.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor  > wrote:
> One additional point specific to Spark 2.0 - for the alpha Structured 
> Streaming API (only),  the file sink only supports Parquet format (I'm sure 
> that limitation will be lifted in a future release before Structured 
> Streaming is GA):
>  "File sink - Stores the output to a directory. As of Spark 2.0, this 
> only supports Parquet file format, and Append output mode."
>  
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
>  
> 



Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Gourav Sengupta
If you have ever tried to use ORC via SPARK you will know that SPARK's
promise of accessing ORC files is a sham. SPARK cannot access partitioned
ORC tables via HiveContext, SPARK cannot stripe through ORC any faster, and
what is more, if you are using SQL and have thought of using HIVE with ORC
on TEZ, then it runs way better, faster and leaner than SPARK.

I can process a few billion records, close to a terabyte, in a cluster
with around 100 GB of RAM and 40 cores in a few hours, and find it a challenge
to do the same with SPARK.

But apparently, everything is resolved in SPARK 2.0.
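
[Editor's note: for reference, a minimal sketch (Scala, Spark 2.0) of the access
path being criticised here, reading a partitioned, ORC-backed Hive table through
Hive support, the SparkSession successor of HiveContext. The database, table and
partition values are hypothetical.]

import org.apache.spark.sql.SparkSession

// Spark reaches Hive-managed, partitioned ORC tables through Hive support.
val spark = SparkSession.builder()
  .appName("orc-partitioned-read-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical database/table; the WHERE clause targets a single partition.
val readings = spark.sql(
  "SELECT sensor_id, reading FROM warehouse.readings_orc WHERE dt = '2016-07-26'")

readings.show()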


Regards,
Gourav Sengupta

On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor  wrote:

> One additional point specific to Spark 2.0 - for the alpha Structured
> Streaming API (only),  the file sink only supports Parquet format (I'm sure
> that limitation will be lifted in a future release before Structured
> Streaming is GA):
>  "File sink - Stores the output to a directory. As of Spark 2.0, this
> only supports Parquet file format, and Append output mode."
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ofir Manor
One additional point specific to Spark 2.0 - for the alpha Structured
Streaming API (only),  the file sink only supports Parquet format (I'm sure
that limitation will be lifted in a future release before Structured
Streaming is GA):
 "File sink - Stores the output to a directory. As of Spark 2.0, this
only supports Parquet file format, and Append output mode."

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
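
[Editor's note: a minimal sketch of that Parquet-only file sink (Scala, Spark 2.0
Structured Streaming). The schema, input directory and output paths are hypothetical.]

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("file-sink-sketch").getOrCreate()

// File-based streaming sources need an explicit schema.
val schema = new StructType()
  .add("sensor_id", StringType)
  .add("reading", DoubleType)

val events = spark.readStream
  .schema(schema)
  .json("/data/incoming")                    // hypothetical landing directory

events.writeStream
  .format("parquet")                         // as of 2.0 the file sink is Parquet-only
  .option("path", "/data/events_parquet")    // hypothetical output directory
  .option("checkpointLocation", "/data/checkpoints/events")
  .outputMode("append")                      // the file sink supports Append mode
  .start()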



Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Jörn Franke
I think both are very similar, but with slightly different goals. While they 
work transparently with any Hadoop application, you need to enable specific 
support in the application for predicate push down. 
In the end you have to check which application you are using and do some tests 
(with the correct predicate push down configuration). Keep in mind that both 
formats work best if they are sorted on the filter columns (which is your 
responsibility) and if their optimizations are correctly configured (min/max 
indexes, bloom filters, compression, etc.). 

If you need to ingest sensor data you may want to store it first in hbase and 
then batch process it in large files in Orc or parquet format.
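
[Editor's note: a minimal sketch of those knobs in Scala/Spark. The paths, table
and column names are hypothetical, the bloom filter is shown as a Hive table
property, and exact option support varies with the Spark and ORC versions in use.]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-sketch")
  .enableHiveSupport()          // ORC support in Spark 2.0 goes through the Hive module
  .getOrCreate()

// Parquet predicate push down is on by default; ORC push down is not (as of 2.0).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Sorting on the filter column before writing keeps min/max indexes selective.
val raw = spark.read.json("/data/raw_events")          // hypothetical input
raw.sortWithinPartitions("sensor_id")
  .write
  .mode("overwrite")
  .orc("/data/events_orc_sorted")                      // hypothetical output

// Bloom filters and compression are ORC table properties in Hive DDL.
spark.sql("""
  CREATE TABLE IF NOT EXISTS readings_orc (sensor_id STRING, reading DOUBLE)
  STORED AS ORC
  TBLPROPERTIES ("orc.bloom.filter.columns" = "sensor_id", "orc.compress" = "ZLIB")
""")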

> On 26 Jul 2016, at 04:09, janardhan shetty  wrote:
> 
> Just wondering advantages and disadvantages to convert data into ORC or 
> Parquet. 
> 
> In the documentation of Spark there are numerous examples of Parquet format. 
> 
> Any strong reasons to chose Parquet over ORC file format ?
> 
> Also : current data compression is bzip2
> 
> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy 
> This seems like biased.


Re: ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Thanks Timur for the explanation.
What about the case where the data is log data which is delimited (CSV or TSV),
doesn't have much nesting, and sits in flat files?
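
[Editor's note: one way to answer that is to convert a sample and measure. A
minimal Scala/Spark sketch with hypothetical paths and delimiter; Hive support
is enabled because the ORC writer in Spark 2.0 requires it.]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("log-conversion-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Spark reads .bz2 files transparently through Hadoop's compression codecs,
// so the existing logs can be loaded as-is.
val logs = spark.read
  .option("header", "true")
  .option("delimiter", "\t")            // "\t" for TSV, "," for CSV
  .csv("/data/logs/*.bz2")              // hypothetical input location

// Rewrite once in each columnar format, then compare size and query times
// on your own workload.
logs.write.option("compression", "snappy").parquet("/data/logs_parquet")
logs.write.orc("/data/logs_orc")        // ZLIB compression by default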

On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao  wrote:

> 1) The opinions on StackOverflow are correct, not biased.
> 2) Cloudera promoted Parquet; Hortonworks promoted ORC + Tez. When it became
> obvious that a file format alone is not enough and Impala sucks, Cloudera
> announced https://vision.cloudera.com/one-platform/ and focused on Spark.
> 3) There is a race between ORC & Parquet: after some perfect release ORC
> becomes better & faster, then, several months later, Parquet may outperform it.
> 4) If you use "flat" tables --> ORC is better. If you have highly nested
> data with arrays inside of dictionaries (for instance, JSON that isn't
> flattened) then maybe one should choose Parquet.
> 5) AFAIK, Parquet has its metadata at the end of the file (correct me if
> something has changed). It means that a Parquet file must be completely read
> & put into RAM. If there is not enough RAM or the file is somehow corrupted
> --> problems arise.
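
[Editor's note: to make point 4 concrete, a minimal Scala/Spark sketch of the
flat-versus-nested distinction. The records, paths and nested field names are
hypothetical, and Hive support is enabled because the ORC writer in Spark 2.0
requires it.]

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("flat-vs-nested-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Flat, table-like records: either format handles these well.
val flat = Seq((1L, "ok", 0.5), (2L, "error", 1.5)).toDF("id", "status", "score")
flat.write.mode("overwrite").orc("/tmp/flat_orc")

// Deeply nested, un-flattened JSON events (structs and arrays).
val nested = spark.read.json("/data/events.json")     // hypothetical input
nested.write.mode("overwrite").parquet("/tmp/nested_parquet")

// Reading back a single nested field; the columnar reader prunes the rest.
spark.read.parquet("/tmp/nested_parquet")
  .select($"payload.user.id")                         // hypothetical nested path
  .show()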
>
> On Tue, Jul 26, 2016 at 5:09 AM, janardhan shetty 
> wrote:
>
>> Just wondering about the advantages and disadvantages of converting data
>> into ORC or Parquet.
>>
>> In the documentation of Spark there are numerous examples of the Parquet
>> format.
>>
>> Any strong reasons to choose Parquet over the ORC file format?
>>
>> Also: current data compression is bzip2
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems biased.
>>
>
>