Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson: Awesome, it worked with "org.elasticsearch.spark.sql".
But as soon as I switched to *elasticsearch-spark-20_2.12*, "es" also
worked.
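
For anyone hitting the same error, here is a minimal sketch of the write using
the fully qualified source name (the elasticOptions map and index name are
taken from this thread; everything else is an assumption, not a complete
configuration):

// Assumes elasticsearch-spark-30_2.12 is on the runtime classpath (i.e. not
// marked "provided") and that elasticOptions carries es.nodes, es.port, etc.
df.write
  .format("org.elasticsearch.spark.sql") // fully qualified source name for the Spark 3 connector
  .mode("overwrite")
  .options(elasticOptions)
  .save("index_name")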


On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev  wrote:

>
> Let me try that and get back. Just wondering, is there a change in the
> way we pass the format to the connector from Spark 2 to 3?
>
>
> On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson 
> wrote:
>
>> I am pretty certain you need to change the write.format from “es” to
>> “org.elasticsearch.spark.sql”
>>
>> Sent from my iPhone
>>
>> On 8 Sep 2023, at 03:10, Dipayan Dev  wrote:
>>
>> 
>>
>> ++ Dev
>>
>> On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev 
>> wrote:
>>
>>> Hi,
>>>
>>> Can you please elaborate your last response? I don’t have any external
>>> dependencies added, and just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
>>>> could the provided scope be the issue?
>>>>
>>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Using the following dependency for Spark 3 in POM file (My Scala
>>>>> version is 2.12.14)
>>>>>
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>>     <scope>provided</scope>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> The code throws error at this line :
>>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>>> The same code is working with Spark 2.4.0 and the following dependency
>>>>>
>>>>>
>>>>> <dependency>
>>>>>     <groupId>org.elasticsearch</groupId>
>>>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>>     <version>7.12.0</version>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> What’s the version of the ES connector you are using?
>>>>>>
>>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>>>> index.
>>>>>>> As we're upgrading to Spark 3.3.0, it throwing out error
>>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>>> at
>>>>>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>>
>>>>>>> Looking at a few responses from Stackoverflow
>>>>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>>>>> supported by Elasticsearch-hadoop.
>>>>>>>
>>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>>> in Spark 3?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Regards
>>>>>>> Dipayan
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>
>>>


Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, is there a change in the way
we pass the format to the connector from Spark 2 to 3?


On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson 
wrote:

> I am pretty certain you need to change the write.format from “es” to
> “org.elasticsearch.spark.sql”
>
> Sent from my iPhone
>
> On 8 Sep 2023, at 03:10, Dipayan Dev  wrote:
>
> 
>
> ++ Dev
>
> On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev 
> wrote:
>
>> Hi,
>>
>> Can you please elaborate your last response? I don’t have any external
>> dependencies added, and just updated the Spark version as mentioned below.
>>
>> Can someone help me with this?
>>
>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>
>>> could the provided scope be the issue?
>>>
>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>> wrote:
>>>
>>>> Using the following dependency for Spark 3 in POM file (My Scala
>>>> version is 2.12.14)
>>>>
>>>>
>>>> <dependency>
>>>>     <groupId>org.elasticsearch</groupId>
>>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>     <version>7.12.0</version>
>>>>     <scope>provided</scope>
>>>> </dependency>
>>>>
>>>>
>>>> The code throws error at this line :
>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>> The same code is working with Spark 2.4.0 and the following dependency
>>>>
>>>>
>>>> <dependency>
>>>>     <groupId>org.elasticsearch</groupId>
>>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>     <version>7.12.0</version>
>>>> </dependency>
>>>>
>>>>
>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>> wrote:
>>>>
>>>>> What’s the version of the ES connector you are using?
>>>>>
>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>>> index.
>>>>>> As we're upgrading to Spark 3.3.0, it throwing out error
>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>> at
>>>>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>
>>>>>> Looking at a few responses from Stackoverflow
>>>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>>>> supported by Elasticsearch-hadoop.
>>>>>>
>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>> in Spark 3?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Regards
>>>>>> Dipayan
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>
>>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi Sean,

Removed the provided scope, but still the same issue.


<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
    <version>7.12.1</version>
</dependency>



On Fri, Sep 8, 2023 at 4:41 AM Sean Owen  wrote:

> By marking it provided, you are not including this dependency with your
> app. If it is also not somehow already provided by your spark cluster (this
> is what it means), then yeah this is not anywhere on the class path at
> runtime. Remove the provided scope.
>
> On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:
>
>> Hi,
>>
>> Can you please elaborate your last response? I don’t have any external
>> dependencies added, and just updated the Spark version as mentioned below.
>>
>> Can someone help me with this?
>>
>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>
>>> could the provided scope be the issue?
>>>
>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>> wrote:
>>>
>>>> Using the following dependency for Spark 3 in POM file (My Scala
>>>> version is 2.12.14)
>>>>
>>>>
>>>> <dependency>
>>>>     <groupId>org.elasticsearch</groupId>
>>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>     <version>7.12.0</version>
>>>>     <scope>provided</scope>
>>>> </dependency>
>>>>
>>>>
>>>> The code throws error at this line :
>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>> The same code is working with Spark 2.4.0 and the following dependency
>>>>
>>>>
>>>> <dependency>
>>>>     <groupId>org.elasticsearch</groupId>
>>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>     <version>7.12.0</version>
>>>> </dependency>
>>>>
>>>>
>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>> wrote:
>>>>
>>>>> What’s the version of the ES connector you are using?
>>>>>
>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>>> index.
>>>>>> As we're upgrading to Spark 3.3.0, it throwing out error
>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>> at
>>>>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>
>>>>>> Looking at a few responses from Stackoverflow
>>>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>>>> supported by Elasticsearch-hadoop.
>>>>>>
>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>> in Spark 3?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Regards
>>>>>> Dipayan
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>
>>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi,

Can you please elaborate on your last response? I don’t have any external
dependencies added; I just updated the Spark version as mentioned below.

Can someone help me with this?

On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:

> could the provided scope be the issue?
>
> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
> wrote:
>
>> Using the following dependency for Spark 3 in POM file (My Scala version
>> is 2.12.14)
>>
>>
>> <dependency>
>>     <groupId>org.elasticsearch</groupId>
>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>     <version>7.12.0</version>
>>     <scope>provided</scope>
>> </dependency>
>>
>>
>> The code throws error at this line :
>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>> The same code is working with Spark 2.4.0 and the following dependency
>>
>>
>> <dependency>
>>     <groupId>org.elasticsearch</groupId>
>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>     <version>7.12.0</version>
>> </dependency>
>>
>>
>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>> wrote:
>>
>>> What’s the version of the ES connector you are using?
>>>
>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>> index.
>>>> As we're upgrading to Spark 3.3.0, it throwing out error
>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>
>>>> Looking at a few responses from Stackoverflow
>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>> supported by Elasticsearch-hadoop.
>>>>
>>>> Does anyone have experience with this? Or faced/resolved this issue in
>>>> Spark 3?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Regards
>>>> Dipayan
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev

On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev  wrote:

> Hi,
>
> Can you please elaborate your last response? I don’t have any external
> dependencies added, and just updated the Spark version as mentioned below.
>
> Can someone help me with this?
>
> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>
>> could the provided scope be the issue?
>>
>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>> wrote:
>>
>>> Using the following dependency for Spark 3 in POM file (My Scala version
>>> is 2.12.14)
>>>
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>     <version>7.12.0</version>
>>>     <scope>provided</scope>
>>> </dependency>
>>>
>>>
>>> The code throws error at this line :
>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>> The same code is working with Spark 2.4.0 and the following dependency
>>>
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>     <version>7.12.0</version>
>>> </dependency>
>>>
>>>
>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>> wrote:
>>>
>>>> What’s the version of the ES connector you are using?
>>>>
>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>>>> index.
>>>>> As we're upgrading to Spark 3.3.0, it throwing out error
>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>
>>>>> Looking at a few responses from Stackoverflow
>>>>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>>>>> supported by Elasticsearch-hadoop.
>>>>>
>>>>> Does anyone have experience with this? Or faced/resolved this issue in
>>>>> Spark 3?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Regards
>>>>> Dipayan
>>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>
>


Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Dipayan Dev
Using the following dependency for Spark 3 in my POM file (my Scala version is
2.12.14):






<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_2.12</artifactId>
    <version>7.12.0</version>
    <scope>provided</scope>
</dependency>


The code throws an error at this line:
df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
The same code is working with Spark 2.4.0 and the following dependency





<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.12</artifactId>
    <version>7.12.0</version>
</dependency>


On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau  wrote:

> What’s the version of the ES connector you are using?
>
> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
> wrote:
>
>> Hi All,
>>
>> We're using Spark 2.4.x to write dataframe into the Elasticsearch index.
>> As we're upgrading to Spark 3.3.0, it throwing out error
>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>
>> Looking at a few responses from Stackoverflow
>> <https://stackoverflow.com/a/66452149>. it seems this is not yet
>> supported by Elasticsearch-hadoop.
>>
>> Does anyone have experience with this? Or faced/resolved this issue in
>> Spark 3?
>>
>> Thanks in advance!
>>
>> Regards
>> Dipayan
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Elasticsearch support for Spark 3.x

2023-08-26 Thread Dipayan Dev
Hi All,

We're using Spark 2.4.x to write a dataframe into the Elasticsearch index.
As we're upgrading to Spark 3.3.0, it is throwing this error:
Caused by: java.lang.ClassNotFoundException: es.DefaultSource
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)

Looking at a few responses from Stackoverflow
<https://stackoverflow.com/a/66452149>, it seems this is not yet supported
by Elasticsearch-hadoop.

Does anyone have experience with this? Or faced/resolved this issue in
Spark 3?

Thanks in advance!

Regards
Dipayan


Unsubscribe

2023-08-25 Thread Dipayan Dev



Unsubscribe

2023-08-23 Thread Dipayan Dev
Unsubscribe


Unsubscribe

2023-08-21 Thread Dipayan Dev
-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Spark doesn’t create SUCCESS file when external path is passed

2023-08-21 Thread Dipayan Dev
Hi Team,

I need some help. Could someone try to replicate the issue at their end, or
let me know if I am doing anything wrong?

https://issues.apache.org/jira/browse/SPARK-44884


We have recently upgraded to Spark 3.3.0 in our Production Dataproc.
We have a lot of downstream applications that rely on the SUCCESS file.

Please let me know if this is a bug or if I need to add any additional
configuration to fix this in Spark 3.3.0.

Happy to contribute if you can suggest how.
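
For anyone trying to reproduce this, a minimal sketch (bucket, database and
table names are hypothetical; the Hadoop property shown is the generic switch
that controls the _SUCCESS marker, so checking it first rules out a plain
configuration difference):

// Assumes a Spark 3.3.0 session on Dataproc and a writable GCS path.
// "mapreduce.fileoutputcommitter.marksuccessfuljobs" defaults to true; if it is
// left at true and the marker is still missing only when an external path is
// passed, that matches the behaviour described in SPARK-44884.
import spark.implicits._
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "true")

Seq(("a", 1)).toDF("name", "num")
  .write
  .mode("overwrite")
  .option("path", "gs://some_bucket/some_table/") // hypothetical external path
  .format("orc")
  .saveAsTable("some_db.some_table")              // hypothetical table name
// After the job completes, list gs://some_bucket/some_table/ and check for _SUCCESS.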
-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich,

It's not specific to ORC; it looks like a bug in the Hadoop Common project.
I have raised a bug and am happy to contribute a fix for the Hadoop 3.3.0
version. Do you know if anyone could help me set the Assignee?
https://issues.apache.org/jira/browse/HADOOP-18856


With Best Regards,

Dipayan Dev



On Sun, Aug 20, 2023 at 2:47 AM Mich Talebzadeh 
wrote:

> Under gs directory
>
> "gs://test_dd1/abc/"
>
> What do you see?
>
> gsutil ls gs://test_dd1/abc
>
> and the same
>
> gs://test_dd1/
>
> gsutil ls gs://test_dd1
>
> I suspect you need a folder for multiple ORC slices!
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 19 Aug 2023 at 21:36, Dipayan Dev  wrote:
>
>> Hi Everyone,
>>
>> I'm stuck with one problem, where I need to provide a custom GCS location
>> for the Hive table from Spark. The code fails while doing an *'insert
>> into'* whenever my Hive table has a flat GCS location like
>> gs://bucket_name, but works for nested locations like
>> gs://bucket_name/blob_name.
>>
>> Is anyone aware if it's an issue from Spark side or any config I need to
>> pass for it?
>>
>> *The issue is happening in 2.x and 3.x both.*
>>
>> Config using:
>>
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
>> spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
>> spark.conf.set("hive.exec.dynamic.partition", true)
>>
>>
>> *Case 1 : FAILS*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>> val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", "gs://test_dd1/")
>>   .mode(SaveMode.Overwrite)
>>   .partitionBy(partKey: _*)
>>   .format("orc")
>>   .saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")
>>
>>
>>
>>
>>
>> java.lang.NullPointerException
>>   at org.apache.hadoop.fs.Path.<init>(Path.java:141)
>>   at org.apache.hadoop.fs.Path.<init>(Path.java:120)
>>   at org.apache.hadoop.fs.Path.suffix(Path.java:441)
>>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)
>>
>>
>> *Case 2: Succeeds  *
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>> val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", "gs://test_dd1/abc/")
>>   .mode(SaveMode.Overwrite)
>>   .partitionBy(partKey: _*)
>>   .format("orc")
>>   .saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>>
>> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>


Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone,

I'm stuck with one problem, where I need to provide a custom GCS location
for the Hive table from Spark. The code fails while doing an *'insert into'*
whenever my Hive table has a flat GCS location like gs://bucket_name, but
works for nested locations like gs://bucket_name/blob_name.

Is anyone aware if it's an issue from Spark side or any config I need to
pass for it?

*The issue is happening in 2.x and 3.x both.*

Config using:

spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition", true)


*Case 1 : FAILS*

val DF = Seq(("test1", 123)).toDF("name", "num")
val partKey = List("num").map(x => x)

DF.write.option("path", "gs://test_dd1/")
  .mode(SaveMode.Overwrite)
  .partitionBy(partKey: _*)
  .format("orc")
  .saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")

val DF1 = Seq(("test2", 125)).toDF("name", "num")
DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")





java.lang.NullPointerException
  at org.apache.hadoop.fs.Path.<init>(Path.java:141)
  at org.apache.hadoop.fs.Path.<init>(Path.java:120)
  at org.apache.hadoop.fs.Path.suffix(Path.java:441)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)


*Case 2: Succeeds  *

val DF = Seq(("test1", 123)).toDF("name", "num")
val partKey = List("num").map(x => x)

DF.write.option("path", "gs://test_dd1/abc/")
  .mode(SaveMode.Overwrite)
  .partitionBy(partKey: _*)
  .format("orc")
  .saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")

val DF1 = Seq(("test2", 125)).toDF("name", "num")

DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")
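
One possible reading of the stack trace, with a small sketch that can be run
outside Spark (this is an interpretation, not a confirmed diagnosis):
Path.suffix() builds a new Path from getParent(), and a bucket-root path such
as gs://test_dd1/ has no parent, so the Path constructor throws the
NullPointerException seen in getCustomPartitionLocations, while the nested
location in Case 2 has a parent and works.

import org.apache.hadoop.fs.Path

val nested = new Path("gs://test_dd1/abc")
println(nested.getParent)    // non-null parent, so suffix() works
println(nested.suffix("-x"))

val root = new Path("gs://test_dd1/")
println(root.getParent)      // null: a bucket-root path has no parent
root.suffix("-x")            // NullPointerException from the Path constructor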


With Best Regards,

Dipayan Dev


[no subject]

2023-08-18 Thread Dipayan Dev
Unsubscribe --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
I used the following config and the performance has improved a lot:
.config("spark.sql.orc.splits.include.file.footer", true)

I am not able to find the default value of this config anywhere. Can
someone please share its default value? Is it false?
Also, just curious: what does this config actually do?


With Best Regards,

Dipayan Dev


On Wed, Jul 19, 2023 at 2:25 PM Dipayan Dev  wrote:

> Thank you. Will try out these options.
>
>
>
> With Best Regards,
>
>
>
> On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh 
> wrote:
>
>> Sounds like if the mv command is inherently slow, there is little that
>> can be done.
>>
>> The only suggestion I can make is to create the staging table as
>> compressed to reduce its size and hence mv? Is that feasible? Also the
>> managed table can be created with SNAPPY compression
>>
>> STORED AS ORC
>> TBLPROPERTIES (
>> "orc.create.index"="true",
>> "orc.bloom.filter.columns"="KEY",
>> "orc.bloom.filter.fpp"="0.05",
>> "*orc.compress"="SNAPPY",*
>> "orc.stripe.size"="16777216",
>> "orc.row.index.stride"="1" )
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 19 Jul 2023 at 02:35, Dipayan Dev 
>> wrote:
>>
>>> Hi Mich,
>>> Ok, my use-case is a bit different.
>>> I have a Hive table partitioned by dates and need to do dynamic
>>> partition updates(insert overwrite) daily for the last 30 days
>>> (partitions).
>>> The ETL inside the staging directories is completed in hardly 5minutes,
>>> but then renaming takes a lot of time as it deletes and copies the
>>> partitions.
>>> My issue is something related to this -
>>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>>
>>>
>>> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Spark has no role in creating that hive staging directory. That
>>>> directory belongs to Hive and Spark simply does ETL there, loading to the
>>>> Hive managed table in your case which ends up in the staging directory
>>>>
>>>> I suggest that you review your design and use an external hive table
>>>> with explicit location on GCS with the date the data loaded. Then push that
>>>> data into the Hive managed table for today's partition.
>>>>
>>>> This was written in bash for Hive HQL itself but you can easily adapt
>>>> it for Spark
>>>>
>>>> TODAY="`date +%Y-%m-%d`"
>>>> DateStamp="${TODAY}"
>>>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>>>  KEY string
>>>>, TICKER string
>>>>, TIMECREATED string
>>>>, PRICE float
>>>> )
>>>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>>>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>>> STORED AS TEXTFILE
>>>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>>>
>>>> --Keep track of daily ingestion into the external table.
>>>> ALTER TABLE EXTERNALMARKETDATA set location
>>>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>>>
>>>> -- create your managed table here and populate it from the Hive
>>>> external table
>>>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>>>  KEY string
>>>>, TICKER string
>>>>, TIMECREATED string
>>>>, PRICE float
>>>>, op_type int
>>>>, op_time timestamp
>>>>

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
Thank you. Will try out these options.



With Best Regards,



On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh 
wrote:

> Sounds like if the mv command is inherently slow, there is little that can
> be done.
>
> The only suggestion I can make is to create the staging table as
> compressed to reduce its size and hence mv? Is that feasible? Also the
> managed table can be created with SNAPPY compression
>
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "*orc.compress"="SNAPPY",*
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="1" )
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 19 Jul 2023 at 02:35, Dipayan Dev  wrote:
>
>> Hi Mich,
>> Ok, my use-case is a bit different.
>> I have a Hive table partitioned by dates and need to do dynamic partition
>> updates(insert overwrite) daily for the last 30 days (partitions).
>> The ETL inside the staging directories is completed in hardly 5minutes,
>> but then renaming takes a lot of time as it deletes and copies the
>> partitions.
>> My issue is something related to this -
>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>>
>>
>> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Spark has no role in creating that hive staging directory. That
>>> directory belongs to Hive and Spark simply does ETL there, loading to the
>>> Hive managed table in your case which ends up in the staging directory
>>>
>>> I suggest that you review your design and use an external hive table
>>> with explicit location on GCS with the date the data loaded. Then push that
>>> data into the Hive managed table for today's partition.
>>>
>>> This was written in bash for Hive HQL itself but you can easily adapt it
>>> for Spark
>>>
>>> TODAY="`date +%Y-%m-%d`"
>>> DateStamp="${TODAY}"
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>>  KEY string
>>>, TICKER string
>>>, TIMECREATED string
>>>, PRICE float
>>> )
>>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> STORED AS TEXTFILE
>>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>>
>>> --Keep track of daily ingestion into the external table.
>>> ALTER TABLE EXTERNALMARKETDATA set location
>>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>>
>>> -- create your managed table here and populate it from the Hive external
>>> table
>>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>>  KEY string
>>>, TICKER string
>>>, TIMECREATED string
>>>, PRICE float
>>>, op_type int
>>>, op_time timestamp
>>> )
>>> PARTITIONED BY (DateStamp  string)
>>> STORED AS ORC
>>> TBLPROPERTIES (
>>> "orc.create.index"="true",
>>> "orc.bloom.filter.columns"="KEY",
>>> "orc.bloom.filter.fpp"="0.05",
>>> "orc.compress"="SNAPPY",
>>> "orc.stripe.size"="16777216",
>>> "orc.row.index.stride"="1" )
>>> ;
>>>
>>> --Populate target table
>>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>>> SELECT
>>>   KEY
>>> , TICKER
>>> , TIMECREATED
>>> , PRICE
>>> , 1
>>> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>>> FROM EXTERNALMARK

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
Hi Mich,
Ok, my use-case is a bit different.
I have a Hive table partitioned by date and need to do dynamic partition
updates (insert overwrite) daily for the last 30 days (partitions).
The ETL inside the staging directories completes in barely 5 minutes, but
the renaming then takes a lot of time, as it deletes and copies the partitions.
My issue is something related to this -
https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
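
For reference, a minimal sketch of the overwrite pattern being described
(DataFrame and table names are placeholders; this only shows the
dynamic-overwrite settings, it does not address the rename cost itself):

// Assumes Spark 2.4+/3.x and a date-partitioned Hive table. With
// partitionOverwriteMode=dynamic, mode("overwrite") replaces only the
// partitions present in the incoming data (e.g. the last 30 days) instead of
// truncating the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

last30DaysDf                                    // hypothetical DataFrame with the partition column
  .write
  .mode("overwrite")
  .format("orc")
  .insertInto("some_db.some_partitioned_table") // hypothetical table name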



With Best Regards,

Dipayan Dev



On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh 
wrote:

> Spark has no role in creating that hive staging directory. That directory
> belongs to Hive and Spark simply does ETL there, loading to the Hive
> managed table in your case which ends up in the staging directory
>
> I suggest that you review your design and use an external hive table with
> explicit location on GCS with the date the data loaded. Then push that data
> into the Hive managed table for today's partition.
>
> This was written in bash for Hive HQL itself but you can easily adapt it
> for Spark
>
> TODAY="`date +%Y-%m-%d`"
> DateStamp="${TODAY}"
> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>  KEY string
>, TICKER string
>, TIMECREATED string
>, PRICE float
> )
> COMMENT 'From prices using Kafka delivered by Flume location by day'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>
> --Keep track of daily ingestion into the external table.
> ALTER TABLE EXTERNALMARKETDATA set location
> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>
> -- create your managed table here and populate it from the Hive external
> table
> CREATE TABLE IF NOT EXISTS MARKETDATA (
>  KEY string
>, TICKER string
>, TIMECREATED string
>, PRICE float
>, op_type int
>, op_time timestamp
> )
> PARTITIONED BY (DateStamp  string)
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "orc.compress"="SNAPPY",
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="1" )
> ;
>
> --Populate target table
> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
> SELECT
>   KEY
> , TICKER
> , TIMECREATED
> , PRICE
> , 1
> , CAST(from_unixtime(unix_timestamp()) AS timestamp)
> FROM EXTERNALMARKETDATA;
>
> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev  wrote:
>
>> It does help performance but not significantly.
>>
>> I am just wondering, once Spark creates that staging directory along with
>> the SUCCESS file, can we just do a gsutil rsync command and move these
>> files to original directory? Anyone tried this approach or foresee any
>> concern?
>>
>>
>>
>> On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev 
>> wrote:
>>
>>> Thanks Jay, is there any suggestion how much I can increase those
>>> parameters?
>>>
>>> On Mon, 17 Jul 2023 at 8:25 PM, Jay 
>>> wrote:
>>>
>>>> Fileoutputcommitter v2 is supported in GCS but the rename is a metadata
>>>> copy and delete operation in GCS and therefore if there are many number of
>>>> files it will take a long time to perform this step. One workaround will be
>>>> to create smaller number of larger files if that is possible from Spark and
>>>> if this is not possible then those configurations allow for configuring the
>>>> threadpool which does the metadata copy.
>>>>
>>>> You can go through this table
>>>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>>>> to understand GCS performance 

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance but not significantly.

I am just wondering: once Spark creates that staging directory along with
the SUCCESS file, can we just do a gsutil rsync command and move these
files to the original directory? Has anyone tried this approach, or do you
foresee any concerns?



On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev  wrote:

> Thanks Jay, is there any suggestion how much I can increase those
> parameters?
>
> On Mon, 17 Jul 2023 at 8:25 PM, Jay  wrote:
>
>> Fileoutputcommitter v2 is supported in GCS but the rename is a metadata
>> copy and delete operation in GCS and therefore if there are many number of
>> files it will take a long time to perform this step. One workaround will be
>> to create smaller number of larger files if that is possible from Spark and
>> if this is not possible then those configurations allow for configuring the
>> threadpool which does the metadata copy.
>>
>> You can go through this table
>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>> to understand GCS performance implications.
>>
>>
>>
>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh 
>> wrote:
>>
>>> You said this Hive table was a managed table partitioned by date
>>> -->${TODAY}
>>>
>>> How  do you define your Hive managed table?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev 
>>> wrote:
>>>
>>>> It does support- It doesn’t error out for me atleast. But it took
>>>> around 4 hours to finish the job.
>>>>
>>>> Interestingly, it took only 10 minutes to write the output in the
>>>> staging directory and rest of the time it took to rename the objects. Thats
>>>> the concern.
>>>>
>>>> Looks like a known issue as spark behaves with GCS but not getting any
>>>> workaround for this.
>>>>
>>>>
>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park 
>>>> wrote:
>>>>
>>>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>>>>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>>>>
>>>>>
>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
>>>>> wrote:
>>>>>
>>>>>> Thanks Jay,
>>>>>>
>>>>>> I will try that option.
>>>>>>
>>>>>> Any insight on the file committer algorithms?
>>>>>>
>>>>>> I tried v2 algorithm but its not enhancing the runtime. What’s the
>>>>>> best practice in Dataproc for dynamic updates in Spark.
>>>>>>
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay 
>>>>>> wrote:
>>>>>>
>>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>>> fs.gs.max.requests.per.batch.
>>>>>>>
>>>>>>> The definitions for these flags are available here -
>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions . I have a
>>>>>>>> managed Hive table on top of this.
>>>>>>>> [image: image.png]
>>>>>>>> When I do a dynamic partition update of Spark, it creates the new
>>>>>>>> file in a Staging area as shown here.
>>>>>>>>

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay. Is there any suggestion on how much I can increase those
parameters?
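
(For context, a minimal sketch of how those connector flags can be set from a
Spark session; the numbers are placeholders rather than recommendations, and
the defaults and safe ranges depend on the GCS connector version documented in
the references Jay shared.)

// Illustrative values only: consult the hadoop-connectors CONFIGURATION.md
// reference quoted earlier in this thread before tuning.
spark.sparkContext.hadoopConfiguration.set("fs.gs.batch.threads", "32")
spark.sparkContext.hadoopConfiguration.set("fs.gs.max.requests.per.batch", "30")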

On Mon, 17 Jul 2023 at 8:25 PM, Jay  wrote:

> Fileoutputcommitter v2 is supported in GCS but the rename is a metadata
> copy and delete operation in GCS and therefore if there are many number of
> files it will take a long time to perform this step. One workaround will be
> to create smaller number of larger files if that is possible from Spark and
> if this is not possible then those configurations allow for configuring the
> threadpool which does the metadata copy.
>
> You can go through this table
> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
> to understand GCS performance implications.
>
>
>
> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh 
> wrote:
>
>> You said this Hive table was a managed table partitioned by date
>> -->${TODAY}
>>
>> How  do you define your Hive managed table?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev 
>> wrote:
>>
>>> It does support- It doesn’t error out for me atleast. But it took around
>>> 4 hours to finish the job.
>>>
>>> Interestingly, it took only 10 minutes to write the output in the
>>> staging directory and rest of the time it took to rename the objects. Thats
>>> the concern.
>>>
>>> Looks like a known issue as spark behaves with GCS but not getting any
>>> workaround for this.
>>>
>>>
>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park 
>>> wrote:
>>>
>>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>>>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>>>
>>>>
>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Thanks Jay,
>>>>>
>>>>> I will try that option.
>>>>>
>>>>> Any insight on the file committer algorithms?
>>>>>
>>>>> I tried v2 algorithm but its not enhancing the runtime. What’s the
>>>>> best practice in Dataproc for dynamic updates in Spark.
>>>>>
>>>>>
>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay 
>>>>> wrote:
>>>>>
>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>> fs.gs.max.requests.per.batch.
>>>>>>
>>>>>> The definitions for these flags are available here -
>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>>>>>> wrote:
>>>>>>
>>>>>>> No, I am using Spark 2.4 to update the GCS partitions . I have a
>>>>>>> managed Hive table on top of this.
>>>>>>> [image: image.png]
>>>>>>> When I do a dynamic partition update of Spark, it creates the new
>>>>>>> file in a Staging area as shown here.
>>>>>>> But the GCS blob renaming takes a lot of time. I have a partition
>>>>>>> based on dates and I need to update around 3 years of data. It usually
>>>>>>> takes 3 hours to finish the process. Anyway to speed up this?
>>>>>>> With Best Regards,
>>>>>>>
>>>>>>> Dipayan Dev
>>>>>>>
>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So you are using GCP and your Hive is installed on Dataproc which
>>>>>>>> happens to run your Sp

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does support it; it doesn’t error out for me at least. But it took around 4
hours to finish the job.

Interestingly, it took only 10 minutes to write the output to the staging
directory, and the rest of the time went into renaming the objects. That's the
concern.

It looks like a known issue with how Spark behaves with GCS, but I am not
finding any workaround for this.


On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park  wrote:

> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
> supported on GCS? IIRC it wasn't, but you could check with GCP support
>
>
> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
> wrote:
>
>> Thanks Jay,
>>
>> I will try that option.
>>
>> Any insight on the file committer algorithms?
>>
>> I tried v2 algorithm but its not enhancing the runtime. What’s the best
>> practice in Dataproc for dynamic updates in Spark.
>>
>>
>> On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:
>>
>>> You can try increasing fs.gs.batch.threads and
>>> fs.gs.max.requests.per.batch.
>>>
>>> The definitions for these flags are available here -
>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>>
>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>>> wrote:
>>>
>>>> No, I am using Spark 2.4 to update the GCS partitions . I have a
>>>> managed Hive table on top of this.
>>>> [image: image.png]
>>>> When I do a dynamic partition update of Spark, it creates the new file
>>>> in a Staging area as shown here.
>>>> But the GCS blob renaming takes a lot of time. I have a partition based
>>>> on dates and I need to update around 3 years of data. It usually takes 3
>>>> hours to finish the process. Anyway to speed up this?
>>>> With Best Regards,
>>>>
>>>> Dipayan Dev
>>>>
>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> So you are using GCP and your Hive is installed on Dataproc which
>>>>> happens to run your Spark as well. Is that correct?
>>>>>
>>>>> What version of Hive are you using?
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Of late, I have encountered the issue where I have to overwrite a lot
>>>>>> of partitions of the Hive table through Spark. It looks like writing to
>>>>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>>>>> time goes in moving the ORC files from staging directory to the final
>>>>>> partitioned directory structure.
>>>>>>
>>>>>> I got some reference where it's mentioned to use this config during
>>>>>> the Spark write.
>>>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>>
>>>>>> However, it's also mentioned it's not safe as partial job failure
>>>>>> might cause data loss.
>>>>>>
>>>>>> Is there any suggestion on the pros and cons of using this version?
>>>>>> Or any ongoing Spark feature development to address this issue?
>>>>>>
>>>>>>
>>>>>>
>>>>>> With Best Regards,
>>>>>>
>>>>>> Dipayan Dev
>>>>>>
>>>>> --
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>> Author of *Deep Learning with Hadoop
>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>> M.Tech (AI), IISc, Bangalore
>>
> --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay,

I will try that option.

Any insight on the file committer algorithms?

I tried the v2 algorithm but it's not improving the runtime. What’s the best
practice in Dataproc for dynamic updates in Spark?


On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:

> You can try increasing fs.gs.batch.threads and
> fs.gs.max.requests.per.batch.
>
> The definitions for these flags are available here -
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>
> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev  wrote:
>
>> No, I am using Spark 2.4 to update the GCS partitions . I have a managed
>> Hive table on top of this.
>> [image: image.png]
>> When I do a dynamic partition update of Spark, it creates the new file in
>> a Staging area as shown here.
>> But the GCS blob renaming takes a lot of time. I have a partition based
>> on dates and I need to update around 3 years of data. It usually takes 3
>> hours to finish the process. Anyway to speed up this?
>> With Best Regards,
>>
>> Dipayan Dev
>>
>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> So you are using GCP and your Hive is installed on Dataproc which
>>> happens to run your Spark as well. Is that correct?
>>>
>>> What version of Hive are you using?
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Of late, I have encountered the issue where I have to overwrite a lot
>>>> of partitions of the Hive table through Spark. It looks like writing to
>>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>>> time goes in moving the ORC files from staging directory to the final
>>>> partitioned directory structure.
>>>>
>>>> I got some reference where it's mentioned to use this config during the
>>>> Spark write.
>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>
>>>> However, it's also mentioned it's not safe as partial job failure might
>>>> cause data loss.
>>>>
>>>> Is there any suggestion on the pros and cons of using this version? Or
>>>> any ongoing Spark feature development to address this issue?
>>>>
>>>>
>>>>
>>>> With Best Regards,
>>>>
>>>> Dipayan Dev
>>>>
>>> --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed
Hive table on top of this.
[image: image.png]
When I do a dynamic partition update from Spark, it creates the new files in a
staging area, as shown here.
But the GCS blob renaming takes a lot of time. I have partitions based on
dates and need to update around 3 years of data. It usually takes 3 hours
to finish the process. Any way to speed this up?
With Best Regards,

Dipayan Dev

On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh 
wrote:

> So you are using GCP and your Hive is installed on Dataproc which happens
> to run your Spark as well. Is that correct?
>
> What version of Hive are you using?
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev  wrote:
>
>> Hi All,
>>
>> Of late, I have encountered the issue where I have to overwrite a lot of
>> partitions of the Hive table through Spark. It looks like writing to
>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>> time goes in moving the ORC files from staging directory to the final
>> partitioned directory structure.
>>
>> I got some reference where it's mentioned to use this config during the
>> Spark write.
>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>
>> However, it's also mentioned it's not safe as partial job failure might
>> cause data loss.
>>
>> Is there any suggestion on the pros and cons of using this version? Or
>> any ongoing Spark feature development to address this issue?
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>


Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All,

Of late, I have encountered an issue where I have to overwrite a lot of
partitions of a Hive table through Spark. It looks like writing to the
hive_staging_directory takes 25% of the total time, whereas 75% or more of the
time goes into moving the ORC files from the staging directory to the final
partitioned directory structure.

I found some references mentioning the use of this config during the
Spark write:
*mapreduce.fileoutputcommitter.algorithm.version = 2*

However, it's also mentioned that it's not safe, as a partial job failure might
cause data loss.

Is there any suggestion on the pros and cons of using this version? Or any
ongoing Spark feature development to address this issue?
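
For completeness, a minimal sketch of how that setting is applied from a Spark
job (the table name is a placeholder; whether v2 is acceptable depends on the
failure-tolerance trade-off mentioned above):

// Assumes a Spark session writing ORC into a partitioned Hive table on GCS.
// The v2 committer moves task output into place at task commit instead of
// renaming everything during job commit, which shortens the final rename phase
// but can leave partial output visible if the job fails part-way through.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

df.write
  .mode("overwrite")
  .format("orc")
  .insertInto("some_db.some_partitioned_table") // hypothetical table name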



With Best Regards,

Dipayan Dev


Contributing to Spark MLLib

2023-07-16 Thread Dipayan Dev
Hi Spark Community,

A very good morning to you.

I have been using Spark for the last few years now, and I am new to the community.

I am very much interested in becoming a contributor.

I am looking to contribute to Spark MLlib. Can anyone please suggest how
to start contributing to a new MLlib feature? Are there any new
features in the pipeline, and what is the best way to explore them?
Looking forward to a little guidance to get started.


Thanks
Dipayan
-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore