Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team,

I am still looking for guidance here. I would really appreciate anything that
points me in the right direction.


On Mon, Jul 17, 2023, 16:14 Varun Shah  wrote:

> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apache/spark project locally for my first
> open-source contribution, by building and creating a package as mentioned in
> the documentation under Running Individual Tests.
> Here are the steps I have followed:
> >> ./build/sbt  # this opens an sbt console
> >> test # to execute all tests
>
> I am getting the following errors and the tests are failing. Even the sbt
> compile / package commands fail with the same errors.
>
>>
>> [info] compiling 19 Java sources to
>> forked/spark/common/network-shuffle/target/scala-2.12/test-classes ...
>> [info] compiling 330 Scala sources and 29 Java sources to
>> forked/spark/core/target/scala-2.12/test-classes ...
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:21:0:
>> There should at least one a single empty line separating groups 3rdParty
>> and spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:32:0:
>> org.json4s.JsonAST.JValue should be in group 3rdParty, not spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:33:0:
>> org.json4s.JsonDSL._ should be in group 3rdParty, not spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:34:0:
>> org.json4s._ should be in group 3rdParty, not spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:35:0:
>> org.json4s.jackson.JsonMethods._ should be in group 3rdParty, not spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:37:0:
>> java.util.Locale should be in group java, not spark.
>> [error]
>> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:38:0:
>> scala.util.control.NonFatal should be in group scala, not spark.
>> [error]
>> forked/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala:226:
>> File line length exceeds 100 characters
>> [error] stack trace is suppressed; run last catalyst /
>> scalaStyleOnCompile for the full output
>> [error] stack trace is suppressed; run last scalaStyleOnTest for the full
>> outpu
>> [error] (catalyst / scalaStyleOnCompile) Failing because of negative
>> scalastyle result
>> [error] (scalaStyleOnTest) Failing because of negative scalastyle result
>>
>
> Can you please let me know if I am doing something wrong?
>
> Regards,
> Varun Shah
>
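
For anyone hitting the same scalastyle failures: the errors above are about
import grouping. Spark's scalastyle configuration expects imports grouped as
java, scala, third-party, and then org.apache.spark, with a blank line
separating the groups. A minimal sketch of the ordering being asked for (an
illustrative subset, not the file's actual import list):

    // Illustrative import ordering for scalastyle's ImportOrderChecker:
    // group "java", then "scala", then "3rdParty", then "spark",
    // separated by single blank lines.
    import java.util.Locale

    import scala.util.control.NonFatal

    import org.json4s._
    import org.json4s.JsonAST.JValue
    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods._

    import org.apache.spark.annotation.Stable

If the checked-out sources are unmodified, errors like these often point to
the imports having been reordered locally (for example by an IDE) rather
than a problem with the build itself.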


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, is there any suggestion on how much I can increase those
parameters?

On Mon, 17 Jul 2023 at 8:25 PM, Jay  wrote:

> FileOutputCommitter v2 is supported on GCS, but a rename in GCS is a
> metadata copy-and-delete operation, so when there are many files this step
> takes a long time. One workaround is to produce a smaller number of larger
> files, if that is possible from Spark; if it is not, those configurations
> let you size the thread pool that performs the metadata copy.
>
> You can go through this table to understand the GCS performance
> implications.
>
>
>
> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh 
> wrote:
>
>> You said this Hive table was a managed table partitioned by date
>> -->${TODAY}
>>
>> How do you define your Hive managed table?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev 
>> wrote:
>>
>>> It does seem to be supported; at least it doesn't error out for me. But
>>> it took around 4 hours to finish the job.
>>>
>>> Interestingly, it took only 10 minutes to write the output to the staging
>>> directory; the rest of the time went into renaming the objects. That's
>>> the concern.
>>>
>>> This looks like a known issue with how Spark behaves on GCS, but I am not
>>> finding any workaround for it.
>>>
>>>
>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park 
>>> wrote:
>>>
 Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
 supported on GCS? IIRC it wasn't, but you could check with GCP support


 On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
 wrote:

> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I tried the v2 algorithm, but it's not improving the runtime. What's the
> best practice in Dataproc for dynamic updates in Spark?
>
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay 
> wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests.per.batch.
>>
>> The definitions for these flags are available here -
>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>
>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>> wrote:
>>
>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>>> Hive table on top of this.
>>> [image: image.png]
>>> When I do a dynamic partition update from Spark, it creates the new files
>>> in a staging area first, as shown here.
>>> But the GCS blob renaming takes a lot of time. The partitions are based
>>> on dates and I need to update around 3 years of data. It usually takes 3
>>> hours to finish the process. Is there any way to speed this up?
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 So you are using GCP and your Hive is installed on Dataproc which
 happens to run your Spark as well. Is that correct?

 What version of Hive are you using?

 HTH


 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility
 for any loss, damage or destruction of data or any other property 
 which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary 
 damages
 arising from such loss, damage or destruction.




 On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
 wrote:

> Hi All,
>
> Of late, I have encountered the issue where I have to overwrite a
> lot of partitions of the Hive table through Spark. It looks like 
> writing to
> 

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
Hi,

Holden Karau has some fantastic videos on her channel which will be quite
helpful.

Thanks
Gourav

On Sun, 16 Jul 2023, 19:15 Brian Huynh,  wrote:

> Good morning Dipayan,
>
> Happy to see another contributor!
>
> Please go through this document for contributors. Please note the
> MLlib-specific contribution guidelines section in particular.
>
> https://spark.apache.org/contributing.html
>
> Since you are looking for something to start with, take a look at this
> Jira query for starter issues.
>
>
> https://issues.apache.org/jira/browse/SPARK-38719?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20%22starter%22%20AND%20status%20%3D%20Open
>
> Cheers,
> Brian
>
> On Sun, Jul 16, 2023 at 8:49 AM Dipayan Dev 
> wrote:
>
>> Hi Spark Community,
>>
>> A very good morning to you.
>>
>> I have been using Spark for the last few years now, and I am new to the
>> community.
>>
>> I am very much interested in becoming a contributor.
>>
>> I am looking to contribute to Spark MLlib. Can anyone please suggest how
>> to start contributing to a new MLlib feature? Are there any new features
>> in the pipeline, and what is the best way to explore them?
>> Looking forward to a little guidance to get started.
>>
>>
>> Thanks
>> Dipayan
>> --
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>> Author of *Deep Learning with Hadoop*
>> M.Tech (AI), IISc, Bangalore
>>
>
>
> --
> From Brian H.
>


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
FileOutputCommitter v2 is supported on GCS, but a rename in GCS is a metadata
copy-and-delete operation, so when there are many files this step takes a
long time. One workaround is to produce a smaller number of larger files, if
that is possible from Spark; if it is not, those configurations let you size
the thread pool that performs the metadata copy.

You can go through this table to understand the GCS performance implications.
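
A minimal sketch of the "fewer, larger files" workaround described above:
shrink the number of output files before the write so that the commit-time
renames touch fewer objects. The bucket paths and the target of 64 files are
illustrative assumptions, not values from this thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("fewer-files").getOrCreate()

    // Hypothetical input; any DataFrame works the same way.
    val df = spark.read.orc("gs://my-bucket/input/")

    df.coalesce(64)       // fewer, larger output files; avoids a full shuffle
      .write
      .mode("overwrite")
      .orc("gs://my-bucket/output/")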



On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh 
wrote:

> You said this Hive table was a managed table partitioned by date
> -->${TODAY}
>
> How do you define your Hive managed table?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev  wrote:
>
>> It does seem to be supported; at least it doesn't error out for me. But
>> it took around 4 hours to finish the job.
>>
>> Interestingly, it took only 10 minutes to write the output to the staging
>> directory; the rest of the time went into renaming the objects. That's
>> the concern.
>>
>> This looks like a known issue with how Spark behaves on GCS, but I am not
>> finding any workaround for it.
>>
>>
>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park 
>> wrote:
>>
>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>>
>>>
>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
>>> wrote:
>>>
 Thanks Jay,

 I will try that option.

 Any insight on the file committer algorithms?

 I tried the v2 algorithm, but it's not improving the runtime. What's the
 best practice in Dataproc for dynamic updates in Spark?


 On Mon, 17 Jul 2023 at 7:05 PM, Jay 
 wrote:

> You can try increasing fs.gs.batch.threads and
> fs.gs.max.requests.per.batch.
>
> The definitions for these flags are available here -
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>
> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
> wrote:
>
>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>> Hive table on top of this.
>> [image: image.png]
>> When I do a dynamic partition update from Spark, it creates the new files
>> in a staging area first, as shown here.
>> But the GCS blob renaming takes a lot of time. The partitions are based
>> on dates and I need to update around 3 years of data. It usually takes 3
>> hours to finish the process. Is there any way to speed this up?
>> With Best Regards,
>>
>> Dipayan Dev
>>
>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> So you are using GCP and your Hive is installed on Dataproc which
>>> happens to run your Spark as well. Is that correct?
>>>
>>> What version of Hive are you using?
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>> for any loss, damage or destruction of data or any other property which 
>>> may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary 
>>> damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>>> wrote:
>>>
 Hi All,

 Of late, I have encountered the issue where I have to overwrite a
 lot of partitions of the Hive table through Spark. It looks like 
 writing to
 hive_staging_directory takes 25% of the total time, whereas 75% or more
 time goes in moving the ORC files from staging directory to the final
 partitioned directory structure.

 I got some reference where it's mentioned to use this config during
 the Spark write.

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
You said this Hive table was a managed table partitioned by date -->${TODAY}

How do you define your Hive managed table?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 17 Jul 2023 at 15:29, Dipayan Dev  wrote:

> It does seem to be supported; at least it doesn't error out for me. But it
> took around 4 hours to finish the job.
>
> Interestingly, it took only 10 minutes to write the output to the staging
> directory; the rest of the time went into renaming the objects. That's the
> concern.
>
> This looks like a known issue with how Spark behaves on GCS, but I am not
> finding any workaround for it.
>
>
> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park  wrote:
>
>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>
>>
>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
>> wrote:
>>
>>> Thanks Jay,
>>>
>>> I will try that option.
>>>
>>> Any insight on the file committer algorithms?
>>>
>>> I tried the v2 algorithm, but it's not improving the runtime. What's the
>>> best practice in Dataproc for dynamic updates in Spark?
>>>
>>>
>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay 
>>> wrote:
>>>
 You can try increasing fs.gs.batch.threads and
 fs.gs.max.requests.per.batch.

 The definitions for these flags are available here -
 https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md

 On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
 wrote:

> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
> Hive table on top of this.
> [image: image.png]
> When I do a dynamic partition update from Spark, it creates the new files
> in a staging area first, as shown here.
> But the GCS blob renaming takes a lot of time. The partitions are based
> on dates and I need to update around 3 years of data. It usually takes 3
> hours to finish the process. Is there any way to speed this up?
> With Best Regards,
>
> Dipayan Dev
>
> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> So you are using GCP and your Hive is installed on Dataproc which
>> happens to run your Spark as well. Is that correct?
>>
>> What version of Hive are you using?
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> Of late, I have encountered an issue where I have to overwrite a lot of
>>> partitions of a Hive table through Spark. It looks like writing to the
>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>> of the time goes into moving the ORC files from the staging directory to
>>> the final partitioned directory structure.
>>>
>>> I found some references suggesting the use of this config during the
>>> Spark write:
>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>
>>> However, it's also mentioned that it is not safe, as a partial job
>>> failure might cause data loss.
>>>
>>> Is there any suggestion on the pros and cons of using this version? Or is
>>> there any ongoing Spark feature development to address this issue?
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>> --
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>> Author of *Deep Learning with Hadoop*
>>> M.Tech (AI), IISc, Bangalore
>>>
>> --
>
>
>
> With Best Regards,
>
> Dipayan Dev
> Author of *Deep Learning with Hadoop*
> M.Tech (AI), IISc, 

Unsubscribe

2023-07-17 Thread mojianan2015
Unsubscribe


Unsubscribe

2023-07-17 Thread Zoran Jeremic
Unsubscribe


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does seem to be supported; at least it doesn't error out for me. But it
took around 4 hours to finish the job.

Interestingly, it took only 10 minutes to write the output to the staging
directory; the rest of the time went into renaming the objects. That's the
concern.

This looks like a known issue with how Spark behaves on GCS, but I am not
finding any workaround for it.


On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park  wrote:

> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
> supported on GCS? IIRC it wasn't, but you could check with GCP support
>
>
> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev 
> wrote:
>
>> Thanks Jay,
>>
>> I will try that option.
>>
>> Any insight on the file committer algorithms?
>>
>> I tried the v2 algorithm, but it's not improving the runtime. What's the
>> best practice in Dataproc for dynamic updates in Spark?
>>
>>
>> On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:
>>
>>> You can try increasing fs.gs.batch.threads and
>>> fs.gs.max.requests.per.batch.
>>>
>>> The definitions for these flags are available here -
>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>>
>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>>> wrote:
>>>
 No, I am using Spark 2.4 to update the GCS partitions. I have a managed
 Hive table on top of this.
 [image: image.png]
 When I do a dynamic partition update from Spark, it creates the new files
 in a staging area first, as shown here.
 But the GCS blob renaming takes a lot of time. The partitions are based
 on dates and I need to update around 3 years of data. It usually takes 3
 hours to finish the process. Is there any way to speed this up?
 With Best Regards,

 Dipayan Dev

 On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> So you are using GCP and your Hive is installed on Dataproc which
> happens to run your Spark as well. Is that correct?
>
> What version of Hive are you using?
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
> wrote:
>
>> Hi All,
>>
>> Of late, I have encountered an issue where I have to overwrite a lot of
>> partitions of a Hive table through Spark. It looks like writing to the
>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>> of the time goes into moving the ORC files from the staging directory to
>> the final partitioned directory structure.
>>
>> I found some references suggesting the use of this config during the
>> Spark write:
>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>
>> However, it's also mentioned that it is not safe, as a partial job failure
>> might cause data loss.
>>
>> Is there any suggestion on the pros and cons of using this version? Or is
>> there any ongoing Spark feature development to address this issue?
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
> --
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>> Author of *Deep Learning with Hadoop*
>> M.Tech (AI), IISc, Bangalore
>>
> --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop*
M.Tech (AI), IISc, Bangalore


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
supported on GCS? IIRC it wasn't, but you could check with GCP support


On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev  wrote:

> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I tried the v2 algorithm, but it's not improving the runtime. What's the
> best practice in Dataproc for dynamic updates in Spark?
>
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests.per.batch.
>>
>> The definitions for these flags are available here -
>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>
>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev 
>> wrote:
>>
>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>>> Hive table on top of this.
>>> [image: image.png]
>>> When I do a dynamic partition update from Spark, it creates the new files
>>> in a staging area first, as shown here.
>>> But the GCS blob renaming takes a lot of time. The partitions are based
>>> on dates and I need to update around 3 years of data. It usually takes 3
>>> hours to finish the process. Is there any way to speed this up?
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 So you are using GCP and your Hive is installed on Dataproc which
 happens to run your Spark as well. Is that correct?

 What version of Hive are you using?

 HTH


 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
 wrote:

> Hi All,
>
> Of late, I have encountered an issue where I have to overwrite a lot of
> partitions of a Hive table through Spark. It looks like writing to the
> hive_staging_directory takes 25% of the total time, whereas 75% or more
> of the time goes into moving the ORC files from the staging directory to
> the final partitioned directory structure.
>
> I found some references suggesting the use of this config during the
> Spark write:
> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>
> However, it's also mentioned that it is not safe, as a partial job failure
> might cause data loss.
>
> Is there any suggestion on the pros and cons of using this version? Or is
> there any ongoing Spark feature development to address this issue?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>
 --
>
>
>
> With Best Regards,
>
> Dipayan Dev
> Author of *Deep Learning with Hadoop*
> M.Tech (AI), IISc, Bangalore
>


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay,

I will try that option.

Any insight on the file committer algorithms?

I tried the v2 algorithm, but it's not improving the runtime. What's the best
practice in Dataproc for dynamic updates in Spark?


On Mon, 17 Jul 2023 at 7:05 PM, Jay  wrote:

> You can try increasing fs.gs.batch.threads and
> fs.gs.max.requests.per.batch.
>
> The definitions for these flags are available here -
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>
> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev  wrote:
>
>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>> Hive table on top of this.
>> [image: image.png]
>> When I do a dynamic partition update from Spark, it creates the new files
>> in a staging area first, as shown here.
>> But the GCS blob renaming takes a lot of time. The partitions are based
>> on dates and I need to update around 3 years of data. It usually takes 3
>> hours to finish the process. Is there any way to speed this up?
>> With Best Regards,
>>
>> Dipayan Dev
>>
>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> So you are using GCP and your Hive is installed on Dataproc which
>>> happens to run your Spark as well. Is that correct?
>>>
>>> What version of Hive are you using?
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>>> wrote:
>>>
 Hi All,

 Of late, I have encountered an issue where I have to overwrite a lot of
 partitions of a Hive table through Spark. It looks like writing to the
 hive_staging_directory takes 25% of the total time, whereas 75% or more
 of the time goes into moving the ORC files from the staging directory to
 the final partitioned directory structure.

 I found some references suggesting the use of this config during the
 Spark write:
 *mapreduce.fileoutputcommitter.algorithm.version = 2*

 However, it's also mentioned that it is not safe, as a partial job failure
 might cause data loss.

 Is there any suggestion on the pros and cons of using this version? Or is
 there any ongoing Spark feature development to address this issue?



 With Best Regards,

 Dipayan Dev

>>> --



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop*
M.Tech (AI), IISc, Bangalore


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.

The definitions for these flags are available here -
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
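
A minimal sketch of how these connector settings could be passed to Spark.
The property names come from this thread; the values are purely illustrative
and would need tuning for the workload (spark.hadoop.* entries are forwarded
to the Hadoop configuration, which is where the GCS connector reads its
settings):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gcs-batch-tuning")
      // Example numbers only; see the CONFIGURATION.md linked above.
      .config("spark.hadoop.fs.gs.batch.threads", "32")
      .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
      .getOrCreate()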

On Mon, 17 Jul 2023 at 14:59, Dipayan Dev  wrote:

> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
> Hive table on top of this.
> [image: image.png]
> When I do a dynamic partition update from Spark, it creates the new files
> in a staging area first, as shown here.
> But the GCS blob renaming takes a lot of time. The partitions are based
> on dates and I need to update around 3 years of data. It usually takes 3
> hours to finish the process. Is there any way to speed this up?
> With Best Regards,
>
> Dipayan Dev
>
> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh 
> wrote:
>
>> So you are using GCP and your Hive is installed on Dataproc which happens
>> to run your Spark as well. Is that correct?
>>
>> What version of Hive are you using?
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> Of late, I have encountered an issue where I have to overwrite a lot of
>>> partitions of a Hive table through Spark. It looks like writing to the
>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>> of the time goes into moving the ORC files from the staging directory to
>>> the final partitioned directory structure.
>>>
>>> I found some references suggesting the use of this config during the
>>> Spark write:
>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>
>>> However, it's also mentioned that it is not safe, as a partial job
>>> failure might cause data loss.
>>>
>>> Is there any suggestion on the pros and cons of using this version? Or is
>>> there any ongoing Spark feature development to address this issue?
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>


Unsubscribe

2023-07-17 Thread Bode, Meikel
Unsubscribe


Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Resending this message with a proper Subject line

Hi Spark Community,

I am trying to set up my forked apache/spark project locally for my first
open-source contribution, by building and creating a package as mentioned in
the documentation under Running Individual Tests.
Here are the steps I have followed:
>> ./build/sbt  # this opens an sbt console
>> test # to execute all tests

I am getting the following errors and the tests are failing. Even the sbt
compile / package commands fail with the same errors.

>
> [info] compiling 19 Java sources to
> forked/spark/common/network-shuffle/target/scala-2.12/test-classes ...
> [info] compiling 330 Scala sources and 29 Java sources to
> forked/spark/core/target/scala-2.12/test-classes ...
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:21:0:
> There should at least one a single empty line separating groups 3rdParty
> and spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:32:0:
> org.json4s.JsonAST.JValue should be in group 3rdParty, not spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:33:0:
> org.json4s.JsonDSL._ should be in group 3rdParty, not spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:34:0:
> org.json4s._ should be in group 3rdParty, not spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:35:0:
> org.json4s.jackson.JsonMethods._ should be in group 3rdParty, not spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:37:0:
> java.util.Locale should be in group java, not spark.
> [error]
> forked/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala:38:0:
> scala.util.control.NonFatal should be in group scala, not spark.
> [error]
> forked/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala:226:
> File line length exceeds 100 characters
> [error] stack trace is suppressed; run last catalyst / scalaStyleOnCompile
> for the full output
> [error] stack trace is suppressed; run last scalaStyleOnTest for the full
> outpu
> [error] (catalyst / scalaStyleOnCompile) Failing because of negative
> scalastyle result
> [error] (scalaStyleOnTest) Failing because of negative scalastyle result
>

Can you please let me know if I am doing something wrong?

Regards,
Varun Shah
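
For reference, a typical way to run a single suite from the sbt console
rather than the whole test set (the module and suite below are only
examples):

>> ./build/sbt                               # note the leading "./"
>> project core                              # switch to the module under test
>> testOnly org.apache.spark.rdd.RDDSuite    # run one suite instead of "test"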


Re: Unsubscribe

2023-07-17 Thread srini subramanian
Unsubscribe

On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel  wrote:

Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed
Hive table on top of this.
[image: image.png]
When I do a dynamic partition update from Spark, it creates the new files
in a staging area first, as shown here.
But the GCS blob renaming takes a lot of time. The partitions are based on
dates and I need to update around 3 years of data. It usually takes 3 hours
to finish the process. Is there any way to speed this up?
With Best Regards,

Dipayan Dev
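
For context, a rough sketch of the dynamic partition overwrite described
above. The table and path names are hypothetical, and on a Hive table the
exact overwrite semantics depend on the table format and Hive settings:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite")
      // Overwrite only the partitions present in the incoming data
      // (available from Spark 2.3 onwards).
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical date-partitioned data to rewrite.
    val df = spark.read.parquet("gs://my-bucket/incoming/")

    df.write
      .mode("overwrite")
      .insertInto("db.events")  // replaces only the touched date partitions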

On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh 
wrote:

> So you are using GCP and your Hive is installed on Dataproc which happens
> to run your Spark as well. Is that correct?
>
> What version of Hive are you using?
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev  wrote:
>
>> Hi All,
>>
>> Of late, I have encountered an issue where I have to overwrite a lot of
>> partitions of a Hive table through Spark. It looks like writing to the
>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>> of the time goes into moving the ORC files from the staging directory to
>> the final partitioned directory structure.
>>
>> I found some references suggesting the use of this config during the
>> Spark write:
>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>
>> However, it's also mentioned that it is not safe, as a partial job failure
>> might cause data loss.
>>
>> Is there any suggestion on the pros and cons of using this version? Or is
>> there any ongoing Spark feature development to address this issue?
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>


Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens
to run your Spark as well. Is that correct?

What version of Hive are you using?

HTH


Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 17 Jul 2023 at 09:16, Dipayan Dev  wrote:

> Hi All,
>
> Of late, I have encountered an issue where I have to overwrite a lot of
> partitions of a Hive table through Spark. It looks like writing to the
> hive_staging_directory takes 25% of the total time, whereas 75% or more
> of the time goes into moving the ORC files from the staging directory to
> the final partitioned directory structure.
>
> I found some references suggesting the use of this config during the
> Spark write:
> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>
> However, it's also mentioned that it is not safe, as a partial job failure
> might cause data loss.
>
> Is there any suggestion on the pros and cons of using this version? Or is
> there any ongoing Spark feature development to address this issue?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>


Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Hi All,

Of late, I have encountered an issue where I have to overwrite a lot of
partitions of a Hive table through Spark. It looks like writing to the
hive_staging_directory takes 25% of the total time, whereas 75% or more of
the time goes into moving the ORC files from the staging directory to the
final partitioned directory structure.

I found some references suggesting the use of this config during the Spark
write:
*mapreduce.fileoutputcommitter.algorithm.version = 2*

However, it's also mentioned that it is not safe, as a partial job failure
might cause data loss.

Is there any suggestion on the pros and cons of using this version? Or is
there any ongoing Spark feature development to address this issue?



With Best Regards,

Dipayan Dev
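
For completeness, a minimal sketch of enabling the v2 committer discussed in
this thread for a single job. The paths are hypothetical, and, as noted
above, v2 trades the safety of the two-phase commit for speed: a failure
mid-commit can leave partial output behind:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("committer-v2-example")
      // spark.hadoop.* prefixed settings land in the Hadoop configuration.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    spark.read.parquet("gs://my-bucket/input/")
      .write
      .mode("overwrite")
      .orc("gs://my-bucket/output/")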