date:20230717

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah

Hi Team, I am still looking for guidance here. I would really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev

Thanks Jay, is there any suggestion for how much I can increase those parameters? On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote: > Fileoutputcommitter v2 is supported in GCS but the rename is a metadata > copy and delete operation in GCS and therefore if there are many number of > files it will take a

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta

Hi, Holden Karau has some fantastic videos on her channel which will be quite helpful. Thanks Gourav On Sun, 16 Jul 2023, 19:15 Brian Huynh, wrote: > Good morning Dipayan, > > Happy to see another contributor! > > Please go through this document for contributors. Please note the >

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay

FileOutputCommitter v2 is supported in GCS, but the rename is a metadata copy-and-delete operation in GCS, so if there are a large number of files it will take a long time to perform this step. One workaround would be to create a smaller number of larger files if that is possible from Spark and
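A rough sketch of the "fewer, larger files" workaround mentioned above: since each output file costs one copy-and-delete at rename time, one could estimate how many files a partition should be written as. The helper name `target_file_count` and the sizes used are hypothetical, chosen only for illustration; the thread does not recommend specific numbers.

```python
def target_file_count(total_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many output files to write for a given data volume.

    Hypothetical helper: rounds up so no file exceeds the target size,
    and always returns at least 1.
    """
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


if __name__ == "__main__":
    # Assumed example: 50 GiB of output, aiming for files of about 512 MiB.
    n = target_file_count(50 * 1024**3, 512 * 1024**2)
    print(n)
```

The resulting count could then be passed to something like `DataFrame.repartition(n)` before the write, assuming the job can afford the extra shuffle.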

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh

You said this Hive table was a managed table partitioned by date -->${TODAY} How do you define your Hive managed table? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom

Unsubscribe

2023-07-17 Thread mojianan2015

Unsubscribe

Unsubscribe

2023-07-17 Thread Zoran Jeremic

Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev

It is supported; at least, it doesn’t error out for me. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output to the staging directory, and the rest of the time went to renaming the objects. That’s the concern. Looks like a known issue as Spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park

Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support. On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote: > Thanks Jay, > > I will try that option. > > Any insight on the file committer algorithms? > > I

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev

Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it’s not improving the runtime. What’s the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increasing

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay

You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark
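A minimal sketch of how the settings named in this thread might be passed to a Spark job. It assumes the Hadoop properties are forwarded through Spark's `spark.hadoop.` prefix on `spark-submit --conf`. The numeric values, the helper name `build_conf_args`, and the script name `job.py` are placeholders for illustration, not tuned recommendations from the thread.

```python
from typing import Dict, List

# Hypothetical starting values; the thread does not suggest specific numbers.
GCS_TUNING: Dict[str, str] = {
    "fs.gs.batch.threads": "30",
    "fs.gs.max.requests.per.batch": "30",
    "mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def build_conf_args(hadoop_props: Dict[str, str]) -> List[str]:
    """Turn Hadoop properties into spark-submit --conf arguments."""
    args: List[str] = []
    for key in sorted(hadoop_props):
        # The spark.hadoop. prefix asks Spark to copy the property
        # into the Hadoop configuration used by the job.
        args += ["--conf", f"spark.hadoop.{key}={hadoop_props[key]}"]
    return args


if __name__ == "__main__":
    # job.py is a placeholder for the actual application.
    print(" ".join(["spark-submit"] + build_conf_args(GCS_TUNING) + ["job.py"]))
```

Keeping the properties in one dictionary makes it easy to vary a single value between runs and compare the rename time.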

Unsubscribe

2023-07-17 Thread Bode, Meikel

Unsubscribe

Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah

Resending this message with a proper Subject line Hi Spark Community, I am trying to set up my forked apache/spark project locally for my first open source contribution, by building and creating a package as mentioned here under Running Individual Tests

Re: Unsubscribe

2023-07-17 Thread srini subramanian

Unsubscribe On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel wrote: Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev

No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. [image: image.png] When I do a dynamic partition update in Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh

So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom

Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev

Hi All, Of late, I have encountered an issue where I have to overwrite a lot of partitions of the Hive table through Spark. It looks like writing to hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes to moving the ORC files from the staging directory to the final
