Hi Team,
I am still looking for guidance here. I'd really appreciate anything that
points me in the right direction.
On Mon, Jul 17, 2023, 16:14 Varun Shah wrote:
> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apache/spark project
Thanks Jay. Is there any suggestion on how much I can increase those
parameters?
On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote:
> FileOutputCommitter v2 is supported in GCS, but the rename is a metadata
> copy-and-delete operation in GCS, so if there are a large number of
> files it will take a
Hi,
Holden Karau has some fantastic videos on her channel which will be quite
helpful.
Thanks
Gourav
On Sun, 16 Jul 2023, 19:15 Brian Huynh wrote:
> Good morning Dipayan,
>
> Happy to see another contributor!
>
> Please go through this document for contributors. Please note the
>
FileOutputCommitter v2 is supported in GCS, but the rename is a metadata
copy-and-delete operation in GCS, so if there are a large number of files
it will take a long time to perform this step. One workaround would be to
create a smaller number of larger files, if that is possible from Spark, and
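For reference, a sketch of how the committer version discussed above is typically set; the property is the standard Hadoop one, but the job script name is hypothetical:

```shell
# Illustrative only: select FileOutputCommitter v2, which commits each
# task's output directly to the destination and skips the serial
# job-level rename pass of v1.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  my_job.py  # hypothetical job script
```

Note that on GCS even v2 still pays one metadata copy-and-delete per file during rename, which is why producing fewer, larger files helps.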
You said this Hive table was a managed table partitioned by date -->${TODAY}
How do you define your Hive managed table?
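As an illustration of the kind of definition being asked about, a managed (non-EXTERNAL) Hive table partitioned by date might look along these lines; the table and column names are hypothetical:

```sql
-- Hypothetical managed Hive table, partitioned by a date string
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```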
HTH
Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
view my Linkedin profile
It does support it; it doesn't error out for me, at least. But it took
around 4 hours to finish the job.
Interestingly, it took only 10 minutes to write the output to the staging
directory; the rest of the time went into renaming the objects. That's the
concern.
Looks like a known issue, as Spark behaves
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
supported on GCS? IIRC it wasn't, but you could check with GCP support
On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote:
> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I
Thanks Jay,
I will try that option.
Any insight on the file committer algorithms?
I tried the v2 algorithm, but it's not improving the runtime. What's the
best practice in Dataproc for dynamic partition updates in Spark?
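For context, the dynamic partition overwrite behaviour discussed in this thread is controlled by a Spark SQL setting (available from Spark 2.3); a minimal sketch, with a hypothetical job script:

```shell
# Illustrative only: overwrite just the partitions present in the data
# being written, instead of truncating the whole table.
spark-submit \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  my_job.py  # hypothetical job script
```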
On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote:
> You can try increasing
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
The definitions for these flags are available here:
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
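A sketch of how those connector flags might be passed, assuming a spark-submit job; the values and script name are illustrative, not recommendations:

```shell
# Illustrative only: raise the GCS connector's parallelism for the
# batched metadata copy/delete calls issued during rename.
spark-submit \
  --conf spark.hadoop.fs.gs.batch.threads=30 \
  --conf spark.hadoop.fs.gs.max.requests.per.batch=30 \
  my_job.py  # hypothetical job script
```

Since these drive batched GCS API calls, keep an eye on GCS rate limits and quotas as you raise them.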
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote:
> No, I am using Spark
Resending this message with a proper Subject line
Hi Spark Community,
I am trying to set up my forked apache/spark project locally for my first
open-source contribution, by building and creating a package as mentioned here
under Running Individual Tests
No, I am using Spark 2.4 to update the GCS partitions. I have a managed
Hive table on top of this.
[screenshot of the staging directory omitted]
When I do a dynamic partition update from Spark, it creates the new files
in a staging area, as shown in the screenshot.
But the GCS blob renaming takes a lot of time. I have a partition based on
So you are using GCP, and your Hive is installed on Dataproc, which happens
to run your Spark as well. Is that correct?
What version of Hive are you using?
HTH
Hi All,
Of late, I have encountered an issue where I have to overwrite a lot of
partitions of a Hive table through Spark. It looks like writing to the
hive_staging_directory takes 25% of the total time, whereas 75% or more of
the time goes into moving the ORC files from the staging directory to the final