This link might help:
https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors
Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
I used the following config and the performance has improved a lot.
.config("spark.sql.orc.splits.include.file.footer", true)
I am not able to find the default value of this config anywhere. Can
someone please share the default value of this config? Is it false?
Also, just curious what this
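For reference, a minimal sketch of setting this option on the session builder (assuming PySpark is available; the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-footer-splits")  # hypothetical app name
    # When enabled, ORC split planning carries the file footer along with the
    # split metadata, so executors avoid re-reading each footer from storage.
    .config("spark.sql.orc.splits.include.file.footer", "true")
    .getOrCreate()
)
```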
Thank you. Will try out these options.
With Best Regards,
On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh wrote:
Sounds like if the mv command is inherently slow, there is little that can
be done.
The only suggestion I can make is to create the staging table as compressed
to reduce its size and hence speed up the mv? Is that feasible? Also, the
managed table can be created with SNAPPY compression
STORED AS ORC
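A sketch of creating such a table through `spark.sql` (table and column names are hypothetical; assumes an existing SparkSession `spark` with Hive support):

```python
# Managed Hive table stored as ORC with SNAPPY compression.
spark.sql("""
    CREATE TABLE IF NOT EXISTS staging_tbl (
        id BIGINT,
        payload STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")
```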
Hi Mich,
Ok, my use-case is a bit different.
I have a Hive table partitioned by dates and need to do dynamic partition
updates(insert overwrite) daily for the last 30 days (partitions).
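A dynamic partition overwrite like this can be sketched in PySpark as follows (the table name and the DataFrame `df` are hypothetical; assumes an existing SparkSession `spark`):

```python
# Overwrite only the partitions present in `df`, leaving other partitions
# of the table untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

(df  # DataFrame holding the last 30 days, including the partition column
    .write
    .mode("overwrite")
    .insertInto("db.events_partitioned"))
```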
The ETL inside the staging directories is completed in hardly 5minutes, but
then renaming takes a lot of time as
Spark has no role in creating that hive staging directory. That directory
belongs to Hive and Spark simply does ETL there, loading to the Hive
managed table in your case, which ends up in the staging directory.
I suggest that you review your design and use an external hive table with
explicit location.
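The external-table suggestion could look like this sketch (database, columns, and GCS path are all hypothetical):

```python
# External Hive table over an explicit GCS location: dropping the table
# leaves the data in place, and loads can target the path directly.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS db.events_ext (
        id BIGINT,
        payload STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION 'gs://my-bucket/warehouse/events'
""")
```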
It does help performance but not significantly.
I am just wondering, once Spark creates that staging directory along with
the SUCCESS file, can we just do a gsutil rsync command and move these
files to the original directory? Has anyone tried this approach, or do you
foresee any
concern?
On Mon, 17 Jul 2023
Thanks Jay, is there any suggestion on how much I can increase those
parameters?
On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote:
Fileoutputcommitter v2 is supported in GCS, but the rename is a metadata
copy-and-delete operation in GCS, and therefore if there are a large number
of files it will take a long time to perform this step. One workaround will
be to create a smaller number of larger files if that is possible from Spark and
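Producing fewer, larger files before the write is typically done by coalescing or repartitioning; a sketch (target file count and table name are illustrative, not recommendations):

```python
# Fewer output files means fewer objects for GCS to "rename"
# (copy + delete) during the commit step.
(df
    .coalesce(64)  # or .repartition("event_date") to balance by partition
    .write
    .mode("overwrite")
    .insertInto("db.events_partitioned"))
```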
You said this Hive table was a managed table partitioned by date -->${TODAY}
How do you define your Hive managed table?
HTH
It does support it; it doesn't error out for me at least. But it took
around 4 hours to finish the job.
Interestingly, it took only 10 minutes to write the output in the staging
directory, and the rest of the time went into renaming the objects. That's
the concern.
Looks like a known issue, as Spark behaves
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
supported on GCS? IIRC it wasn't, but you could check with GCP support
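For completeness, the v2 committer mentioned above is usually enabled through the Hadoop configuration, e.g. via the session builder (a sketch; assumes PySpark):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # v2 commits task output directly into the destination, trading the
    # final job-level rename for per-task commits.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)
```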
On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote:
Thanks Jay,
I will try that option.
Any insight on the file committer algorithms?
I tried the v2 algorithm but it's not improving the runtime. What's the
best practice in Dataproc for dynamic partition updates in Spark?
On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote:
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
The definitions for these flags are available here -
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
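When set from Spark rather than in `core-site.xml`, these GCS connector options take the `spark.hadoop.` prefix; a sketch (the values here are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # GCS connector batching options documented in the linked
    # hadoop-connectors CONFIGURATION.md.
    .config("spark.hadoop.fs.gs.batch.threads", "32")
    .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
    .getOrCreate()
)
```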
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote:
No, I am using Spark 2.4 to update the GCS partitions. I have a managed
Hive table on top of this.
When I do a dynamic partition update with Spark, it creates the new files
in a staging area.
But the GCS blob renaming takes a lot of time. I have a partition based on
So you are using GCP and your Hive is installed on Dataproc which happens
to run your Spark as well. Is that correct?
What version of Hive are you using?
HTH
Hi All,
Of late, I have encountered an issue where I have to overwrite a lot of
partitions of a Hive table through Spark. It looks like writing to the
hive_staging_directory takes 25% of the total time, whereas 75% or more of
the time goes into moving the ORC files from the staging directory to the final