Hi,

For clarification, are those 12 / 14 minutes cumulative CPU time or wall
clock time? How many executors executed those 10000 / 375 tasks?

Cheers,
Enrico
-------- Original Message --------
From: Shashank Rao <shashank93...@gmail.com>
Date: 16.05.23 19:48 (GMT+01:00)
To: user@spark.apache.org
Subject: Understanding Spark S3 Read Performance

Hi,

I'm trying to set up a Spark pipeline which reads data from S3 and writes it
into Google BigQuery.

Environment details:
--------------------
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes

S3 directory structure:
-----------------------
bucket-name:
|---folder1:
    |---folder2:
        |---file1.jsonl
        |---file2.jsonl
        ...
        |---file12000.jsonl

Each file is of size 1.6 MB and there are a total of 12,000 files. The code
to read the source data looks like this:
----------------------------
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", <ACCESS_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", <SECRET_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com");

Dataset<Row> data = spark.read()
    .option("recursiveFileLookup", "true")
    .json("s3a://bucket-name/folder1/folder2/");
----------------------------

Now
here's my problem: this triggers two jobs: (a) Listing Leaf Nodes and
(b) JSON Read. The list job takes around 12 mins and has 10,000
partitions/tasks. The read job takes 14 mins and has 375 partitions/tasks.

1. What's going on here? Why does the list job take so much time? I could
   write a simple Java program using the AWS SDK which listed all the files
   in the exact same S3 directory in 5 seconds.
2. Why are there 10,000 partitions/tasks for listing? Can this be
   configured/changed? Is there any optimisation that can be done to reduce
   the time taken here?
3. What is the second job doing? Why are there 375 partitions in that job?

If this works out, I actually need to run the pipeline on a much larger data
set of around 200,000 files, and it doesn't look like it will scale very
well.

Please note, modifying the source data is not an option I have. Hence, I
cannot merge multiple small files into a single large file.

Any help is appreciated.

--
Thanks,
Shashank Rao
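P.S. On question 3: my back-of-envelope estimate of the read job's partition
count, assuming Spark's documented defaults (spark.sql.files.maxPartitionBytes
= 128 MB, spark.sql.files.openCostInBytes = 4 MB) and ignoring the
bytes-per-core term Spark also considers, comes out around 550 rather than
375, so EMR may be overriding one of these defaults. The arithmetic, as a
standalone sketch:

```java
public class PartitionEstimate {
    // Spark bins small files into read partitions: each file is charged its
    // real size plus openCostInBytes, and files are packed into partitions of
    // at most maxPartitionBytes. This ignores the bytes-per-core term, so it
    // is only a rough estimate, not Spark's exact algorithm.
    static long estimatePartitions(long numFiles, long fileBytes,
                                   long maxPartitionBytes, long openCostBytes) {
        long costPerFile = fileBytes + openCostBytes;               // padded size per file
        long filesPerPartition = maxPartitionBytes / costPerFile;   // files fitting in one bin
        return (numFiles + filesPerPartition - 1) / filesPerPartition; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // 12,000 files of ~1.6 MB each, 128 MB max partition size, 4 MB open cost
        long estimate = estimatePartitions(12_000, (long) (1.6 * mb), 128 * mb, 4 * mb);
        System.out.println("Estimated read partitions: " + estimate); // prints 546
    }
}
```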