Hi,
I'm trying to set up a Spark pipeline that reads data from S3 and writes
it to Google BigQuery.

Environment Details:
-----------------------
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes


S3 Directory structure:
-----------
bucket-name:
|---folder1:
  |---folder2:
     |---file1.jsonl
     |---file2.jsonl
       ...
     |---file12000.jsonl


Each file is about 1.6 MB in size, and there are 12,000 files in total (roughly 19 GB of data).

The code to read the source data looks like this:
----------------------------
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", <ACCESS_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", <SECRET_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com");
Dataset<Row> data = spark.read()
        .option("recursiveFileLookup", "true")
        .json("s3a://bucket-name/folder1/folder2/");
----------------------------

Now here's my problem:
This triggers two jobs: a) Listing Leaf Nodes, b) Json Read

[image: Screenshot 2023-05-16 at 23.00.23.png]

The list job takes around 12 mins and has 10,000 partitions/tasks. The read
job takes 14 mins and has 375 partitions/tasks.

1. What's going on here? Why does the listing job take so much time? For
comparison, a simple Java program using the AWS SDK lists all the files in
the exact same S3 directory in about 5 seconds.
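
For reference, a minimal sketch of the kind of listing I mean (AWS SDK for
Java v2; the region, class name, and client setup below are illustrative,
not the exact code I ran):
----------------------------
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;
import software.amazon.awssdk.services.s3.paginators.ListObjectsV2Iterable;

public class S3ListingCheck {
    public static void main(String[] args) {
        // Region is a placeholder; credentials come from the default
        // provider chain (env vars / instance profile on EMR).
        S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build();

        ListObjectsV2Request request = ListObjectsV2Request.builder()
                .bucket("bucket-name")
                .prefix("folder1/folder2/")
                .build();

        // The paginator issues follow-up requests transparently
        // (1,000 keys per page), so this walks all 12,000 objects.
        ListObjectsV2Iterable pages = s3.listObjectsV2Paginator(request);
        long count = 0;
        for (S3Object obj : pages.contents()) {
            count++;
        }
        System.out.println("Listed " + count + " objects");
        s3.close();
    }
}
----------------------------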

2. Why are there 10,000 partitions/tasks for listing? Can this be
configured or changed? Is there any optimisation that can be done to
reduce the time taken here?

3. What is the second job doing? Why are there 375 partitions in that job?

If this works out, I actually need to run the pipeline on a much larger
dataset of around 200,000 files, and it doesn't look like it will scale
very well.

Please note that modifying the source data is not an option for me, so I
cannot merge multiple small files into a single large file.

Any help is appreciated.


-- 
Thanks,
Shashank Rao
