Hi,

For clarification: are those 12 / 14 minutes cumulative CPU time or wall-clock time? How many executors executed those 10,000 / 375 tasks?

Cheers,
Enrico
-------- Original Message --------
From: Shashank Rao <shashank93...@gmail.com>
Date: 16.05.23 19:48 (GMT+01:00)
To: user@spark.apache.org
Subject: Understanding Spark S3 Read Performance

Hi,

I'm trying to set up a Spark pipeline that reads data from S3 and writes it into Google BigQuery.

Environment details:
-----------------------
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes

S3 directory structure:
-----------
bucket-name:
|---folder1:
    |---folder2:
        |---file1.jsonl
        |---file2.jsonl
        ...
        |---file12000.jsonl

Each file is 1.6 MB in size, and there are 12,000 files in total (roughly 19 GB of source data). The code that reads the source data looks like this:
----------------------------
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", <ACCESS_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", <SECRET_KEY>);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com");

Dataset<Row> data = spark.read()
        .option("recursiveFileLookup", "true")
        .json("s3a://bucket-name/folder1/folder2/");
----------------------------
Now, here's my problem: this triggers two jobs, a) "Listing Leaf Nodes" and b) "JSON Read". The list job takes around 12 minutes and has 10,000 partitions/tasks. The read job takes 14 minutes and has 375 partitions/tasks.

1. What's going on here? Why does the list job take so much time? I was able to write simple Java code using the AWS SDK that listed all the files in the exact same S3 directory in 5 seconds (see the sketch below).
2. Why are there 10,000 partitions/tasks for listing? Can this be configured or changed? Is there any optimisation that can be done to reduce the time taken here?
3. What is the second job doing, and why does it have 375 partitions?
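For reference, that SDK listing is essentially the following (a minimal sketch using the AWS SDK for Java v1; the bucket and prefix are the placeholders from above, and credentials come from the default provider chain):
----------------------------
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

public class ListS3Files {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

        // List everything under the prefix. Each response page holds up to
        // 1,000 keys, so ~12,000 files take only ~12 HTTP round trips.
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName("bucket-name")
                .withPrefix("folder1/folder2/");

        int count = 0;
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(req);
            count += result.getObjectSummaries().size();
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());

        System.out.println("Listed " + count + " objects");
    }
}
----------------------------
This finishes in a few seconds, which is why the 12-minute listing job surprises me.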
If this works out, I actually need to run the pipeline on a much larger data set of around 200,000 files, and it doesn't look like it will scale very well.

Please note that modifying the source data is not an option for me, so I cannot merge multiple small files into a single large file.

Any help is appreciated.

--
Thanks,
Shashank Rao
