Issue: We are using the wholeTextFiles() API to read files from S3, but this API is extremely SLOW for the reasons described below. The question is: how do we fix this?
Here is our analysis so FAR:
The issue is that we are using Spark's wholeTextFiles API to read S3 files. wholeTextFiles works in two steps: first, the driver/master lists all the S3 files; second, the driver/master splits that list of files and distributes the files across the worker nodes/executors for processing.

STEP 1. List all the S3 files under the given paths (we pass this path when we run every single gw/device/app step). The issue is that every single batch of every single report first has to list a large number of files. The main problem is that with S3, listing the files in a bucket is single threaded: the S3 API for listing the keys in a bucket returns keys in chunks of at most 1000 per call. The single-threaded listing just fetches 1000 keys at a time, so for a million files we are looking at 1000 sequential S3 API calls.
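One common workaround (our suggestion, not from the original analysis) is to parallelize the listing across disjoint key prefixes, so each worker pages through its own shard concurrently. A minimal sketch in plain Python, using a hypothetical `list_page` function and an in-memory `FAKE_BUCKET` in place of the real S3 ListObjectsV2 call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an S3 bucket: 4 prefixes x 2500 keys each.
FAKE_BUCKET = {f"logs/{shard:02d}/part-{i:05d}"
               for shard in range(4) for i in range(2500)}

def list_page(prefix, start_after="", page_size=1000):
    """Stand-in for one ListObjectsV2 call: up to page_size keys,
    lexicographically after start_after."""
    keys = sorted(k for k in FAKE_BUCKET
                  if k.startswith(prefix) and k > start_after)
    return keys[:page_size]

def list_all(prefix):
    """Sequentially page through one prefix, 1000 keys per call."""
    keys, last = [], ""
    while True:
        page = list_page(prefix, start_after=last)
        if not page:
            return keys
        keys.extend(page)
        last = page[-1]

# Parallelize across prefixes: wall time becomes roughly
# (pages per shard) instead of (total pages).
prefixes = [f"logs/{shard:02d}/" for shard in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(list_all, prefixes)
all_keys = [k for shard_keys in results for k in shard_keys]
print(len(all_keys))  # 10000
```

With real S3 this requires knowing (or probing) a set of prefixes that partition the keyspace; the per-prefix paging itself still pays 1000 keys per call.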

STEP 2. Compute the number of splits, which depends on the number of input partitions, and distribute the load across the worker nodes for processing.
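For reference, step 2 roughly amounts to packing the listed files into a bounded number of partitions by total size (wholeTextFiles uses a combine-style input format for this). The sketch below is our own simplified illustration of that idea, not Spark's actual code; the function name and the greedy packing policy are assumptions:

```python
def pack_into_partitions(files, min_partitions):
    """Greedy size-based packing (illustrative only): cap each
    partition at roughly total_size / min_partitions bytes."""
    total = sum(size for _, size in files)
    max_split = max(1, total // min_partitions)
    partitions, current, current_size = [], [], 0
    for name, size in files:
        # Start a new partition once the current one would overflow.
        if current and current_size + size > max_split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        partitions.append(current)
    return partitions

# 10 files of 1 KB each, asked to form at least 5 partitions:
files = [(f"part-{i:05d}", 1_000) for i in range(10)]
print(len(pack_into_partitions(files, 5)))  # 5
```

Each resulting partition becomes one task, so with many tiny files the per-file listing cost in step 1 dominates, not this packing step.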
