Issue: We are using the wholeTextFiles() API to read files from S3, but this API is extremely slow for the reasons described below. The question is how to fix this. Here is our analysis so far: we are using Spark's wholeTextFiles API to read S3 files. wholeTextFiles works in two steps: first, the driver/master lists all the S3 files; second, the driver/master splits that list of files and distributes the files across the worker nodes and executors for processing.
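The two-step flow above can be sketched in plain Python. This is a simplified model, not Spark internals: `fake_listing` and `split_for_partitions` are hypothetical names standing in for the driver's S3 listing and its partition assignment.

```python
# Minimal sketch of the two-step wholeTextFiles() flow, assuming an
# in-memory stand-in for the S3 listing. `fake_listing` and
# `split_for_partitions` are illustrative, not Spark/AWS APIs.

def split_for_partitions(files, num_partitions):
    """Step 2: round-robin the listed files across partitions."""
    splits = [[] for _ in range(num_partitions)]
    for i, f in enumerate(files):
        splits[i % num_partitions].append(f)
    return splits

# Step 1: the driver lists every file under the input path
# (this is the slow, single-threaded part on S3).
fake_listing = [f"s3://bucket/reports/part-{i:05d}" for i in range(10)]

# Step 2: the driver assigns the listed files to executor partitions.
splits = split_for_partitions(fake_listing, num_partitions=4)
print([len(s) for s in splits])  # [3, 3, 2, 2]
```

The point is that step 2 is cheap driver-side bookkeeping; the cost we are hitting is entirely in step 1.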
STEP 1. List all the S3 files under the given paths (we pass these paths when we run every single gw/device/app step). The issue is that every single batch of every single report starts by listing this set of files. The main problem is that listing the files in an S3 bucket is single-threaded: the S3 API for listing keys in a bucket returns keys in chunks of at most 1,000 per call, and each call depends on the previous call's continuation marker. So the listing proceeds 1,000 keys at a time, and for a million files we are looking at roughly 1,000 sequential single-threaded S3 API calls.

STEP 2. Determine the number of splits based on the number of input partitions and distribute the load to the worker nodes for processing.
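The cost of STEP 1 can be made concrete with a small simulation of ListObjectsV2-style paging. This is a sketch, not the AWS SDK: `list_page` is a hypothetical stand-in for one listing call that returns at most 1,000 keys plus a truncation flag.

```python
# Simulate sequential S3 listing: each call returns at most `page_size`
# keys, and the next call cannot start until this one returns (it needs
# the continuation position). `list_page` is a hypothetical stand-in
# for one ListObjectsV2-style request.

def list_page(all_keys, start, page_size=1000):
    page = all_keys[start:start + page_size]
    next_start = start + len(page)
    truncated = next_start < len(all_keys)
    return page, next_start, truncated

def list_all_keys(all_keys):
    """Page through the whole listing sequentially, counting API calls."""
    listed, start, calls = [], 0, 0
    while True:
        page, start, truncated = list_page(all_keys, start)
        listed.extend(page)
        calls += 1
        if not truncated:
            break
    return listed, calls

# A million objects at 1,000 keys per call -> 1,000 sequential calls.
listed, calls = list_all_keys(range(1_000_000))
print(calls)  # 1000
```

Even at a modest ~50 ms per round trip, 1,000 strictly sequential calls is on the order of a minute of pure listing latency per batch, before any data is read.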