I'm not sure of the answer here, but I'm solving similar issues, so I'm also
looking for additional feedback on how to do this.

My thought: if this can't be done via Spark and S3 boto commands, then have the
apps self-report those changes. Instead of having only the mappers discover
the keys, the services report to a metadata service whenever a key is created
or modified, which gives incremental and closer to real-time updates.
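
A minimal sketch of what I mean by self-reporting, in Python. Everything
here is hypothetical: the KeyRegistry class, the s3_keys table, and the use
of SQLite as the store are just assumptions for illustration, and a real
metadata service would more likely sit behind an API or a queue.

```python
import sqlite3
import time


class KeyRegistry:
    """Hypothetical metadata store: writers report keys, readers ask what changed."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS s3_keys ("
            "bucket TEXT, key TEXT, modified REAL, PRIMARY KEY (bucket, key))"
        )

    def report(self, bucket, key, modified=None):
        """Called by a service right after it creates or modifies an object."""
        ts = time.time() if modified is None else modified
        # A re-reported key overwrites its previous row, so only the latest
        # modification time is kept.
        self.db.execute(
            "INSERT OR REPLACE INTO s3_keys (bucket, key, modified) VALUES (?, ?, ?)",
            (bucket, key, ts),
        )
        self.db.commit()

    def changed_since(self, bucket, since):
        """Return the keys in `bucket` reported after the timestamp `since`."""
        rows = self.db.execute(
            "SELECT key FROM s3_keys WHERE bucket = ? AND modified > ? ORDER BY key",
            (bucket, since),
        )
        return [row[0] for row in rows]
```

The Spark job would then ask changed_since() for the keys touched since its
last run and read only those paths, rather than listing the whole bucket.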

I'd like to hear more ideas on this. Thanks,
David




On Mon, Mar 15, 2021, 11:31 AM Alchemist <alchemistsrivast...@gmail.com>
wrote:

> *How to optimize S3 file listing with wholeTextFiles()*: We are using
> wholeTextFiles to read data from S3.  As I understand it, wholeTextFiles
> first lists the files under the given path.  Since we are using S3 as the
> input source, listing the files in a bucket is single-threaded, and the S3
> API for listing the keys in a bucket returns at most 1000 keys per call.
> Since we have millions of files, we are making thousands of API calls.  This
> listing makes our processing very slow. How can we make the S3 listing faster?
>
> Thanks,
>
> Rachana
>

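On the boto side of the listing problem described in the quoted message, one
option is to list several key prefixes concurrently, so the 1000-keys-per-call
pages are fetched in parallel rather than one after another. This is only a
sketch: it assumes the keys are spread over prefixes you already know (for
example date partitions), that `client` is a boto3 S3 client, and the function
names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def list_keys(client, bucket, prefix):
    """List every key under one prefix; the paginator follows continuation tokens."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def list_keys_parallel(client, bucket, prefixes, max_workers=16):
    """List several prefixes concurrently and return one flat list of keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as `prefixes`.
        per_prefix = pool.map(lambda p: list_keys(client, bucket, p), prefixes)
        return [key for keys in per_prefix for key in keys]
```

With boto3 installed, this would be called along the lines of
list_keys_parallel(boto3.client("s3"), "my-bucket", ["2021/03/14/", "2021/03/15/"]),
where the bucket name and prefixes are placeholders. It only helps if the keys
really are partitioned by prefix; a single flat prefix still lists serially.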