Re: How to make bucket listing faster while using S3 with wholeTextFile

Stephen Coy Mon, 15 Mar 2021 15:56:17 -0700

Hi there,

At risk of stating the obvious, the first step is to ensure that your Spark 
application and S3 bucket are colocated in the same AWS region.


Steve C

On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote:

How can we optimize S3 file listing when using wholeTextFiles()? We are using 
wholeTextFiles to read data from S3.  As I understand it, wholeTextFiles 
first lists the files under the given path.  Since our input source is S3, 
listing the files in a bucket is single-threaded, and the S3 API for listing 
keys returns at most 1000 keys per call.  Since we have millions of files, we 
are making thousands of API calls.  This listing makes our processing very 
slow.  How can we make S3 listing faster?

Thanks,

Rachana
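
To put rough numbers on the question above: at 1000 keys per list call, a 
million keys under one prefix needs at least 1000 calls, made one after 
another. One possible workaround, assuming the keys are spread across a known 
set of prefixes (the original post does not say how the bucket is laid out), 
is to list each prefix concurrently. The sketch below is only illustrative: 
`list_page` is a hypothetical callable standing in for a real S3 list call, 
and `make_fake_lister` is an in-memory stand-in so the example runs without 
AWS access.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Tuple

PAGE_SIZE = 1000  # S3 list calls return at most 1000 keys per response

# Hypothetical signature: (prefix, continuation_token) -> (keys, next_token)
ListPage = Callable[[str, Optional[str]], Tuple[List[str], Optional[str]]]


def min_list_calls(num_keys: int, page_size: int = PAGE_SIZE) -> int:
    """Lower bound on sequential list calls for num_keys under one prefix."""
    return math.ceil(num_keys / page_size)


def list_prefix(list_page: ListPage, prefix: str) -> List[str]:
    """Page through one prefix sequentially until no token is returned."""
    keys: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_page(prefix, token)
        keys.extend(page)
        if token is None:
            return keys


def list_prefixes_parallel(
    list_page: ListPage, prefixes: Iterable[str], max_workers: int = 16
) -> List[str]:
    """List several prefixes concurrently; result order follows `prefixes`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: list_prefix(list_page, p), prefixes)
        return [key for keys in results for key in keys]


def make_fake_lister(store: Dict[str, List[str]]) -> ListPage:
    """In-memory stand-in for S3 paging, used only for this illustration."""
    def list_page(prefix: str, token: Optional[str]):
        keys = store[prefix]
        start = int(token) if token else 0
        end = start + PAGE_SIZE
        next_token = str(end) if end < len(keys) else None
        return keys[start:end], next_token
    return list_page
```

With boto3, `list_page` would presumably wrap `list_objects_v2`, passing 
`Prefix` and `ContinuationToken` and reading `Contents` and 
`NextContinuationToken` from the response. Whether handing the resulting 
paths to wholeTextFiles actually removes the bottleneck depends on the S3 
connector and Hadoop version in use, so treat this as a starting point to 
measure rather than a guaranteed fix.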

