Hi there,

At the risk of stating the obvious, the first step is to ensure that your Spark application and S3 bucket are colocated in the same AWS region.
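
One way to verify the colocation advice above is to compare the bucket's region with the region the Spark cluster runs in. This is a minimal sketch, not part of the original reply. The helper names are hypothetical, and it assumes boto3 is installed and credentials are configured. S3's GetBucketLocation returns no LocationConstraint for us-east-1 and may return the legacy value "EU" for eu-west-1, so the value is normalized before comparing.

```python
from typing import Optional


def normalize_bucket_region(location_constraint: Optional[str]) -> str:
    """Map the LocationConstraint reported by S3 to a plain region name."""
    if not location_constraint:
        # Buckets in us-east-1 report an empty/None LocationConstraint.
        return "us-east-1"
    if location_constraint == "EU":
        # Legacy value that S3 may report for eu-west-1.
        return "eu-west-1"
    return location_constraint


def is_colocated(location_constraint: Optional[str], app_region: str) -> bool:
    """True when the bucket and the application are in the same region."""
    return normalize_bucket_region(location_constraint) == app_region


def bucket_location_constraint(bucket: str) -> Optional[str]:
    """Ask S3 for the bucket's LocationConstraint (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    response = boto3.client("s3").get_bucket_location(Bucket=bucket)
    return response["LocationConstraint"]
```

For example, is_colocated(bucket_location_constraint("my-bucket"), "ap-southeast-2") would report whether a bucket named "my-bucket" (a placeholder) lives in the same region as a cluster running in ap-southeast-2.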
Steve C

On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote:

How to optimize S3 file listing when using wholeTextFiles():

We are using wholeTextFiles() to read data from S3. As I understand it, wholeTextFiles() first lists the files under the given path. Because our input source is S3, that listing is single-threaded, and the S3 API for listing the keys in a bucket returns at most 1,000 keys per call. Since we have millions of files, we make thousands of API calls, and this listing makes our processing very slow. How can we make the S3 listing faster?

Thanks,
Rachana
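
To put a number on the listing cost described in the question, the arithmetic can be sketched as follows. This is an illustrative estimate only: it assumes the 1,000-keys-per-call page size mentioned above and a strictly sequential listing, and the function name is hypothetical.

```python
import math

# Maximum number of keys S3 returns per list call, as stated in the question.
MAX_KEYS_PER_LIST_CALL = 1000


def estimated_list_calls(num_objects: int, page_size: int = MAX_KEYS_PER_LIST_CALL) -> int:
    """Estimate how many sequential list calls are needed to enumerate num_objects keys."""
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Even an empty prefix costs one call to discover that it is empty.
    return max(1, math.ceil(num_objects / page_size))


if __name__ == "__main__":
    for count in (1_000, 1_000_000, 5_000_000):
        print(count, "objects ->", estimated_list_calls(count), "list calls")
```

With a few million objects this comes to several thousand round trips, each paying network latency, which is why the distance between the cluster and the bucket matters.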