This is very helpful, Boris. I will need to re-architect a piece of my code to work with this service, but I see it as more maintainable and stable long term. I will be developing it over the course of a few weeks, so I will let you know how it goes.
On Tue, Mar 16, 2021, 2:05 AM Boris Litvak <boris.lit...@skf.com> wrote:

> P.S.: 3. If fast updates are required, one way would be capturing S3
> events & putting the paths/modification dates/etc. of the paths into
> DynamoDB/your DB of choice.
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor <kaylor...@gmail.com>; Alchemist <alchemistsrivast...@gmail.com>
> *Cc:* User <user@spark.apache.org>
> *Subject:* RE: How to make bucket listing faster while using S3 with wholeTextFile
>
> Ben, I'd explore these approaches:
>
> 1. To address your problem, I'd set up an inventory for the S3 bucket:
>    https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
>    Then you can list the files from the inventory. I have not tried this
>    myself. Note that the inventory update is done once per day at most,
>    and it's eventually consistent.
> 2. If possible, I'd try to make bigger files. One can't do many things,
>    such as streaming from scratch, when you have millions of files.
>
> Please tell us if it helps & how it goes.
>
> Boris
>
> *From:* Ben Kaylor <kaylor...@gmail.com>
> *Sent:* Monday, 15 March 2021 21:10
> *To:* Alchemist <alchemistsrivast...@gmail.com>
> *Cc:* User <user@spark.apache.org>
> *Subject:* Re: How to make bucket listing faster while using S3 with wholeTextFile
>
> I'm not sure of the answer to this, but I am solving similar issues, so I'm
> looking for additional feedback on how to do this.
>
> My thought: if this can't be done via Spark and S3 boto commands, then have
> the apps self-report those changes. Instead of having just mappers
> discovering the keys, you have services self-report that a new key has been
> created or modified to a metadata service, for incremental and more
> real-time updates.
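Boris's P.S. (capture S3 events and write the paths/modification dates into DynamoDB) could be sketched roughly as the Lambda handler below. The table name and item schema are my own assumptions, not anything from this thread; boto3 is imported lazily so the parsing helper can be exercised without AWS dependencies:

```python
from urllib.parse import unquote_plus

# Hypothetical table name -- adjust to your own setup.
TABLE_NAME = "s3-object-index"

def records_to_items(event):
    """Turn an S3 notification event into DynamoDB-ready items.
    Pure function, so it is easy to unit test in isolation."""
    items = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        items.append({
            "bucket": rec["s3"]["bucket"]["name"],
            "key": unquote_plus(obj["key"]),   # keys arrive URL-encoded
            "size": obj.get("size", 0),
            "event_time": rec["eventTime"],
            "event_name": rec["eventName"],    # e.g. "ObjectCreated:Put"
        })
    return items

def lambda_handler(event, context):
    """Triggered by S3 event notifications; upserts one item per object."""
    import boto3  # imported here so the module loads without AWS libs
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    with table.batch_writer() as batch:
        for item in records_to_items(event):
            batch.put_item(Item=item)
```

Your Spark job could then read the index table (or a filtered query of it) instead of listing the bucket, which is what makes the near-real-time variant of Ben's "self-reporting" idea work.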
> Would like to hear more ideas on this, thanks.
>
> David
>
> On Mon, Mar 15, 2021, 11:31 AM Alchemist <alchemistsrivast...@gmail.com> wrote:
>
>> *How to optimize S3 listing when using wholeTextFile()*: We are using
>> wholeTextFile to read data from S3. As I understand it, wholeTextFile
>> first lists the files under the given path. Since we are using S3 as the
>> input source, listing the files in a bucket is single-threaded, and the
>> S3 API for listing the keys in a bucket only returns keys in chunks of
>> 1,000 per call. Since we have millions of files, we are making thousands
>> of API calls, and this listing makes our processing very slow. How can we
>> make the S3 listing faster?
>>
>> Thanks,
>>
>> Rachana
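On the single-threaded listing problem Rachana describes: the 1,000-keys-per-call limit of ListObjectsV2 is fixed by the API, but the calls can be fanned out across key prefixes and run concurrently. A rough sketch, assuming hex-sharded prefixes (this only helps if your keys actually spread across those prefixes; boto3 is imported lazily so the prefix helper can be tested offline):

```python
from concurrent.futures import ThreadPoolExecutor

def hex_prefixes(base="", depth=1):
    """Generate candidate key prefixes to shard the listing:
    depth=1 -> '0'..'f' (16 prefixes), depth=2 -> '00'..'ff' (256)."""
    digits = "0123456789abcdef"
    prefixes = [base]
    for _ in range(depth):
        prefixes = [p + d for p in prefixes for d in digits]
    return prefixes

def list_keys(bucket, prefix):
    """List all keys under one prefix; the paginator issues the
    successive 1,000-key ListObjectsV2 calls for us."""
    import boto3  # imported here so the helpers above need no AWS libs
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def parallel_list(bucket, prefixes, workers=16):
    """Run one listing per prefix concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: list_keys(bucket, p), prefixes)
    return [key for keys in results for key in keys]
```

For millions of objects this still costs thousands of API calls in total, just spread over many threads, which is why the inventory and event-capture approaches above scale better when a daily or near-real-time index is acceptable.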