One other possibility that might help is using the S3 SDK to generate the list you want, loading groups of paths into DataFrames, and doing unions at the end of the loading/filtering.

 

Something like:


import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

val s3Client = new AmazonS3Client()

// With a "/" delimiter, first-level "directories" come back as common prefixes.
val result = s3Client.listObjectsV2(new ListObjectsV2Request()
  .withBucketName("your-bucket")
  .withPrefix("prefix/to/dates/")
  .withDelimiter("/"))
val commonPrefixesToDate = result.getCommonPrefixes.asScala.map(p => s"s3://your-bucket/$p")

// Maybe get more prefixes depending on structure
// ....

// Read each group of 100 prefixes into a DataFrame in parallel.
val dfs = commonPrefixesToDate.grouped(100).toList.par
  .map(groupedParts => spark.read.parquet(groupedParts: _*))

// Union per group, then union the groups and coalesce the result.
val finalDF = dfs.seq.grouped(100).toList.par
  .map(dfgroup => dfgroup.reduce(_ union _))
  .reduce(_ union _)
  .coalesce(2000)

 

From: Ben Kaylor <kaylor...@gmail.com>
Date: Tuesday, March 16, 2021 at 3:23 PM
To: Boris Litvak <boris.lit...@skf.com>
Cc: Alchemist <alchemistsrivast...@gmail.com>, User <user@spark.apache.org>
Subject: Re: How to make bucket listing faster while using S3 with wholeTextFile

This is very helpful, Boris.

I will need to re-architect a piece of my code to work with this service, but I see it as more maintainable/stable long term.

I will be developing it over the course of a few weeks, so I will let you know how it goes.

 

On Tue, Mar 16, 2021, 2:05 AM Boris Litvak <boris.lit...@skf.com> wrote:

P.S.: 3. If fast updates are required, one way would be capturing S3 events and putting the paths, modification dates, etc. into DynamoDB/your DB of choice.
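
A minimal sketch of that capture path, assuming a Lambda function subscribed to the bucket's event notifications; the table name and attribute names here are assumptions, not an existing schema:

import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

object S3EventIndexer {
  private val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

  // Write one item per created/modified key, keyed by bucket and key.
  def handle(event: S3Event): Unit =
    event.getRecords.asScala.foreach { rec =>
      dynamo.putItem("s3-object-index", Map(   // assumed table name
        "bucket"     -> new AttributeValue(rec.getS3.getBucket.getName),
        "key"        -> new AttributeValue(rec.getS3.getObject.getKey),
        "modifiedAt" -> new AttributeValue(rec.getEventTime.toString)
      ).asJava)
    }
}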

 

From: Boris Litvak
Sent: Tuesday, 16 March 2021 9:03
To: Ben Kaylor <kaylor...@gmail.com>; Alchemist <alchemistsrivast...@gmail.com>
Cc: User <user@spark.apache.org>
Subject: RE: How to make bucket listing faster while using S3 with wholeTextFile

 

Ben, I’d explore these approaches:

  1. To address your problem, I'd set up an inventory for the S3 bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. Then you can list the files from the inventory instead of the bucket itself (see the sketch after this list). I have not tried this myself. Note that the inventory update is done once per day at most, and it's eventually consistent.
  2. If possible, I'd try to make bigger files. One can't do many things, such as streaming from scratch, when you have millions of files.
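
For item 1, a hedged sketch of consuming the inventory from Spark, assuming the report is configured with Parquet output; the destination path and column names are assumptions based on the default inventory schema:

import spark.implicits._

// Read the daily inventory report instead of listing the bucket.
val inventory = spark.read.parquet("s3://inventory-dest-bucket/your-bucket/daily-inventory/data/")

// Derive the paths to load from the inventory rows.
val paths = inventory
  .filter($"is_latest" && $"key".endsWith(".parquet"))
  .select($"key").as[String]
  .collect()
  .map(k => s"s3://your-bucket/$k")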

 

Please tell us if it helps & how it goes.

 

Boris

 

From: Ben Kaylor <kaylor...@gmail.com>
Sent: Monday, 15 March 2021 21:10
To: Alchemist <alchemistsrivast...@gmail.com>
Cc: User <user@spark.apache.org>
Subject: Re: How to make bucket listing faster while using S3 with wholeTextFile

 

Not sure of the answer to this, but I am solving similar issues, so I'm looking for additional feedback on how to do this.

 

My thought, if this can't be done via Spark and S3 boto commands, is to have apps self-report those changes: instead of having just mappers discovering the keys, you have services report to a metadata service that a new key has been created or modified, for incremental and more real-time updates.
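
As a hypothetical sketch of the self-reporting side (reportNewKey, the table name, and the attribute names are all assumptions, not an existing API), a writer would record each key right after producing it:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

def reportNewKey(bucket: String, key: String): Unit = {
  val dynamo = AmazonDynamoDBClientBuilder.defaultClient()
  dynamo.putItem("s3-key-metadata", Map(   // assumed table name
    "bucket"     -> new AttributeValue(bucket),
    "key"        -> new AttributeValue(key),
    "reportedAt" -> new AttributeValue(java.time.Instant.now.toString)
  ).asJava)
}

// e.g. right after df.write.parquet(s"s3://$bucket/$prefix"):
// reportNewKey(bucket, prefix)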

 

Would like to hear more ideas on this, thanks

David

 

 

 

On Mon, Mar 15, 2021, 11:31 AM Alchemist <alchemistsrivast...@gmail.com> wrote:

How to optimize S3 listing when using wholeTextFile(): We are using wholeTextFile to read data from S3. As per my understanding, wholeTextFile first lists the files at the given path. Since we are using S3 as the input source, listing the files in a bucket is single-threaded, and the S3 API for listing the keys in a bucket only returns keys in chunks of 1000 per call. Since we have millions of files, we are making thousands of API calls, and this listing makes our processing very slow. How can we make the listing of S3 faster?
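
For illustration, a hedged sketch of that paginated listing with the AWS SDK (the bucket name is a placeholder); each call returns at most 1000 keys, which is why millions of files mean thousands of sequential round trips:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new ListObjectsV2Request().withBucketName("your-bucket")
val keys = ArrayBuffer[String]()
var result: ListObjectsV2Result = null
do {
  result = s3.listObjectsV2(req)   // one round trip, at most 1000 keys
  keys ++= result.getObjectSummaries.asScala.map(_.getKey)
  req.setContinuationToken(result.getNextContinuationToken)
} while (result.isTruncated)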

 

Thanks,

 

Rachana
