One other possibility that might help is using the S3 SDK to generate the list of paths you want, loading groups of them into DataFrames, and doing unions at the end of the loading/filtering. Something like:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

val s3Client = AmazonS3ClientBuilder.defaultClient()
val result = s3Client.listObjectsV2(new ListObjectsV2Request()
  .withBucketName("your-bucket").withPrefix("prefix/to/dates").withDelimiter("/"))
// Maybe get more prefixes depending on structure ....
// Turn each common prefix into a full input path (the s3a:// scheme is assumed here)
val commonPrefixesToDate = result.getCommonPrefixes.asScala.map(p => s"s3a://your-bucket/$p")
val dfs = commonPrefixesToDate.grouped(100).toList.par
  .map(groupedParts => spark.read.parquet(groupedParts: _*))
val finalDF = dfs.seq.grouped(100).toList.par
  .map(dfgroup => dfgroup.reduce(_ union _))
  .reduce(_ union _).coalesce(2000)

Note that listObjectsV2 returns at most 1000 entries per call, so depending on how many date partitions you have you may need to paginate; see the sketch after the quoted reply below.

From: Ben Kaylor <kaylor...@gmail.com>

This is very helpful, Boris. I will need to re-architect a piece of my code to work with this service, but I see it as more maintainable and stable long term. I will be developing it out over the course of a few weeks, so I will let you know how it goes.

On Tue, Mar 16, 2021, 2:05 AM Boris Litvak <boris.lit...@skf.com> wrote:
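A minimal pagination sketch, assuming the AWS Java SDK v1: S3 caps each ListObjectsV2 response at 1000 entries, so if there are more prefixes than that you have to follow continuation tokens until the listing is complete. The bucket name and prefix below are placeholders.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

val s3 = AmazonS3ClientBuilder.defaultClient()
val request = new ListObjectsV2Request()
  .withBucketName("your-bucket").withPrefix("prefix/to/dates").withDelimiter("/")
val prefixes = ListBuffer[String]()
var result = s3.listObjectsV2(request)
prefixes ++= result.getCommonPrefixes.asScala
// Keep requesting pages until S3 reports the listing is complete
while (result.isTruncated) {
  request.setContinuationToken(result.getNextContinuationToken)
  result = s3.listObjectsV2(request)
  prefixes ++= result.getCommonPrefixes.asScala
}

The collected prefixes can then be mapped to full paths and fed into the grouped parquet reads shown above.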