Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
I am currently just writing this in Python on a small batch machine, using the
boto3 S3 client with s3.get_paginator('list_objects_v2') and
NextContinuationToken.

It is pretty fast on a single bucket, listing everything across about a million
keys, but I have not extended it to the other buckets yet.

Example code

import boto3
import pandas as pd

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
# Placeholders left as-is; ContinuationToken resumes from a previously saved token.
pages = paginator.paginate(Bucket=..., Prefix=..., ContinuationToken=...)

df_arr = []
for page_num, page_val in enumerate(pages):
    temp_df = pd.json_normalize(page_val)
    df_arr.append(temp_df)

final_df = pd.concat(df_arr)
ContToken = final_df.NextContinuationToken.values[-1]

I was going to run it per bucket, so I can then run multiple paginations on
different buckets in parallel. Alternatively, I'm now looking at AWS S3
Inventory as a way to generate the keys; it includes metadata and head-object
info as well, like storage class.
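
For illustration, a rough sketch of running that same pagination over several
buckets in parallel with a thread pool (untested; the bucket names and the
fields pulled from each object are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

def list_bucket(bucket, prefix=""):
    # One client per thread keeps this thread-safe.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    rows = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            rows.append((obj["Key"], obj["Size"], obj.get("StorageClass")))
    return bucket, rows

buckets = ["bucket-a", "bucket-b", "bucket-c"]  # placeholder names
with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
    for bucket, rows in pool.map(list_bucket, buckets):
        print(bucket, len(rows))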



On Tue, Mar 16, 2021, 2:40 PM brandonge...@gmail.com 
wrote:

> One other possibility that might help is using the S3 SDK to generate the
> list you want, loading groups into DataFrames, and doing unions at the end
> of the loading/filtering.
>
>
>
> Something like
>
>
> import com.amazonaws.services.s3.AmazonS3Client
>
> import com.amazonaws.services.s3.model.ListObjectsV2Request
>
> import scala.collection.JavaConverters._
>
>
>
> val s3Client = new AmazonS3Client()
>
> val commonPrefixesToDate = s3Client.listObjectsV2(new
> ListObjectsV2Request().withBucketName("your-bucket").withPrefix("prefix/to/dates").withDelimiter("/"))
>
> // Maybe get more prefixes depending on structure
>
> val prefixPaths = commonPrefixesToDate.getCommonPrefixes.asScala
> .map(p => "s3a://your-bucket/" + p).toList
>
> val dfs = prefixPaths.grouped(100).toList.par.map(groupedParts =>
> spark.read.parquet(groupedParts: _*))
>
> val finalDF = dfs.seq.grouped(100).toList.par.map(dfgroup =>
> dfgroup.reduce(_ union _)).reduce(_ union _).coalesce(2000)
>
>
>
> *From: *Ben Kaylor 
> *Date: *Tuesday, March 16, 2021 at 3:23 PM
> *To: *Boris Litvak 
> *Cc: *Alchemist , User <
> user@spark.apache.org>
> *Subject: *Re: How to make bucket listing faster while using S3 with
> wholeTextFile
>
> This is very helpful Boris.
>
> I will need to re-architect a piece of my code to work with this service
> but see it as more maintainable/stable long term.
>
> I will be developing it out over the course of a few weeks so will let you
> know how it goes.
>
>
>
> On Tue, Mar 16, 2021, 2:05 AM Boris Litvak  wrote:
>
> P.S.: 3. If fast updates are required, one way would be capturing S3
> events and putting the paths, modification dates, etc. into DynamoDB /
> your DB of choice.
>
>
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor ; Alchemist <
> alchemistsrivast...@gmail.com>
> *Cc:* User 
> *Subject:* RE: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Ben, I’d explore these approaches:
>
> 1. To address your problem, I’d set up an inventory for the S3 bucket:
>    https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
>    Then you can list the files from the inventory. Have not tried this myself.
>    Note that the inventory update is done once per day, at most, and it’s
>    eventually consistent.
> 2. If possible, I would try to make bigger files. One can’t do many
>    things, such as streaming from scratch, when you have millions of files.
>
>
>
> Please tell us if it helps & how it goes.
>
>
>
> Boris
>
>
>
> *From:* Ben Kaylor 
> *Sent:* Monday, 15 March 2021 21:10
> *To:* Alchemist 
> *Cc:* User 
> *Subject:* Re: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Not sure of the answer to this, but I am solving similar issues, so I'm
> looking for additional feedback on how to do this.
>
>
>
> My thought is that if this can't be done via Spark and S3 boto commands,
> then have the apps self-report those changes: instead of having just the
> mappers discover the keys, you have services report that a new key has been
> created or modified to a metadata service, for incremental and more
> real-time updates.
>
>
>
> Would like to hear more ideas on this, thanks
>
> David
>
>
>
>
>
>
>
> On Mon, Mar 15, 2021, 11:31 AM Alchemist 
> wrote:
>
> *How to optimize S3 file listing when using wholeTextFile()*: We are using
> wholeTextFile to read data from S3. As per my understanding, wholeTextFile
> first lists the files under the given path. Since we are using S3 as the
> input source, listing files in a bucket is single-threaded, and the S3 API
> for listing keys only returns them in chunks of 1,000 per call. Since we
> have millions of files, we are making thousands of API calls. This listing
> makes our processing very slow. How can we make the S3 listing faster?
>
>
>
> Thanks,
>
>
>
> Rachana
>
>


Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread brandonge...@gmail.com
One other possibility that might help is using the S3 SDK to generate the list
you want, loading groups into DataFrames, and doing unions at the end of the
loading/filtering.

Something like:

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

val s3Client = new AmazonS3Client()
val commonPrefixesToDate = s3Client.listObjectsV2(new ListObjectsV2Request().withBucketName("your-bucket").withPrefix("prefix/to/dates").withDelimiter("/"))

// Maybe get more prefixes depending on structure
val prefixPaths = commonPrefixesToDate.getCommonPrefixes.asScala
  .map(p => "s3a://your-bucket/" + p).toList

val dfs = prefixPaths.grouped(100).toList.par.map(groupedParts =>
  spark.read.parquet(groupedParts: _*))

val finalDF = dfs.seq.grouped(100).toList.par.map(dfgroup =>
  dfgroup.reduce(_ union _)).reduce(_ union _).coalesce(2000)




Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
This is very helpful Boris.
I will need to re-architect a piece of my code to work with this service
but see it as more maintainable/stable long term.
I will be developing it out over the course of a few weeks so will let you
know how it goes.


On Tue, Mar 16, 2021, 2:05 AM Boris Litvak  wrote:

> P.S.: 3. If fast updates are required, one way would be capturing S3
> events and putting the paths, modification dates, etc. into DynamoDB /
> your DB of choice.
>
>
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor ; Alchemist <
> alchemistsrivast...@gmail.com>
> *Cc:* User 
> *Subject:* RE: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Ben, I’d explore these approaches:
>
> 1. To address your problem, I’d set up an inventory for the S3 bucket:
>    https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
>    Then you can list the files from the inventory. Have not tried this myself.
>    Note that the inventory update is done once per day, at most, and it’s
>    eventually consistent.
> 2. If possible, I would try to make bigger files. One can’t do many
>    things, such as streaming from scratch, when you have millions of files.
>
>
>
> Please tell us if it helps & how it goes.
>
>
>
> Boris
>
>
>
> *From:* Ben Kaylor 
> *Sent:* Monday, 15 March 2021 21:10
> *To:* Alchemist 
> *Cc:* User 
> *Subject:* Re: How to make bucket listing faster while using S3 with
> wholeTextFile
>
>
>
> Not sure of the answer to this, but I am solving similar issues, so I'm
> looking for additional feedback on how to do this.
>
>
>
> My thought is that if this can't be done via Spark and S3 boto commands,
> then have the apps self-report those changes: instead of having just the
> mappers discover the keys, you have services report that a new key has been
> created or modified to a metadata service, for incremental and more
> real-time updates.
>
>
>
> Would like to hear more ideas on this, thanks
>
> David
>
>
>
>
>
>
>
> On Mon, Mar 15, 2021, 11:31 AM Alchemist 
> wrote:
>
> *How to optimize S3 file listing when using wholeTextFile()*: We are using
> wholeTextFile to read data from S3. As per my understanding, wholeTextFile
> first lists the files under the given path. Since we are using S3 as the
> input source, listing files in a bucket is single-threaded, and the S3 API
> for listing keys only returns them in chunks of 1,000 per call. Since we
> have millions of files, we are making thousands of API calls. This listing
> makes our processing very slow. How can we make the S3 listing faster?
>
>
>
> Thanks,
>
>
>
> Rachana
>
>


RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
P.S.: 3. If fast updates are required, one way would be capturing S3 events and
putting the paths, modification dates, etc. into DynamoDB / your DB of choice.
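
As an illustration only, a minimal sketch of that event capture, assuming an S3
ObjectCreated notification wired to a Lambda and a DynamoDB table named
"s3-object-index" (both names are hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("s3-object-index")  # hypothetical table, keyed on bucket + key

def handler(event, context):
    # Each record describes one created or modified object.
    for record in event.get("Records", []):
        s3_info = record["s3"]
        table.put_item(Item={
            "bucket": s3_info["bucket"]["name"],
            "key": s3_info["object"]["key"],
            "size": s3_info["object"].get("size", 0),
            "event_time": record["eventTime"],
        })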

From: Boris Litvak
Sent: Tuesday, 16 March 2021 9:03
To: Ben Kaylor ; Alchemist 
Cc: User 
Subject: RE: How to make bucket listing faster while using S3 with wholeTextFile

Ben, I’d explore these approaches:

  1.  To address your problem, I’d set up an inventory for the S3 bucket:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
Then you can list the files from the inventory. Have not tried this myself.
Note that the inventory update is done once per day, at most, and it’s
eventually consistent.
  2.  If possible, I would try to make bigger files. One can’t do many things,
such as streaming from scratch, when you have millions of files.

Please tell us if it helps & how it goes.

Boris

From: Ben Kaylor mailto:kaylor...@gmail.com>>
Sent: Monday, 15 March 2021 21:10
To: Alchemist 
mailto:alchemistsrivast...@gmail.com>>
Cc: User mailto:user@spark.apache.org>>
Subject: Re: How to make bucket listing faster while using S3 with wholeTextFile

Not sure of the answer to this, but I am solving similar issues, so I'm looking
for additional feedback on how to do this.

My thought is that if this can't be done via Spark and S3 boto commands, then
have the apps self-report those changes: instead of having just the mappers
discover the keys, you have services report that a new key has been created or
modified to a metadata service, for incremental and more real-time updates.

Would like to hear more ideas on this, thanks
David



On Mon, Mar 15, 2021, 11:31 AM Alchemist 
mailto:alchemistsrivast...@gmail.com>> wrote:
How to optimize S3 file listing when using wholeTextFile(): We are using
wholeTextFile to read data from S3. As per my understanding, wholeTextFile
first lists the files under the given path. Since we are using S3 as the input
source, listing files in a bucket is single-threaded, and the S3 API for
listing keys only returns them in chunks of 1,000 per call. Since we have
millions of files, we are making thousands of API calls. This listing makes our
processing very slow. How can we make the S3 listing faster?

Thanks,

Rachana


RE: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Boris Litvak
Ben, I’d explore these approaches:

  1.  To address your problem, I’d set up an inventory for the S3 bucket:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html.
Then you can list the files from the inventory. Have not tried this myself.
Note that the inventory update is done once per day, at most, and it’s
eventually consistent. (A minimal Spark sketch of reading the inventory
follows this list.)
  2.  If possible, I would try to make bigger files. One can’t do many things,
such as streaming from scratch, when you have millions of files.
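
A minimal PySpark sketch of point 1, assuming a Parquet-format inventory
delivered to the hypothetical destination path below, and assuming size, last
modified date, and storage class were selected as optional inventory fields
('spark' is the usual SparkSession):

# One scan of the inventory report replaces millions of ListObjectsV2 calls.
inventory = spark.read.parquet(
    "s3a://inventory-destination-bucket/source-bucket/daily-inventory/data/")

listing = inventory.select("key", "size", "last_modified_date", "storage_class")
listing.show(20, truncate=False)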

Please tell us if it helps & how it goes.

Boris

From: Ben Kaylor 
Sent: Monday, 15 March 2021 21:10
To: Alchemist 
Cc: User 
Subject: Re: How to make bucket listing faster while using S3 with wholeTextFile

Not sure of the answer to this, but I am solving similar issues, so I'm looking
for additional feedback on how to do this.

My thought is that if this can't be done via Spark and S3 boto commands, then
have the apps self-report those changes: instead of having just the mappers
discover the keys, you have services report that a new key has been created or
modified to a metadata service, for incremental and more real-time updates.

Would like to hear more ideas on this, thanks
David



On Mon, Mar 15, 2021, 11:31 AM Alchemist 
mailto:alchemistsrivast...@gmail.com>> wrote:
How to optimize S3 file listing when using wholeTextFile(): We are using
wholeTextFile to read data from S3. As per my understanding, wholeTextFile
first lists the files under the given path. Since we are using S3 as the input
source, listing files in a bucket is single-threaded, and the S3 API for
listing keys only returns them in chunks of 1,000 per call. Since we have
millions of files, we are making thousands of API calls. This listing makes our
processing very slow. How can we make the S3 listing faster?

Thanks,

Rachana


Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there,

At risk of stating the obvious, the first step is to ensure that your Spark 
application and S3 bucket are colocated in the same AWS region.
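
For what it is worth, a quick boto3 sketch for checking where a bucket lives
(get_bucket_location returns None for us-east-1); the bucket name is a
placeholder:

import boto3

s3 = boto3.client("s3")
region = s3.get_bucket_location(Bucket="your-bucket")["LocationConstraint"]
print(region or "us-east-1")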

Steve C

On 16 Mar 2021, at 3:31 am, Alchemist 
mailto:alchemistsrivast...@gmail.com>> wrote:

How to optimize S3 file listing when using wholeTextFile(): We are using
wholeTextFile to read data from S3. As per my understanding, wholeTextFile
first lists the files under the given path. Since we are using S3 as the input
source, listing files in a bucket is single-threaded, and the S3 API for
listing keys only returns them in chunks of 1,000 per call. Since we have
millions of files, we are making thousands of API calls. This listing makes our
processing very slow. How can we make the S3 listing faster?

Thanks,

Rachana



Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Ben Kaylor
Not sure of the answer to this, but I am solving similar issues, so I'm looking
for additional feedback on how to do this.

My thought is that if this can't be done via Spark and S3 boto commands, then
have the apps self-report those changes: instead of having just the mappers
discover the keys, you have services report that a new key has been created or
modified to a metadata service, for incremental and more real-time updates.
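
As a sketch only of that self-reporting idea (the DynamoDB table name is a
placeholder; any queryable metadata store would do), the writer could record
each key at write time instead of relying on listing:

import datetime
import boto3

s3 = boto3.client("s3")
metadata = boto3.resource("dynamodb").Table("key-metadata")  # placeholder table

def put_and_report(bucket, key, body):
    # Write the object, then report the new/modified key to the metadata store.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    metadata.put_item(Item={
        "bucket": bucket,
        "key": key,
        "modified": datetime.datetime.utcnow().isoformat(),
    })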

Would like to hear more ideas on this, thanks
David




On Mon, Mar 15, 2021, 11:31 AM Alchemist 
wrote:

> *How to optimize S3 file listing when using wholeTextFile()*: We are using
> wholeTextFile to read data from S3. As per my understanding, wholeTextFile
> first lists the files under the given path. Since we are using S3 as the
> input source, listing files in a bucket is single-threaded, and the S3 API
> for listing keys only returns them in chunks of 1,000 per call. Since we
> have millions of files, we are making thousands of API calls. This listing
> makes our processing very slow. How can we make the S3 listing faster?
>
> Thanks,
>
> Rachana
>


How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Alchemist
How to optimize S3 file listing when using wholeTextFile(): We are using
wholeTextFile to read data from S3. As per my understanding, wholeTextFile
first lists the files under the given path. Since we are using S3 as the input
source, listing files in a bucket is single-threaded, and the S3 API for
listing keys only returns them in chunks of 1,000 per call. Since we have
millions of files, we are making thousands of API calls. This listing makes our
processing very slow. How can we make the S3 listing faster?
Thanks,
Rachana