[ https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614053#comment-17614053 ]
Jacob Wujciak-Jens commented on ARROW-17961:
--------------------------------------------

Things like readahead and metadata caching; cc [~lidavidm] for details.

> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
>                 Key: ARROW-17961
>                 URL: https://issues.apache.org/jira/browse/ARROW-17961
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Volker Lorrmann
>            Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3 using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}; see the example below.
> The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} is not (yet) using.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> from load_credentials import load_credentials
>
> credentials = load_credentials()
> path = "path/to/data"  # folder with about 300 small (~10kb) files
>
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
>
> _ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)