[ https://issues.apache.org/jira/browse/ARROW-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614053#comment-17614053 ]
Jacob Wujciak-Jens commented on ARROW-17961:
--------------------------------------------

Things like readahead and metadata caching; cc [~lidavidm] for details.

> Add read/write optimization for pyarrow.fs.S3FileSystem
> -------------------------------------------------------
>
>                 Key: ARROW-17961
>                 URL: https://issues.apache.org/jira/browse/ARROW-17961
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Volker Lorrmann
>            Priority: Minor
>
> I found large differences in loading time when loading data from AWS S3 using {{pyarrow.fs.S3FileSystem}} compared to {{s3fs.S3FileSystem}}; see the example below.
> The difference comes from optimizations in {{s3fs}} that {{pyarrow.fs}} is not (yet) using.
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> import pyarrow.fs as pafs
> import s3fs
> from load_credentials import load_credentials
>
> credentials = load_credentials()
> path = "path/to/data"  # folder with about 300 small (~10kb) files
>
> fs1 = s3fs.S3FileSystem(
>     anon=False,
>     key=credentials["accessKeyId"],
>     secret=credentials["secretAccessKey"],
>     token=credentials["sessionToken"],
> )
> fs2 = pafs.S3FileSystem(
>     access_key=credentials["accessKeyId"],
>     secret_key=credentials["secretAccessKey"],
>     session_token=credentials["sessionToken"],
> )
>
> _ = ds.dataset(path, filesystem=fs1).to_table()  # takes about 5 seconds
> _ = ds.dataset(path, filesystem=fs2).to_table()  # takes about 25 seconds
> _ = pq.read_table(path, filesystem=fs1)  # takes about 5 seconds
> _ = pq.read_table(path, filesystem=fs2)  # takes about 10 seconds
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)