[ https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-16272:
-----------------------------------
    Labels: S3FileSystem csv pandas pull-request-available s3  (was: S3FileSystem csv pandas s3)

> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16272
>                 URL: https://issues.apache.org/jira/browse/ARROW-16272
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.1, 5.0.0, 7.0.0
>         Environment: MacOS 12.1
> MacBook Pro
> Intel x86
>            Reporter: Sahil Gupta
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: S3FileSystem, csv, pandas, pull-request-available, s3
>             Fix For: 9.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> `pyarrow.fs.S3FileSystem.open_input_file` and
> `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used
> with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0003612041473388672
> Time to create fhandler:  0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> Getting similar performance with `fs.open_input_stream` as well (commented
> out in the code).
> {code}
> Running...
> Time to create fs:  0.0002570152282714844
> Time to create fhandler:  0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's
> much faster:
> {code:python}
> import pandas as pd
> import time
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> read time: 1.1012001037597656
> total time: 1.101264238357544
> {code}
> Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches `s3fs`
> performance:
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> from fsspec.implementations.arrow import ArrowFSWrapper
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = ArrowFSWrapper(
>         S3FileSystem(
>             anonymous=True,
>             region="us-east-2",
>             endpoint_override=None,
>             proxy_options=None,
>         )
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     fhandler = fs._open(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0002467632293701172
> Time to create fhandler:  0.1858382225036621
> read time: 0.13701486587524414
> total time: 0.3232450485229492
> {code}
> Packages:
> {code}
> pyarrow=7.0.0
> pandas : 1.4.2
> numpy : 1.20.3
> {code}
> I tested it with 4.0.1 and 5.0.0 as well and saw similar results.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
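
The timings in the report show that the handle is slow, but not what read pattern the consumer issues against it. One way to narrow this down might be to count the reads that actually reach the underlying stream, with and without a buffering layer in between. The following is only a sketch under stated assumptions: it uses an in-memory `io.BytesIO` as a stand-in for the S3 handle, and `CountingRaw` and `count_reads` are hypothetical helpers written for this illustration, not part of pyarrow, pandas, or fsspec. It does not claim to reproduce the cause of this issue.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical helper: proxies a binary stream and counts the reads it receives."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.calls = 0            # number of readinto() calls that reached this layer
        self.bytes_requested = 0  # total bytes asked for across those calls

    def readable(self):
        return True

    def readinto(self, b):
        self.calls += 1
        self.bytes_requested += len(b)
        data = self._raw.read(len(b))
        n = len(data)
        if n:
            b[:n] = data
        return n


def count_reads(payload, chunk_size, buffer_size=None):
    """Read `payload` in `chunk_size` pieces and return how many reads hit the raw layer.

    With buffer_size=None the consumer reads the raw stream directly;
    otherwise an io.BufferedReader of that size sits in between.
    """
    raw = CountingRaw(io.BytesIO(payload))
    stream = raw if buffer_size is None else io.BufferedReader(raw, buffer_size=buffer_size)
    while stream.read(chunk_size):
        pass
    return raw.calls


if __name__ == "__main__":
    payload = b"x" * (64 * 1024)
    print("unbuffered raw reads:", count_reads(payload, 1024))
    print("buffered raw reads:  ", count_reads(payload, 1024, buffer_size=16 * 1024))
```

If every raw read corresponds to a remote request, a large difference between the two counts would point at the read pattern rather than at the transport. The same counting wrapper could be placed around a real file handle, assuming that handle exposes a standard `read(n)` method.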