xu2011 commented on issue #34414: URL: https://github.com/apache/arrow/issues/34414#issuecomment-1455341709
I ran into a similar issue when using `pd.read_parquet` to read from S3. Read performance is slow compared to `pq.read_table().to_pandas()`, even though `pandas.read_parquet` uses the same function call underneath.

I found that the issue could be caused by [_ensure_filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2431) in the `_ParquetDataset` class, which [checks the object type of the given filesystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L99). When the check fails, it reconstructs the filesystem with [PyFileSystem](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/fs.py#L119), which slows down the read.

With `pq.read_table().to_pandas()`, it [parses the filesystem from the S3 path](https://github.com/apache/arrow/blob/d5b3b4737838774db658d3c488fcd3e72bc13f7e/python/pyarrow/parquet/core.py#L2459) and gets `pyarrow._s3fs.S3FileSystem`.

```
import s3fs
from pyarrow.fs import FileSystem, _ensure_filesystem

# Placeholder: substitute the S3 URI of the Parquet file being read.
s3_path = "s3://bucket/path/to/file.parquet"


def _ensure_filesystem_checkinstance():
    # An s3fs (fsspec) filesystem is not a pyarrow FileSystem instance.
    s3 = s3fs.S3FileSystem()
    print(isinstance(s3, FileSystem))


def fs_pandas():
    # Passing the s3fs filesystem through _ensure_filesystem wraps it.
    s3 = s3fs.S3FileSystem()
    fs = _ensure_filesystem(s3)
    print(fs)


def fs_pq():
    # Inferring the filesystem from the URI yields pyarrow's native one.
    filesystem, path_or_paths = FileSystem.from_uri(s3_path)
    print(filesystem)


_ensure_filesystem_checkinstance()
fs_pandas()
fs_pq()
```

Result

```
_ensure_filesystem_checkinstance()
False
fs_pandas()
pyarrow._fs.FileSystem
fs_pq()
pyarrow._s3.S3FileSystem
```

I don't think this is the expected behavior, and I suggest reopening the issue.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
