Vedant Roy created ARROW-18076:
----------------------------------

             Summary: PyArrow cannot read from R2 (Cloudflare's S3)
                 Key: ARROW-18076
                 URL: https://issues.apache.org/jira/browse/ARROW-18076
             Project: Apache Arrow
          Issue Type: Bug
         Environment: Ubuntu 20
            Reporter: Vedant Roy

When using PyArrow to read Parquet data (as part of the Ray project), I get the following stack trace:

```
(_sample_piece pid=49818) Traceback (most recent call last):
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
(_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
(_sample_piece pid=49818)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 446, in _sample_piece
(_sample_piece pid=49818)     batch = next(batches)
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
(_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
(_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
(_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, Transferred a partial file
```

I do not get this error when reading the exact same data from Amazon S3.

The error is raised from this line in Ray:
https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446
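
For anyone trying to reproduce this outside Ray, here is a minimal sketch of how a PyArrow dataset might be pointed at R2. It is not the exact setup from the report: the helper names `r2_filesystem_kwargs` and `read_first_batch`, the account ID, the credentials, and the bucket path are all hypothetical placeholders. It assumes R2's per-account S3 endpoint and the `auto` region, and it pulls a single record batch, similar to the `next(batches)` call in the traceback.

```python
from typing import Any, Dict


def r2_filesystem_kwargs(account_id: str, access_key: str, secret_key: str) -> Dict[str, Any]:
    """Build keyword arguments for pyarrow.fs.S3FileSystem aimed at an R2 endpoint.

    Hypothetical helper for illustration; not part of PyArrow or Ray.
    """
    return {
        "access_key": access_key,
        "secret_key": secret_key,
        # Assumption: R2 serves its S3-compatible API on a per-account host.
        "endpoint_override": f"https://{account_id}.r2.cloudflarestorage.com",
        # Assumption: R2 has no AWS-style regions, so "auto" is passed instead.
        "region": "auto",
    }


def read_first_batch(bucket_path: str, fs_kwargs: Dict[str, Any]):
    """Open a Parquet dataset at bucket_path and return its first record batch."""
    # Imported lazily so the helper above works even without pyarrow installed.
    import pyarrow.dataset as ds
    from pyarrow import fs

    filesystem = fs.S3FileSystem(**fs_kwargs)
    dataset = ds.dataset(bucket_path, format="parquet", filesystem=filesystem)
    # Pull one batch; this is the kind of read that fails in the report.
    return next(iter(dataset.to_batches()))


if __name__ == "__main__":
    # Placeholder values only; substitute a real account ID and credentials.
    kwargs = r2_filesystem_kwargs("ACCOUNT_ID", "ACCESS_KEY", "SECRET_KEY")
    print(kwargs["endpoint_override"])
    # batch = read_first_batch("my-bucket/path/to/data", kwargs)  # needs network access
```

Running the same `read_first_batch` call with default `S3FileSystem` arguments against an Amazon S3 copy of the data would show whether the failure is specific to the R2 endpoint.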