hutch3232 opened a new issue, #44119: URL: https://github.com/apache/arrow/issues/44119
### Describe the bug, including details regarding any error messages, version, and platform.

A while back I opened this issue: https://github.com/pandas-dev/pandas/issues/57449, mistakenly thinking the `pandas` parquet reader wasn't working quite right because of the odd error thrown. Recent testing has shown it is probably an issue with `pyarrow`. Testing was done on Ubuntu 20.04 with `pyarrow` 17.0.0.

I am using an on-prem S3-compatible storage provider. This means `AWS_REGION` is irrelevant, but `AWS_ENDPOINT_URL` is important. I have defined `endpoint_url` in `~/.aws/config` under the correct profile per the specification here: https://aws.amazon.com/blogs/developer/new-improved-flexibility-when-configuring-endpoint-urls-with-the-aws-sdks-and-tools/

```python
import os

import pyarrow.parquet as pq

os.environ["AWS_PROFILE"] = "my-bucket-role"

tbl = pq.read_table("s3://my-bucket/my-parquet")
```

```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 14
---> 14 tbl = pq.read_table("s3://my-bucket/my-parquet")

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
   1787     warnings.warn(
   1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 "
   1789         "and will be removed in a future version.",
   1790         FutureWarning, stacklevel=2)
   1792 try:
-> 1793     dataset = ParquetDataset(
   1794         source,
   1795         schema=schema,
   1796         filesystem=filesystem,
   1797         partitioning=partitioning,
   1798         memory_map=memory_map,
   1799         read_dictionary=read_dictionary,
   1800         buffer_size=buffer_size,
   1801         filters=filters,
   1802         ignore_prefixes=ignore_prefixes,
   1803         pre_buffer=pre_buffer,
   1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1805         decryption_properties=decryption_properties,
   1806         thrift_string_size_limit=thrift_string_size_limit,
   1807         thrift_container_size_limit=thrift_container_size_limit,
   1808         page_checksum_verification=page_checksum_verification,
   1809     )
   1810 except ImportError:
   1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1812     # module is not available
   1813     if filters is not None:

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1344, in ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, use_legacy_dataset)
   1341 if filesystem is None:
   1342     # path might be a URI describing the FileSystem as well
   1343     try:
-> 1344         filesystem, path_or_paths = FileSystem.from_uri(
   1345             path_or_paths)
   1346     except ValueError:
   1347         filesystem = LocalFileSystem(use_mmap=memory_map)

File /opt/conda/lib/python3.9/site-packages/pyarrow/_fs.pyx:477, in pyarrow._fs.FileSystem.from_uri()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: When resolving region for bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 28, Timeout was reached
```

If I also specify `AWS_ENDPOINT_URL` as an environment variable, it does work.
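For reference, the kind of `~/.aws/config` entry I mean looks roughly like this (profile name and endpoint are the placeholder values from the snippets here):

```ini
[profile my-bucket-role]
endpoint_url = https://my-endpoint.com
```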
```python
os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com"

tbl = pq.read_table("s3://my-bucket/my-parquet")
# no error
```

I think `pyarrow` should read `endpoint_url` from `~/.aws/config` when it exists and `AWS_ENDPOINT_URL` is not specified, per: https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-endpoints.html. That would avoid this confusing error about the region, and it would also be convenient not to have to set an additional environment variable.

### Component(s)

Parquet, Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
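In the meantime, a possible workaround (just a sketch, not `pyarrow`'s own behavior) is to read `endpoint_url` out of `~/.aws/config` manually and pass it to `pyarrow.fs.S3FileSystem` via `endpoint_override`, which also bypasses the `HeadBucket`-based region lookup that times out above. The profile and bucket names are the placeholders from this report, and the helper only handles a top-level `endpoint_url` key, not service-specific settings under a nested `services` section:

```python
import configparser
import os


def endpoint_url_from_aws_config(profile, config_path="~/.aws/config"):
    """Return the endpoint_url set for `profile` in the AWS config file, or None.

    Sketch only: does not handle the nested `services` sections that the
    AWS config format also allows for per-service endpoints.
    """
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser(config_path))
    # Non-default profiles appear as "[profile <name>]" sections in ~/.aws/config.
    section = "default" if profile == "default" else f"profile {profile}"
    # `fallback=None` also covers a missing section or missing key.
    return parser.get(section, "endpoint_url", fallback=None)


# Usage against the bucket from the report (needs pyarrow and network access):
#
#   import pyarrow.parquet as pq
#   from pyarrow import fs
#
#   endpoint = endpoint_url_from_aws_config("my-bucket-role")
#   s3 = fs.S3FileSystem(endpoint_override=endpoint)
#   # Note: no "s3://" scheme when a filesystem object is passed explicitly.
#   tbl = pq.read_table("my-bucket/my-parquet", filesystem=s3)
```

Passing the filesystem explicitly sidesteps `FileSystem.from_uri`, which is the code path in the traceback that tries to resolve the bucket's region.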
