hutch3232 opened a new issue, #44119:
URL: https://github.com/apache/arrow/issues/44119

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   A while back I opened this issue: 
https://github.com/pandas-dev/pandas/issues/57449, mistakenly thinking the 
`pandas` parquet reader wasn't working correctly because of the odd error it 
threw. Recent testing suggests the problem is actually in `pyarrow`.
   
   Testing was done on Ubuntu 20.04 with `pyarrow` 17.0.0. I am using an 
on-prem S3-compatible storage provider, so `AWS_REGION` is irrelevant, but 
`AWS_ENDPOINT_URL` is important.
   
   I have defined `endpoint_url` in `~/.aws/config` under the correct profile 
per the specification here:
   
https://aws.amazon.com/blogs/developer/new-improved-flexibility-when-configuring-endpoint-urls-with-the-aws-sdks-and-tools/
   
   ```python
   import os
   import pyarrow.parquet as pq
   
   os.environ["AWS_PROFILE"] = "my-bucket-role"
   
   tbl = pq.read_table("s3://my-bucket/my-parquet")
   
   ---------------------------------------------------------------------------
   OSError                                   Traceback (most recent call last)
   Cell In[1], line 14
   ---> 14 tbl = pq.read_table("s3://my-bucket/my-parquet")
   
   File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in 
read_table(source, columns, use_threads, schema, use_pandas_metadata, 
read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, 
use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, 
decryption_properties, thrift_string_size_limit, thrift_container_size_limit, 
page_checksum_verification)
      1787     warnings.warn(
      1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 
15.0.0 "
      1789         "and will be removed in a future version.",
      1790         FutureWarning, stacklevel=2)
      1792 try:
   -> 1793     dataset = ParquetDataset(
      1794         source,
      1795         schema=schema,
      1796         filesystem=filesystem,
      1797         partitioning=partitioning,
      1798         memory_map=memory_map,
      1799         read_dictionary=read_dictionary,
      1800         buffer_size=buffer_size,
      1801         filters=filters,
      1802         ignore_prefixes=ignore_prefixes,
      1803         pre_buffer=pre_buffer,
      1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
      1805         decryption_properties=decryption_properties,
      1806         thrift_string_size_limit=thrift_string_size_limit,
      1807         thrift_container_size_limit=thrift_container_size_limit,
      1808         page_checksum_verification=page_checksum_verification,
      1809     )
      1810 except ImportError:
      1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
      1812     # module is not available
      1813     if filters is not None:
   
   File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1344, in 
ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, 
read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, 
pre_buffer, coerce_int96_timestamp_unit, decryption_properties, 
thrift_string_size_limit, thrift_container_size_limit, 
page_checksum_verification, use_legacy_dataset)
      1341 if filesystem is None:
      1342     # path might be a URI describing the FileSystem as well
      1343     try:
   -> 1344         filesystem, path_or_paths = FileSystem.from_uri(
      1345             path_or_paths)
      1346     except ValueError:
      1347         filesystem = LocalFileSystem(use_mmap=memory_map)
   
   File /opt/conda/lib/python3.9/site-packages/pyarrow/_fs.pyx:477, in 
pyarrow._fs.FileSystem.from_uri()
   
   File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:155, in 
pyarrow.lib.pyarrow_internal_check_status()
   
   File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:92, in 
pyarrow.lib.check_status()
   
   OSError: When resolving region for bucket 'my-bucket': AWS Error 
NETWORK_CONNECTION during HeadBucket operation: curlCode: 28, Timeout was 
reached
   ```
   
   If I also set `AWS_ENDPOINT_URL` as an environment variable, it does work.
   ```python
   os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com"
   tbl = pq.read_table("s3://my-bucket/my-parquet") # no error
   ```
   
   I think `pyarrow` should read `endpoint_url` from `~/.aws/config` when it 
exists and `AWS_ENDPOINT_URL` is not set, per: 
https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-endpoints.html. That 
would avoid this confusing error about the region, and it would also be 
convenient not to have to set an additional environment variable.
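   
   To illustrate the behavior I'm asking for, here is roughly what the lookup could do on the user side today (a sketch only: the `resolve_endpoint_url` helper is hypothetical, and it ignores the per-service `services` sections that the AWS shared-config spec also allows):

```python
import configparser
import os
import tempfile
from typing import Optional


def resolve_endpoint_url(config_path: str, profile: str) -> Optional[str]:
    """Return endpoint_url for a profile in an AWS shared config file.

    Hypothetical helper: it only reads the profile-level endpoint_url
    key, not the per-service `services` sections the spec also allows.
    """
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # Non-default profiles live in "[profile <name>]" sections in ~/.aws/config.
    section = profile if profile == "default" else f"profile {profile}"
    return parser.get(section, "endpoint_url", fallback=None)


# Minimal stand-in for ~/.aws/config with a placeholder endpoint.
with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
    f.write("[profile my-bucket-role]\n"
            "endpoint_url = https://my-endpoint.com\n")
    config_path = f.name

url = resolve_endpoint_url(config_path, "my-bucket-role")
if url is not None and "AWS_ENDPOINT_URL" not in os.environ:
    # Must happen before the S3 filesystem is created.
    os.environ["AWS_ENDPOINT_URL"] = url
os.unlink(config_path)
```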
   
   ### Component(s)
   
   Parquet, Python

