raulcd commented on issue #38536:
URL: https://github.com/apache/arrow/issues/38536#issuecomment-4779385967
I am working on moving S3 into its own library, as seen in the PR above. I
wasn't sure where to post these thoughts, so I am sharing here some
experiments I am currently running:
I've manually removed `libarrow_s3.so` from a local pyarrow build:
```python
$ python
Python 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow._s3fs
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import pyarrow._s3fs
ImportError: libarrow_s3.so.2500: cannot open shared object file: No such file or directory
>>>
```
I then created a minimal wheel containing *only* `libarrow_s3.so.2500` and a
minimal `__init__.py` that loads the shared library:
```python
$ python
Python 3.13.12 (main, Feb 4 2026, 15:06:39) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow_s3
>>> import pyarrow._s3fs
>>> import pyarrow.fs as fs
>>> fs.FileSystem.from_uri("s3://bucket/key")
(<pyarrow._s3fs.S3FileSystem object at 0x7fa3b05a3330>, 'bucket/key')
```
This UX should change: users shouldn't have to import `pyarrow_s3`
explicitly. When `import pyarrow.fs` tries to import the S3 filesystem, it
should register it automatically by importing `pyarrow_s3`.
The current `__init__.py` hardcodes the Linux library name but can be updated
to be portable across operating systems:
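As a minimal sketch of that automatic registration (not what pyarrow does today): the helper below first tries to import an optional companion package and then imports the target module. The name `import_with_companion` is hypothetical, and where such a hook would live inside `pyarrow.fs` is an assumption.

```python
import importlib


def import_with_companion(target, companion):
    """Import `target`, first trying to import an optional `companion` package.

    Hypothetical helper: the companion's __init__.py is expected to load the
    shared library that `target` links against.
    """
    try:
        importlib.import_module(companion)
    except ImportError:
        # Companion wheel is not installed. The import below then succeeds
        # or fails on its own, exactly as it would without this helper.
        pass
    return importlib.import_module(target)


# Assumed usage inside pyarrow.fs:
# _s3fs = import_with_companion("pyarrow._s3fs", "pyarrow_s3")
```

With this shape, a missing `pyarrow_s3` wheel still surfaces as the usual `ImportError` from `pyarrow._s3fs`, so the error users see would not change.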
```python
import os
import ctypes
# ensure libarrow.so.2500 is loaded first (libarrow_s3 depends on it)
import pyarrow
ctypes.CDLL(
    os.path.join(os.path.dirname(__file__), "libarrow_s3.so.2500"),
    mode=ctypes.RTLD_GLOBAL,
)
```
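A rough sketch of what a portable version could look like, assuming the usual per-platform naming (`libarrow_s3.so.2500` on Linux, `libarrow_s3.2500.dylib` on macOS, `arrow_s3.dll` on Windows). Those file names and both helper names are assumptions for illustration and would need to be checked against what the wheel build actually produces:

```python
import ctypes
import os
import sys


def shared_library_name(stem, so_version, platform=None):
    """Return the assumed shared library file name for the given platform."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        # Assumption: Windows DLLs carry no "lib" prefix and no SO version.
        return f"{stem}.dll"
    if platform == "darwin":
        # Assumption: macOS places the SO version before the extension.
        return f"lib{stem}.{so_version}.dylib"
    # Linux and other ELF platforms.
    return f"lib{stem}.so.{so_version}"


def load_bundled_library(package_dir, stem, so_version):
    """Load the library bundled next to __init__.py with global symbols."""
    path = os.path.join(package_dir, shared_library_name(stem, so_version))
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


# Assumed usage inside pyarrow_s3/__init__.py, after `import pyarrow`:
# load_bundled_library(os.path.dirname(__file__), "arrow_s3", "2500")
```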
On versioning, we could pin the required pyarrow version in the package
metadata so that the wheels are tied together. All the wheels we provide must
have matching versions; otherwise it gets messy for users once we extend this
approach to other wheels such as `pyarrow_s3`, `pyarrow_flight` and
`pyarrow_gcs`. We could expose pyarrow extras via optional-dependencies, which
gives a better UX and tighter version matching, like:
```toml
[project.optional-dependencies]
s3 = ["pyarrow-s3==25.0.0"]
gcs = ["pyarrow-gcs==25.0.0"]
flight = ["pyarrow-flight==25.0.0"]
```
The optional wheels should also match and pin the version `pyarrow==25.0.0`.
Basically we are making those extra wheels just a wrapper for a single `.so`.
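Besides the metadata pins, the companion package could also verify the match at import time. This is only a sketch of that idea, not part of the experiment above: `check_companion` is a hypothetical helper, and the distribution names `pyarrow` and `pyarrow-s3` are assumed to be the installed names.

```python
from importlib import metadata


def check_companion(core="pyarrow", companion="pyarrow-s3",
                    get_version=metadata.version):
    """Raise ImportError unless both distributions have the same version.

    `get_version` defaults to importlib.metadata.version and is only a
    parameter so the check can be exercised without installed wheels.
    """
    core_version = get_version(core)
    companion_version = get_version(companion)
    if core_version != companion_version:
        raise ImportError(
            f"{companion} {companion_version} requires "
            f"{core}=={companion_version}, but {core} {core_version} "
            "is installed"
        )
    return core_version
```

Such a check would turn a mismatched install into a readable error instead of a symbol lookup failure when the `.so` is loaded.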
This was tested against a local build on this branch, not yet with two
auditwheel/delvewheel-repaired wheels in a clean venv. That is the next thing
I plan to validate.
To be fair, this is trying to extend the current, proven conda model to
separate wheels.
Caveats:
- Upgrade path for users. The exact `==` pins make upgrades all-or-nothing for
users. This is not too different from what we have today, since we ship a
single wheel, but it is worth noting.
- What happens when external packages pin a version of Arrow? The exact pins
might complicate things for users and library maintainers.
- Added complexity of managing all the extra wheels.