raulcd commented on issue #38536:
URL: https://github.com/apache/arrow/issues/38536#issuecomment-4779385967

   I am working on moving S3 into its own library, as seen in the PR above. I 
wasn't sure where to put these thoughts, so I am posting here some 
experiments I am currently doing:
   I've manually removed `libarrow_s3.so` from a local pyarrow build:
   ```python
   $ python
   Python 3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow._s3fs
   Traceback (most recent call last):
     File "<python-input-0>", line 1, in <module>
       import pyarrow._s3fs
   ImportError: libarrow_s3.so.2500: cannot open shared object file: No such 
file or directory
   >>> 
   ```
   I then created a minimal wheel containing *only* `libarrow_s3.so.2500` and a 
minimal `__init__.py` that loads the shared library:
   ```python
   $ python
   Python 3.13.12 (main, Feb  4 2026, 15:06:39) [GCC 15.2.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow_s3
   >>> import pyarrow._s3fs
   >>> import pyarrow.fs as fs
   >>> fs.FileSystem.from_uri("s3://bucket/key")
   (<pyarrow._s3fs.S3FileSystem object at 0x7fa3b05a3330>, 'bucket/key')
   ``` 
   
   This UX should change: users shouldn't have to explicitly import 
`pyarrow_s3`. When `import pyarrow.fs` tries to import S3 support, it should 
register it automatically by importing `pyarrow_s3`.
   
   The current `__init__.py` is hardcoded but can be updated to make it 
portable and cross-OS:
   ```python
   import os
   import ctypes
   # ensure libarrow.so.2500 is loaded first (libarrow_s3 depends on it)
   import pyarrow
   
   ctypes.CDLL(
       os.path.join(os.path.dirname(__file__), "libarrow_s3.so.2500"),
       mode=ctypes.RTLD_GLOBAL,
   )
   ```
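
   One possible shape for the portable version, as a rough sketch. The macOS 
and Windows file names below are assumptions based on how the core Arrow 
library is usually named on those platforms, and the helper names are 
hypothetical; none of this has been validated against repaired wheels:

```python
import ctypes
import os
import sys


def s3_library_filename(platform=sys.platform, so_version="2500"):
    """Return the assumed shared-library file name for a platform."""
    if platform.startswith("win"):
        return "arrow_s3.dll"  # assumption: unversioned DLL name
    if platform == "darwin":
        return f"libarrow_s3.{so_version}.dylib"  # assumption
    return f"libarrow_s3.so.{so_version}"  # matches the Linux build above


def load_s3_library(package_dir):
    """Load the library that sits next to the package's __init__.py."""
    path = os.path.join(package_dir, s3_library_filename())
    if sys.platform.startswith("win"):
        return ctypes.CDLL(path)
    # RTLD_GLOBAL so pyarrow._s3fs can resolve the symbols afterwards.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

   The `__init__.py` would still `import pyarrow` first and then call 
`load_s3_library(os.path.dirname(__file__))`.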
   On versioning, we could use the package metadata to pin the required 
pyarrow version and tie the wheels together. All provided wheels must have 
matching versions; otherwise it gets messy for users when we extend this 
functionality to other wheels like `pyarrow_s3`, `pyarrow_flight`, 
`pyarrow_gcs`. We could expose pyarrow extras via optional-dependencies, for a 
better UX and tighter version matching, like:
   ```toml
   [project.optional-dependencies]
   s3     = ["pyarrow-s3==25.0.0"]
   gcs    = ["pyarrow-gcs==25.0.0"]
   flight = ["pyarrow-flight==25.0.0"]
   ```
   Each optional wheel should in turn pin the matching version, 
`pyarrow==25.0.0`.
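
   On top of the metadata pins, a companion wheel could also verify the match 
at import time. This is only a sketch of that idea: the function name and the 
distribution name `pyarrow-s3` are assumptions, and the metadata pins alone 
may well be enough:

```python
from importlib.metadata import version


def check_same_version(core="pyarrow", companion="pyarrow-s3",
                       get_version=version):
    """Raise ImportError unless both distributions report the same version.

    get_version is injectable so the check can be exercised without the
    wheels being installed.
    """
    core_v = get_version(core)
    companion_v = get_version(companion)
    if core_v != companion_v:
        raise ImportError(
            f"{companion}=={companion_v} does not match {core}=={core_v}"
        )
    return core_v
```

   A mismatch would then fail early with a readable message instead of a 
loader error about a missing `.so`.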
   
   Basically we are making those extra wheels just a wrapper for a single `.so`.
   
   This was tested against a local build on this branch, not yet with two 
auditwheel/delvewheel-repaired wheels in a clean venv; that's the next thing I 
plan to validate.
   
   To be fair, this is trying to extend the current, proven conda model to 
separate wheels.
   
   Caveats:
   - Upgrade path for users. The exact `==` pins make it all-or-nothing for 
users. That's not too different from what we have today, since we ship a single 
wheel, but it's worth noting.
   - What happens when other external dependencies pin versions of arrow? The 
exact pins might complicate things for users and library maintainers.
   - Added complexity of managing all the extra wheels.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
