orf opened a new issue, #40557:
URL: https://github.com/apache/arrow/issues/40557

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Running the following snippet shows that `open_output_stream()` initiates a 
multipart upload immediately, before anything is written.
   
   This is quite unexpected: I would expect the `buffer_size` argument to ensure that a multipart upload is not initiated until at least 1,000 bytes have been written. The issue with the current behaviour is that writing a single byte results in three requests to S3: one to create the multipart upload, one to upload the 1-byte part, and one to complete the multipart upload.
   
   This is very inefficient if you are writing a small file to S3, where a simple PutObject (without multipart uploading) would suffice. Using `background_writes=False` and `fs.copy_files(...)` with a local, "known-sized" small file also results in a multipart upload (sketched below).
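
   For reference, the `copy_files` reproduction looks roughly like the following; the local path and bucket name are placeholders:
   
   ```python
   from pyarrow import fs
   
   # Placeholder paths; the source file is small and its size is known up
   # front, yet the destination is still written via a multipart upload.
   local = fs.LocalFileSystem()
   s3 = fs.S3FileSystem(background_writes=False)
   
   fs.copy_files(
       "/tmp/small_file",
       "a_bucket/small_file",
       source_filesystem=local,
       destination_filesystem=s3,
   )
   ```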
   
   While this behaviour keeps the implementation simple, it is surprising, and I couldn't find [it described anywhere in the documentation](https://arrow.apache.org/docs/python/filesystems.html).
   
   ```python
   import time
   
   from pyarrow import fs
   
   # Enable AWS SDK debug logging so each S3 request is visible.
   fs.initialize_s3(fs.S3LogLevel.Debug)
   
   sfs = fs.S3FileSystem()
   # The debug log shows CreateMultipartUpload as soon as the stream is
   # opened, before anything has been written.
   with sfs.open_output_stream("a_bucket/test", buffer_size=1000):
       time.sleep(10)
   ```
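   
   For comparison, a sketch of what a single-request upload of a small object looks like, using boto3's plain PutObject (bucket, key, and credentials assumed here, and boto3 is only used as an illustration, not part of pyarrow):
   
   ```python
   import boto3
   
   # One request instead of three: a small object can be uploaded with a
   # single PutObject call rather than a multipart upload.
   s3 = boto3.client("s3")
   s3.put_object(Bucket="a_bucket", Key="test", Body=b"x")
   ```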
   
   ### Component(s)
   
   Python

