potiuk edited a comment on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-899055233


   > Is there any advantage on saving the file locally in a temporary manner? I 
am wondering if it makes sense to just change the way it uploads the file to S3 
without giving the option to store the temporary file in local system
   
   I think the main reason is the implementation details of `upload_fileobj`. 
It's not really obvious how the data is buffered while `upload_fileobj` runs, so 
there might be significant memory usage during this operation. But the main 
reason is that, from what I see in the description of `upload_fileobj`, whenever 
possible it will use multiple threads and upload the S3 object in parallel, 
which (I know for a fact) can speed up the S3 upload immensely (this is how S3 
multipart upload is designed). However (my guess, but quite likely), this cannot 
be done if the "fileobj" does not provide `seek()` functionality. Looking at how 
the SFTP get is implemented, its fileobj does not allow seeking; it can only 
read the file sequentially (this is how the SFTP protocol works, I believe). It 
could only provide `seek()` if it loaded the file entirely into memory first 
(which would not be good for huge files).
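   To make the distinction concrete, here is a small stdlib-only sketch (the 
`SequentialStream` class is a hypothetical stand-in for an SFTP read stream, not 
the real paramiko object): a local file object is seekable, so an uploader can 
jump around and read parts in parallel, while a sequential network stream 
advertises `seekable() == False` and rejects `seek()` outright.

```python
import io

class SequentialStream(io.RawIOBase):
    """Toy stand-in for an SFTP read stream: data can only be read in order."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        # Like a network stream: no random access, so no parallel part reads.
        return False

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[: len(chunk)] = chunk
        return len(chunk)

stream = SequentialStream(b"payload")
print(stream.seekable())          # a sequential stream is not seekable
print(io.BytesIO(b"payload").seekable())  # an in-memory/local file is

try:
    stream.seek(0)
except io.UnsupportedOperation:
    print("seek() refused, as expected for a sequential stream")
```

   An uploader that wants to split the object into parts and push them from 
multiple threads needs that `seek()` capability; with only sequential reads it 
has to fall back to a single-threaded streaming upload.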
   
   So if you have a fast (local-network) SFTP connection, downloading the file 
first and then uploading the local file might significantly speed up the 
transfer, as `upload_fileobj` will be able to utilise multiple threads for the 
upload. That's mostly an educated guess, but I think it's very likely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

