Hello Airflow community, I'm interested in transferring data between S3 and Google Cloud Storage, on the scale of hundreds of gigabytes to a few terabytes.
Airflow already has an operator that could be used for this use case: the S3ToGoogleCloudStorageOperator. However, looking over its implementation, it appears that all of the data being transferred actually passes through the machine running Airflow. That seems unnecessary to me: it places a heavy burden on the Airflow workers, is bottlenecked by the workers' network bandwidth, and can even lead to out-of-disk errors like this one <https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>.

I would much rather use Google Cloud's 'Transfer Service' for this. That way the Airflow operator only needs to make an API call to create the transfer job and (optionally) keep polling the API until the transfer is done (this last bit could live in a sensor; a rough sketch of both calls is at the end of this message). The heavy work of performing the transfer is offloaded to the Transfer Service.

Was it an intentional design decision to avoid using the Google Transfer Service? If I create a PR that adds the ability to perform transfers with the Transfer Service, should it
- replace the existing operator,
- be an option on the existing operator (i.e., add an argument that toggles between 'local worker transfer' and 'Google-hosted transfer'), or
- be a new operator?

Thanks,
Conrad Lee
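
P.S. To make the proposal a bit more concrete, here is a very rough sketch (not a working implementation) of the two calls involved: creating a transfer job and polling its operations. It assumes the discovery-based google-api-python-client ('storagetransfer', v1) with application default credentials; the function names and parameters below are just placeholders, not a proposed operator interface.

import datetime
import json

from googleapiclient.discovery import build  # relies on application default credentials


def create_s3_to_gcs_transfer(project_id, s3_bucket, gcs_bucket,
                              aws_access_key_id, aws_secret_access_key):
    """Submit a one-time S3 -> GCS job to the Storage Transfer Service."""
    client = build("storagetransfer", "v1")
    today = datetime.date.today()
    transfer_job = {
        "description": "airflow-triggered s3-to-gcs transfer",
        "status": "ENABLED",
        "projectId": project_id,
        # start date == end date makes this a one-time transfer
        "schedule": {
            "scheduleStartDate": {"year": today.year, "month": today.month, "day": today.day},
            "scheduleEndDate": {"year": today.year, "month": today.month, "day": today.day},
        },
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }
    # Returns the created job, including its server-assigned "name"
    return client.transferJobs().create(body=transfer_job).execute()


def transfer_job_done(project_id, job_name):
    """The kind of check a sensor's poke() could run until it returns True."""
    client = build("storagetransfer", "v1")
    job_filter = json.dumps({"project_id": project_id, "job_names": [job_name]})
    response = client.transferOperations().list(
        name="transferOperations", filter=job_filter).execute()
    operations = response.get("operations", [])
    return bool(operations) and all(op.get("done", False) for op in operations)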