ihji commented on pull request #15105:
URL: https://github.com/apache/beam/pull/15105#issuecomment-898768860


   Is there any reason you want to download remote packages from the SDK
harness? From the Dataflow runner's perspective, I personally don't see much
benefit, since downloading from third-party services (Azure, AWS, PyPI, Maven,
etc.) every time the SDK harness boots up seems vulnerable to network
instability or third-party service failures. A Dataflow job would fail when
third-party services are unavailable even if all GCP services are green.
   
   The SDK harness won't download anything itself based on
`extra_packages.txt`. All package files listed in `extra_packages.txt` should
already exist in the Docker container's staging location when pip installs them.
To implement deferred remote package download, you would need to:
   1. create URL artifact information for the remote artifacts and add only the
local file names (matching the `staging_to` names from the URL artifact
information) to `extra_packages.txt` (see the sketch after this list);
   2. make sure that the URL artifact information is passed through to the SDK
harness without being materialized during job submission;
   3. extend `materialize.go` to support non-GCS URL artifact information.
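   For step 1, here is a minimal sketch of what URL artifact information could
look like on the submission side. It uses the `ArtifactInformation`,
`ArtifactUrlPayload`, and `ArtifactStagingToRolePayload` messages and the
standard URNs from `beam_runner_api.proto`; the helper function, example URL,
and file name are made up for illustration, and the actual wiring into the
staging code (steps 2 and 3) would still be needed.

```python
# Sketch only: building a URL-typed artifact that the harness could fetch
# later, instead of materializing the package during job submission.
from apache_beam.portability.api import beam_runner_api_pb2

# Standard URNs defined in beam_runner_api.proto.
URL_ARTIFACT_TYPE_URN = 'beam:artifact:type:url:v1'
STAGING_TO_ROLE_URN = 'beam:artifact:role:staging_to:v1'


def url_artifact(url, staged_name, sha256=''):
  """Builds ArtifactInformation pointing at a remote package.

  `staged_name` is the local file name the SDK harness would write into the
  staging directory after fetching the URL; only that name would be listed
  in extra_packages.txt.
  """
  return beam_runner_api_pb2.ArtifactInformation(
      type_urn=URL_ARTIFACT_TYPE_URN,
      type_payload=beam_runner_api_pb2.ArtifactUrlPayload(
          url=url, sha256=sha256).SerializeToString(),
      role_urn=STAGING_TO_ROLE_URN,
      role_payload=beam_runner_api_pb2.ArtifactStagingToRolePayload(
          staged_name=staged_name).SerializeToString())


# Hypothetical example: the harness would fetch this URL at boot-up and stage
# it as 'my_extra_pkg-1.0.tar.gz' before pip install runs.
artifact = url_artifact(
    'https://example.com/packages/my_extra_pkg-1.0.tar.gz',
    staged_name='my_extra_pkg-1.0.tar.gz')
```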
   
   Also, please note that Dataflow uses some Google-internal Python SDK harness
boot-up code, so this PR cannot be merged (at least for a few months) until
Dataflow has fully migrated to the public Python SDK harness container.

