NayanMathur411 commented on code in PR #37360:
URL: https://github.com/apache/beam/pull/37360#discussion_r2731503510


##########
sdks/python/apache_beam/runners/portability/stager.py:
##########
@@ -741,20 +752,25 @@ def _populate_requirements_cache(
     # the requirements file and will download package dependencies.
 
     # The apache-beam dependency  is excluded from requirements cache 
population
-    # because we  stage the SDK separately.
+    # because we stage the SDK separately.
     with tempfile.TemporaryDirectory() as temp_directory:
       tmp_requirements_filepath = Stager._remove_dependency_from_requirements(
           requirements_file=requirements_file,
           dependency_to_remove='apache-beam',
           temp_directory_path=temp_directory)
 
+      # Download to a temporary directory first, then copy to cache.
+      # This allows us to track exactly which packages are needed for this
+      # requirements file.
+      download_dir = tempfile.mkdtemp(dir=temp_directory)
+
       cmd_args = [
           Stager._get_python_executable(),
           '-m',
           'pip',
           'download',
           '--dest',
-          cache_dir,
+          download_dir,

Review Comment:
   Great point, yes, this should address the broader goal of using cached 
directory of re-downloading the packages. I have added a `--find-links` 
argument which checks for the packages which exists in cache, and only download 
the new ones.



##########
sdks/python/apache_beam/runners/portability/stager.py:
##########
@@ -741,20 +752,25 @@ def _populate_requirements_cache(
     # the requirements file and will download package dependencies.
 
     # The apache-beam dependency  is excluded from requirements cache 
population
-    # because we  stage the SDK separately.
+    # because we stage the SDK separately.
     with tempfile.TemporaryDirectory() as temp_directory:
       tmp_requirements_filepath = Stager._remove_dependency_from_requirements(
           requirements_file=requirements_file,
           dependency_to_remove='apache-beam',
           temp_directory_path=temp_directory)
 
+      # Download to a temporary directory first, then copy to cache.
+      # This allows us to track exactly which packages are needed for this
+      # requirements file.
+      download_dir = tempfile.mkdtemp(dir=temp_directory)
+
       cmd_args = [
           Stager._get_python_executable(),
           '-m',
           'pip',
           'download',
           '--dest',
-          cache_dir,
+          download_dir,

Review Comment:
   Great point, yes, this should address the broader goal of using cached 
directory of re-downloading the packages. I have added a `--find-links` 
argument which checks for the packages which exists in cache, and only download 
the new ones.
   
   Let me know if it makes sense to you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to