Another way to copy only the deps you care about is to let `pip download`
do the copy itself.  I believe you can pass the cache dir to `pip download
--find-links` and it will read from that before reading from PyPI (you may
also need to set --wheel-dir to the cache dir as well), so it acts as a
simple copy.
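
Roughly what I have in mind, as an untested sketch. The function names and
the idea of a fresh per-job destination directory are my own assumptions,
not what stager.py does today; the remaining flags mirror the `pip download`
command quoted below.

```python
import subprocess
import sys


def build_pip_download_command(requirements_file, cache_dir, dest_dir):
    """Builds a `pip download` command that also searches cache_dir.

    Hypothetical helper: dest_dir is assumed to be a fresh directory for
    this job, so only the packages pip resolves end up in it.
    """
    return [
        sys.executable, '-m', 'pip', 'download',
        '--dest', dest_dir,
        # Let pip consider archives already sitting in the shared cache.
        '--find-links', cache_dir,
        '-r', requirements_file,
        '--exists-action', 'i',
        '--no-binary', ':all:',
    ]


def populate_staging_dir(requirements_file, cache_dir, dest_dir):
    """Runs pip so that dest_dir holds only the currently needed sources."""
    subprocess.check_call(
        build_pip_download_command(requirements_file, cache_dir, dest_dir))
```

We would then stage everything in dest_dir rather than the whole cache dir.
Whether pip really prefers the --find-links copy over PyPI in every case is
something we'd have to verify.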

-chad


On Thu, Dec 5, 2019 at 12:07 PM Valentyn Tymofieiev <valen...@google.com>
wrote:

> Looked for a bit at pip download command. The alternative seems to parse
> the output of
>
> python -m pip download  --dest . -r requirements.txt  --exists-action i
> --no-binary :all:
>
> and see which files were downloaded and/or skipped since they were already
> present, and then stage only the files that appear in the log output. Seems
> doable but may break if pip output changes between pip versions, so we'd
> have to add a test as well.
>
> On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik <lc...@google.com> wrote:
>
>> I think reusing the same cache directory makes sense during downloading
>> but why do we upload everything that is there?
>>
>> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri <eh...@google.com> wrote:
>>
>>> Looking at the source, it seems that it should be using
>>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>>> to create a different tmp directory on each run.
>>>
>>> Also, sampling worker no. 2:
>>>
>>> jenkins@apache-beam-jenkins-2:~$ ls -l /tmp/dataflow-requirements-cache/
>>> total 7172
>>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 funcsigs-1.0.2.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 mock-3.0.5.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 PyHamcrest-1.9.0.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 setuptools-41.2.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 setuptools-41.4.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 setuptools-41.5.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 setuptools-41.5.1.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 setuptools-41.6.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 setuptools-42.0.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 setuptools-42.0.1.zip
>>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 setuptools-42.0.2.zip
>>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 six-1.12.0.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 six-1.13.0.tar.gz
>>>
>>>
>>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Can we filter the cache directory only for the artifacts that we want
>>>> and not everything that is there?
>>>>
>>>> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev <valen...@google.com>
>>>> wrote:
>>>>
>>>>> Luke, I am not sure I understand the question. The caching that
>>>>> happens here is implemented in the SDK for requirements packages:
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>
>>>>>
>>>>> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> Is there a way to use a cache on disk that is separate from the set
>>>>>> of packages we use as requirements?
>>>>>>
>>>>>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>>>>>>
>>>>>>> Thanks!
>>>>>>> Another reason to periodically refresh workers.
>>>>>>>
>>>>>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>>>>>> valen...@google.com> wrote:
>>>>>>>
>>>>>>>> The test jobs specify [1] a requirements.txt file that contains two
>>>>>>>> entries: pyhamcrest, mock.
>>>>>>>>
>>>>>>>> We download [2] the sources of the packages specified in the
>>>>>>>> requirements file, and of the packages they depend on. While doing
>>>>>>>> so, it appears that we use a cache directory on Jenkins to store the
>>>>>>>> sources of the packages [3], perhaps to save a trip to PyPI and
>>>>>>>> reduce PyPI flakiness? Then, we stage the entire cache directory [4],
>>>>>>>> which includes all packages ever cached. Over time the versions that
>>>>>>>> our requirements packages need change, but I guess we don't clean the
>>>>>>>> cache on Jenkins workers.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>>>>>>> [2]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>>>>>>> [3]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>>>>
>>>>>>>> [4]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>>>>>>
>>>>>>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I was investigating a Dataflow postcommit test failure
>>>>>>>>> (endpoints_pb2 missing), and saw this in the staging directory:
>>>>>>>>>
>>>>>>>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>>>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does anyone know why so many versions of setuptools need to be
>>>>>>>>> staged? Shouldn't 1 be enough?
>>>>>>>>>
>>>>>>>>
