Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-29 Thread Hannah Jiang
Yes, that's the plan for release images. I had a PR[1] merged several days
ago, and need to make change again as the default settings changed.
I will add validation tasks with details to release guide as well.

The other question is, how easy would it be for a release manager to
> accidentally push non-compliant images?

The validation tasks would be checking licenses are included and the number
of third dependencies are align with what we see with daily run for each
image.
With this checking, I assume it's not easy to push non-compliant images
*accidentally* as long as following the release guide.

1.
https://github.com/apache/beam/commit/8cfc8a8576c49b53f3091e800db673736070473a#diff-ef90cea897c13eb19f47e51532e706a8



On Wed, Apr 29, 2020 at 4:39 PM Ahmet Altay  wrote:

>
>
> On Wed, Apr 29, 2020 at 3:49 PM Valentyn Tymofieiev 
> wrote:
>
>>
>>
>> On Tue, Apr 28, 2020 at 5:03 PM Robert Bradshaw 
>> wrote:
>>
>>> On Tue, Apr 28, 2020 at 4:46 PM Hannah Jiang 
>>> wrote:
>>>
 I guess I assumed there was some reason we needed "lightweight images"
> in our tests (because licenses take up a lot of space IIRC), but maybe 
> not.
> Can you elaborate on the purpose of this option Hannah?

 Reducing image size is a good reason to have the pull option. There
 were user requests to create lightweight images. Now I think users can use
 skip option to create lightweight iamges. pull option is not needed.

 Then the discussion becomes which one should be the default mode?
 According to feedback, skip should be the default mode, and change it to
 add mode when running Jenkins test. And the docker-pull-licenses tag is
 binary again.
 It seems like there is an easy way to pass the tag to all Jenkins test,
 which is adding *context.switches("-Pdocker-pull-licenses"**)* to
 CommonJobProperties.groovy
 .
 I'm not very familiar with Jenkins, does this would work as expected?
 If it sounds like a bad idea to pass this tag to ALL tests, I would ask
 help from community to identify those tests which create docker images
 (Java and Python for now) and pass the tag to the tests.

>>>
>>> That could work.
>>>
>>> The other question is, how easy would it be for a release manager to
>>> accidentally push non-compliant images?
>>>
>>
>> Echoing my previous question: would it make sense to add licenses
>> whenever we build containers on the release branch, which can be checked by
>> [1]?
>>
>
> I think this is a good idea. We can also add a validation task to the
> validation spreadsheet to manually check that containers are shipped with
> licenses.
>
>
>> [1]
>> https://github.com/apache/beam/blob/74a6565c8b64d9fadf256370df47a4c5dadafb55/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L296
>>
>>
>>
>>
>>>
>>> We have definitely not explored how cheap we can make doing the "real
>>> thing" can be, which would IMHO be best.
>>>
>>>


 On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver 
 wrote:

> > To overwrite, we need to pass the tag for ALL Jenkins tests that
> create docker images, which is quite a bunch of work.
>
> If it comes to that, I'd rather we do the work of passing tags instead
> of users.
>
> > Does anyone know if there is an easy way to pass the tag to many
> tests? or overwrite the default mode, which will be defined at
> gradle.properties, to pull ONLY with Jenkins tests?
>
> There is precedence for the latter option:
> https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45
>
> > (On that note, however, pull seems bad for both.)
>
> I guess I assumed there was some reason we needed "lightweight images"
> in our tests (because licenses take up a lot of space IIRC), but maybe 
> not.
> Can you elaborate on the purpose of this option Hannah?
>
> On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
> wrote:
>
>> Fundamentally having license checking off by default is dangerous for
>> releases, but having it on by default is annoying for developers. (On 
>> that
>> note, however, pull seems bad for both.) Is it possible to make it a 
>> gradle
>> target that only runs when something (specifically dependencies, or the
>> files declaring them) have changed, and would leverage the gradle cache
>> (and hence be cheap) otherwise?
>>
>> Alternatively, I think we should invest in URL caching. These urls
>> should be immutable; let's only download them once, ever.
>>
>> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
>> wrote:
>>
>>> Do you mean set the default mode to skip and overwrite it to pull
>>> mode with Jenkins test? To overwrite, we need to pass the tag for ALL
>>> Jenkins tests that create docker images, which i

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-29 Thread Ahmet Altay
On Wed, Apr 29, 2020 at 3:49 PM Valentyn Tymofieiev 
wrote:

>
>
> On Tue, Apr 28, 2020 at 5:03 PM Robert Bradshaw 
> wrote:
>
>> On Tue, Apr 28, 2020 at 4:46 PM Hannah Jiang 
>> wrote:
>>
>>> I guess I assumed there was some reason we needed "lightweight images"
 in our tests (because licenses take up a lot of space IIRC), but maybe not.
 Can you elaborate on the purpose of this option Hannah?
>>>
>>> Reducing image size is a good reason to have the pull option. There were
>>> user requests to create lightweight images. Now I think users can use skip
>>> option to create lightweight iamges. pull option is not needed.
>>>
>>> Then the discussion becomes which one should be the default mode?
>>> According to feedback, skip should be the default mode, and change it to
>>> add mode when running Jenkins test. And the docker-pull-licenses tag is
>>> binary again.
>>> It seems like there is an easy way to pass the tag to all Jenkins test,
>>> which is adding *context.switches("-Pdocker-pull-licenses"**)* to
>>> CommonJobProperties.groovy
>>> .
>>> I'm not very familiar with Jenkins, does this would work as expected?
>>> If it sounds like a bad idea to pass this tag to ALL tests, I would ask
>>> help from community to identify those tests which create docker images
>>> (Java and Python for now) and pass the tag to the tests.
>>>
>>
>> That could work.
>>
>> The other question is, how easy would it be for a release manager to
>> accidentally push non-compliant images?
>>
>
> Echoing my previous question: would it make sense to add licenses whenever
> we build containers on the release branch, which can be checked by [1]?
>

I think this is a good idea. We can also add a validation task to the
validation spreadsheet to manually check that containers are shipped with
licenses.


> [1]
> https://github.com/apache/beam/blob/74a6565c8b64d9fadf256370df47a4c5dadafb55/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L296
>
>
>
>
>>
>> We have definitely not explored how cheap we can make doing the "real
>> thing" can be, which would IMHO be best.
>>
>>
>>>
>>>
>>> On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver  wrote:
>>>
 > To overwrite, we need to pass the tag for ALL Jenkins tests that
 create docker images, which is quite a bunch of work.

 If it comes to that, I'd rather we do the work of passing tags instead
 of users.

 > Does anyone know if there is an easy way to pass the tag to many
 tests? or overwrite the default mode, which will be defined at
 gradle.properties, to pull ONLY with Jenkins tests?

 There is precedence for the latter option:
 https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45

 > (On that note, however, pull seems bad for both.)

 I guess I assumed there was some reason we needed "lightweight images"
 in our tests (because licenses take up a lot of space IIRC), but maybe not.
 Can you elaborate on the purpose of this option Hannah?

 On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
 wrote:

> Fundamentally having license checking off by default is dangerous for
> releases, but having it on by default is annoying for developers. (On that
> note, however, pull seems bad for both.) Is it possible to make it a 
> gradle
> target that only runs when something (specifically dependencies, or the
> files declaring them) have changed, and would leverage the gradle cache
> (and hence be cheap) otherwise?
>
> Alternatively, I think we should invest in URL caching. These urls
> should be immutable; let's only download them once, ever.
>
> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
> wrote:
>
>> Do you mean set the default mode to skip and overwrite it to pull
>> mode with Jenkins test? To overwrite, we need to pass the tag for ALL
>> Jenkins tests that create docker images, which is quite a bunch of work.
>> Does anyone know if there is an easy way to pass the tag to many
>> tests? or overwrite the default mode, which will be defined at
>> gradle.properties, to pull ONLY with Jenkins tests? We can add a step to 
>> do
>> this after cloning a git branch to Jenkins machine. But it seems easier 
>> to
>> change the local gradle.properties for local development purpose..?
>>
>>
>>
>> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>>
>>> Can this be solved by enabling the license magic for the Jenkins
>>> jobs?
>>>
>>> I also think the default should be off for better local development
>>> experience.
>>>
>>> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
>>> wrote:
>>>
 We need to make sure the release images contains licenses/notices
 in order to avoid potential legal issues. The purpose

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-29 Thread Valentyn Tymofieiev
On Tue, Apr 28, 2020 at 5:03 PM Robert Bradshaw  wrote:

> On Tue, Apr 28, 2020 at 4:46 PM Hannah Jiang 
> wrote:
>
>> I guess I assumed there was some reason we needed "lightweight images" in
>>> our tests (because licenses take up a lot of space IIRC), but maybe not.
>>> Can you elaborate on the purpose of this option Hannah?
>>
>> Reducing image size is a good reason to have the pull option. There were
>> user requests to create lightweight images. Now I think users can use skip
>> option to create lightweight iamges. pull option is not needed.
>>
>> Then the discussion becomes which one should be the default mode?
>> According to feedback, skip should be the default mode, and change it to
>> add mode when running Jenkins test. And the docker-pull-licenses tag is
>> binary again.
>> It seems like there is an easy way to pass the tag to all Jenkins test,
>> which is adding *context.switches("-Pdocker-pull-licenses"**)* to
>> CommonJobProperties.groovy
>> .
>> I'm not very familiar with Jenkins, does this would work as expected?
>> If it sounds like a bad idea to pass this tag to ALL tests, I would ask
>> help from community to identify those tests which create docker images
>> (Java and Python for now) and pass the tag to the tests.
>>
>
> That could work.
>
> The other question is, how easy would it be for a release manager to
> accidentally push non-compliant images?
>

Echoing my previous question: would it make sense to add licenses whenever
we build containers on the release branch, which can be checked by [1]?

[1]
https://github.com/apache/beam/blob/74a6565c8b64d9fadf256370df47a4c5dadafb55/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L296




>
> We have definitely not explored how cheap we can make doing the "real
> thing" can be, which would IMHO be best.
>
>
>>
>>
>> On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver  wrote:
>>
>>> > To overwrite, we need to pass the tag for ALL Jenkins tests that
>>> create docker images, which is quite a bunch of work.
>>>
>>> If it comes to that, I'd rather we do the work of passing tags instead
>>> of users.
>>>
>>> > Does anyone know if there is an easy way to pass the tag to many
>>> tests? or overwrite the default mode, which will be defined at
>>> gradle.properties, to pull ONLY with Jenkins tests?
>>>
>>> There is precedence for the latter option:
>>> https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45
>>>
>>> > (On that note, however, pull seems bad for both.)
>>>
>>> I guess I assumed there was some reason we needed "lightweight images"
>>> in our tests (because licenses take up a lot of space IIRC), but maybe not.
>>> Can you elaborate on the purpose of this option Hannah?
>>>
>>> On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
>>> wrote:
>>>
 Fundamentally having license checking off by default is dangerous for
 releases, but having it on by default is annoying for developers. (On that
 note, however, pull seems bad for both.) Is it possible to make it a gradle
 target that only runs when something (specifically dependencies, or the
 files declaring them) have changed, and would leverage the gradle cache
 (and hence be cheap) otherwise?

 Alternatively, I think we should invest in URL caching. These urls
 should be immutable; let's only download them once, ever.

 On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
 wrote:

> Do you mean set the default mode to skip and overwrite it to pull mode
> with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
> tests that create docker images, which is quite a bunch of work.
> Does anyone know if there is an easy way to pass the tag to many
> tests? or overwrite the default mode, which will be defined at
> gradle.properties, to pull ONLY with Jenkins tests? We can add a step to 
> do
> this after cloning a git branch to Jenkins machine. But it seems easier to
> change the local gradle.properties for local development purpose..?
>
>
>
> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>
>> Can this be solved by enabling the license magic for the Jenkins jobs?
>>
>> I also think the default should be off for better local development
>> experience.
>>
>> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
>> wrote:
>>
>>> We need to make sure the release images contains licenses/notices in
>>> order to avoid potential legal issues. The purpose of setting the 
>>> default
>>> to pull is checking PRs which introduce new dependencies include thier
>>> licenses as well, in a way of auto pulling or tool pulling, to make sure
>>> all licenses are included when we create release images. If setting 
>>> default
>>> to skip, we only check licenses when we create release images and ma

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Robert Bradshaw
On Tue, Apr 28, 2020 at 4:46 PM Hannah Jiang  wrote:

> I guess I assumed there was some reason we needed "lightweight images" in
>> our tests (because licenses take up a lot of space IIRC), but maybe not.
>> Can you elaborate on the purpose of this option Hannah?
>
> Reducing image size is a good reason to have the pull option. There were
> user requests to create lightweight images. Now I think users can use skip
> option to create lightweight iamges. pull option is not needed.
>
> Then the discussion becomes which one should be the default mode?
> According to feedback, skip should be the default mode, and change it to
> add mode when running Jenkins test. And the docker-pull-licenses tag is
> binary again.
> It seems like there is an easy way to pass the tag to all Jenkins test,
> which is adding *context.switches("-Pdocker-pull-licenses"**)* to
> CommonJobProperties.groovy
> .
> I'm not very familiar with Jenkins, does this would work as expected?
> If it sounds like a bad idea to pass this tag to ALL tests, I would ask
> help from community to identify those tests which create docker images
> (Java and Python for now) and pass the tag to the tests.
>

That could work.

The other question is, how easy would it be for a release manager to
accidentally push non-compliant images?

We have definitely not explored how cheap we can make doing the "real
thing" can be, which would IMHO be best.


>
>
> On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver  wrote:
>
>> > To overwrite, we need to pass the tag for ALL Jenkins tests that create
>> docker images, which is quite a bunch of work.
>>
>> If it comes to that, I'd rather we do the work of passing tags instead of
>> users.
>>
>> > Does anyone know if there is an easy way to pass the tag to many tests?
>> or overwrite the default mode, which will be defined at gradle.properties,
>> to pull ONLY with Jenkins tests?
>>
>> There is precedence for the latter option:
>> https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45
>>
>> > (On that note, however, pull seems bad for both.)
>>
>> I guess I assumed there was some reason we needed "lightweight images" in
>> our tests (because licenses take up a lot of space IIRC), but maybe not.
>> Can you elaborate on the purpose of this option Hannah?
>>
>> On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
>> wrote:
>>
>>> Fundamentally having license checking off by default is dangerous for
>>> releases, but having it on by default is annoying for developers. (On that
>>> note, however, pull seems bad for both.) Is it possible to make it a gradle
>>> target that only runs when something (specifically dependencies, or the
>>> files declaring them) have changed, and would leverage the gradle cache
>>> (and hence be cheap) otherwise?
>>>
>>> Alternatively, I think we should invest in URL caching. These urls
>>> should be immutable; let's only download them once, ever.
>>>
>>> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
>>> wrote:
>>>
 Do you mean set the default mode to skip and overwrite it to pull mode
 with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
 tests that create docker images, which is quite a bunch of work.
 Does anyone know if there is an easy way to pass the tag to many tests?
 or overwrite the default mode, which will be defined at gradle.properties,
 to pull ONLY with Jenkins tests? We can add a step to do this after cloning
 a git branch to Jenkins machine. But it seems easier to change the local
 gradle.properties for local development purpose..?



 On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:

> Can this be solved by enabling the license magic for the Jenkins jobs?
>
> I also think the default should be off for better local development
> experience.
>
> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
> wrote:
>
>> We need to make sure the release images contains licenses/notices in
>> order to avoid potential legal issues. The purpose of setting the default
>> to pull is checking PRs which introduce new dependencies include thier
>> licenses as well, in a way of auto pulling or tool pulling, to make sure
>> all licenses are included when we create release images. If setting 
>> default
>> to skip, we only check licenses when we create release images and may see
>> many issues with the licenses, and the release would be delayed, maybe a
>> lot.
>>
>>
>> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver 
>> wrote:
>>
>>> The different modes make sense. One disagreement: I think the
>>> default should be skip. I imagine few users would want to put licenses 
>>> in
>>> their own images.
>>>
>>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
>>> wrote:
>>>
 The above 1,2,3,4 are merged

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Hannah Jiang
>
> I guess I assumed there was some reason we needed "lightweight images" in
> our tests (because licenses take up a lot of space IIRC), but maybe not.
> Can you elaborate on the purpose of this option Hannah?

Reducing image size is a good reason to have the pull option. There were
user requests to create lightweight images. Now I think users can use skip
option to create lightweight iamges. pull option is not needed.

Then the discussion becomes which one should be the default mode? According
to feedback, skip should be the default mode, and change it to add mode
when running Jenkins test. And the docker-pull-licenses tag is binary again.
It seems like there is an easy way to pass the tag to all Jenkins test,
which is adding *context.switches("-Pdocker-pull-licenses"**)* to
CommonJobProperties.groovy
.
I'm not very familiar with Jenkins, does this would work as expected?
If it sounds like a bad idea to pass this tag to ALL tests, I would ask
help from community to identify those tests which create docker images
(Java and Python for now) and pass the tag to the tests.


On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver  wrote:

> > To overwrite, we need to pass the tag for ALL Jenkins tests that create
> docker images, which is quite a bunch of work.
>
> If it comes to that, I'd rather we do the work of passing tags instead of
> users.
>
> > Does anyone know if there is an easy way to pass the tag to many tests?
> or overwrite the default mode, which will be defined at gradle.properties,
> to pull ONLY with Jenkins tests?
>
> There is precedence for the latter option:
> https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45
>
> > (On that note, however, pull seems bad for both.)
>
> I guess I assumed there was some reason we needed "lightweight images" in
> our tests (because licenses take up a lot of space IIRC), but maybe not.
> Can you elaborate on the purpose of this option Hannah?
>
> On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
> wrote:
>
>> Fundamentally having license checking off by default is dangerous for
>> releases, but having it on by default is annoying for developers. (On that
>> note, however, pull seems bad for both.) Is it possible to make it a gradle
>> target that only runs when something (specifically dependencies, or the
>> files declaring them) have changed, and would leverage the gradle cache
>> (and hence be cheap) otherwise?
>>
>> Alternatively, I think we should invest in URL caching. These urls should
>> be immutable; let's only download them once, ever.
>>
>> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
>> wrote:
>>
>>> Do you mean set the default mode to skip and overwrite it to pull mode
>>> with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
>>> tests that create docker images, which is quite a bunch of work.
>>> Does anyone know if there is an easy way to pass the tag to many tests?
>>> or overwrite the default mode, which will be defined at gradle.properties,
>>> to pull ONLY with Jenkins tests? We can add a step to do this after cloning
>>> a git branch to Jenkins machine. But it seems easier to change the local
>>> gradle.properties for local development purpose..?
>>>
>>>
>>>
>>> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>>>
 Can this be solved by enabling the license magic for the Jenkins jobs?

 I also think the default should be off for better local development
 experience.

 On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
 wrote:

> We need to make sure the release images contains licenses/notices in
> order to avoid potential legal issues. The purpose of setting the default
> to pull is checking PRs which introduce new dependencies include thier
> licenses as well, in a way of auto pulling or tool pulling, to make sure
> all licenses are included when we create release images. If setting 
> default
> to skip, we only check licenses when we create release images and may see
> many issues with the licenses, and the release would be delayed, maybe a
> lot.
>
>
> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver 
> wrote:
>
>> The different modes make sense. One disagreement: I think the default
>> should be skip. I imagine few users would want to put licenses in their 
>> own
>> images.
>>
>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
>> wrote:
>>
>>> The above 1,2,3,4 are merged to master.
>>> Here I would like to change behaviors with *docker-pull-licenses*
>>> tag and would like to collect feedbacks from community before 
>>> finalizing my
>>> PR.
>>>
>>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>>
>>>- *docker-pull-licenses=add* pulls licenses/notices and the
>>>files are added to docker images. This tag is used

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Valentyn Tymofieiev
How about we pull/add dependencies when either:
- The container build target is executed for a release branch [1].
- We are running the build target on Jenkins:
  -  some environmental variable tells us that we are running on Jenkins
  - path to repo starts with /home/jenkins
  - perhaps there is a better way to check this.

Would this mitigate concerns for local development mentioned here?

[1]
https://github.com/apache/beam/blob/74a6565c8b64d9fadf256370df47a4c5dadafb55/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L296


On Tue, Apr 28, 2020 at 3:50 PM Kyle Weaver  wrote:

> > To overwrite, we need to pass the tag for ALL Jenkins tests that create
> docker images, which is quite a bunch of work.
>
> If it comes to that, I'd rather we do the work of passing tags instead of
> users.
>
> > Does anyone know if there is an easy way to pass the tag to many tests?
> or overwrite the default mode, which will be defined at gradle.properties,
> to pull ONLY with Jenkins tests?
>
> There is precedence for the latter option:
> https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45
>
> > (On that note, however, pull seems bad for both.)
>
> I guess I assumed there was some reason we needed "lightweight images" in
> our tests (because licenses take up a lot of space IIRC), but maybe not.
> Can you elaborate on the purpose of this option Hannah?
>
> On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw 
> wrote:
>
>> Fundamentally having license checking off by default is dangerous for
>> releases, but having it on by default is annoying for developers. (On that
>> note, however, pull seems bad for both.) Is it possible to make it a gradle
>> target that only runs when something (specifically dependencies, or the
>> files declaring them) have changed, and would leverage the gradle cache
>> (and hence be cheap) otherwise?
>>
>> Alternatively, I think we should invest in URL caching. These urls should
>> be immutable; let's only download them once, ever.
>>
>> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
>> wrote:
>>
>>> Do you mean set the default mode to skip and overwrite it to pull mode
>>> with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
>>> tests that create docker images, which is quite a bunch of work.
>>> Does anyone know if there is an easy way to pass the tag to many tests?
>>> or overwrite the default mode, which will be defined at gradle.properties,
>>> to pull ONLY with Jenkins tests? We can add a step to do this after cloning
>>> a git branch to Jenkins machine. But it seems easier to change the local
>>> gradle.properties for local development purpose..?
>>>
>>>
>>>
>>> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>>>
 Can this be solved by enabling the license magic for the Jenkins jobs?

 I also think the default should be off for better local development
 experience.

 On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
 wrote:

> We need to make sure the release images contains licenses/notices in
> order to avoid potential legal issues. The purpose of setting the default
> to pull is checking PRs which introduce new dependencies include thier
> licenses as well, in a way of auto pulling or tool pulling, to make sure
> all licenses are included when we create release images. If setting 
> default
> to skip, we only check licenses when we create release images and may see
> many issues with the licenses, and the release would be delayed, maybe a
> lot.
>
>
> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver 
> wrote:
>
>> The different modes make sense. One disagreement: I think the default
>> should be skip. I imagine few users would want to put licenses in their 
>> own
>> images.
>>
>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
>> wrote:
>>
>>> The above 1,2,3,4 are merged to master.
>>> Here I would like to change behaviors with *docker-pull-licenses*
>>> tag and would like to collect feedbacks from community before 
>>> finalizing my
>>> PR.
>>>
>>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>>
>>>- *docker-pull-licenses=add* pulls licenses/notices and the
>>>files are added to docker images. This tag is used to create release 
>>> docker
>>>images.
>>>- *docker-pull-licenses=pull* pulls licenses/notices but do not
>>>add the files to docker images so that create lightweight images. 
>>> This is
>>>the default mode, and docker images created by Jenkins test use this 
>>> mode.
>>>This will make sure the files are all pullable and detect issues 
>>> with the
>>>tool, except the ADD part at DockerFile, which is rarely likely to 
>>> be an
>>>issue. It is better than the checking URLs approach mentioned above,
>>>because it is a more similar proce

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Kyle Weaver
> To overwrite, we need to pass the tag for ALL Jenkins tests that create
docker images, which is quite a bunch of work.

If it comes to that, I'd rather we do the work of passing tags instead of
users.

> Does anyone know if there is an easy way to pass the tag to many tests?
or overwrite the default mode, which will be defined at gradle.properties,
to pull ONLY with Jenkins tests?

There is precedence for the latter option:
https://github.com/apache/beam/blob/1905dbde5ca0858fce89f65c761e88511840a384/build.gradle#L45

> (On that note, however, pull seems bad for both.)

I guess I assumed there was some reason we needed "lightweight images" in
our tests (because licenses take up a lot of space IIRC), but maybe not.
Can you elaborate on the purpose of this option Hannah?

On Tue, Apr 28, 2020 at 6:40 PM Robert Bradshaw  wrote:

> Fundamentally having license checking off by default is dangerous for
> releases, but having it on by default is annoying for developers. (On that
> note, however, pull seems bad for both.) Is it possible to make it a gradle
> target that only runs when something (specifically dependencies, or the
> files declaring them) have changed, and would leverage the gradle cache
> (and hence be cheap) otherwise?
>
> Alternatively, I think we should invest in URL caching. These urls should
> be immutable; let's only download them once, ever.
>
> On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang 
> wrote:
>
>> Do you mean set the default mode to skip and overwrite it to pull mode
>> with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
>> tests that create docker images, which is quite a bunch of work.
>> Does anyone know if there is an easy way to pass the tag to many tests?
>> or overwrite the default mode, which will be defined at gradle.properties,
>> to pull ONLY with Jenkins tests? We can add a step to do this after cloning
>> a git branch to Jenkins machine. But it seems easier to change the local
>> gradle.properties for local development purpose..?
>>
>>
>>
>> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>>
>>> Can this be solved by enabling the license magic for the Jenkins jobs?
>>>
>>> I also think the default should be off for better local development
>>> experience.
>>>
>>> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
>>> wrote:
>>>
 We need to make sure the release images contains licenses/notices in
 order to avoid potential legal issues. The purpose of setting the default
 to pull is checking PRs which introduce new dependencies include thier
 licenses as well, in a way of auto pulling or tool pulling, to make sure
 all licenses are included when we create release images. If setting default
 to skip, we only check licenses when we create release images and may see
 many issues with the licenses, and the release would be delayed, maybe a
 lot.


 On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver 
 wrote:

> The different modes make sense. One disagreement: I think the default
> should be skip. I imagine few users would want to put licenses in their 
> own
> images.
>
> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
> wrote:
>
>> The above 1,2,3,4 are merged to master.
>> Here I would like to change behaviors with *docker-pull-licenses*
>> tag and would like to collect feedbacks from community before finalizing 
>> my
>> PR.
>>
>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>
>>- *docker-pull-licenses=add* pulls licenses/notices and the files
>>are added to docker images. This tag is used to create release docker
>>images.
>>- *docker-pull-licenses=pull* pulls licenses/notices but do not
>>add the files to docker images so that create lightweight images. 
>> This is
>>the default mode, and docker images created by Jenkins test use this 
>> mode.
>>This will make sure the files are all pullable and detect issues with 
>> the
>>tool, except the ADD part at DockerFile, which is rarely likely to be 
>> an
>>issue. It is better than the checking URLs approach mentioned above,
>>because it is a more similar process like creating release images and 
>> code
>>can be simplified so that make it easier to maintain.
>>- *docker-pull-licenses=skip* skips all license pulling related
>>tasks. This tag is provided for users who customize their images with
>>DockerFile provided by us and would not like to deal with license 
>> stuff.
>>Without this tag, the users should change the .gradle file to get rid 
>> of
>>it. It also can be used for local development when developers want to 
>> solve
>>license related issues later.
>>
>> I would like to see this tag is adopted by all docker images released
>> by Beam, including Flink and Spark job server images, and share 

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Robert Bradshaw
Fundamentally having license checking off by default is dangerous for
releases, but having it on by default is annoying for developers. (On that
note, however, pull seems bad for both.) Is it possible to make it a gradle
target that only runs when something (specifically dependencies, or the
files declaring them) have changed, and would leverage the gradle cache
(and hence be cheap) otherwise?

Alternatively, I think we should invest in URL caching. These urls should
be immutable; let's only download them once, ever.

On Tue, Apr 28, 2020 at 3:31 PM Hannah Jiang  wrote:

> Do you mean set the default mode to skip and overwrite it to pull mode
> with Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins
> tests that create docker images, which is quite a bunch of work.
> Does anyone know if there is an easy way to pass the tag to many tests? or
> overwrite the default mode, which will be defined at gradle.properties, to
> pull ONLY with Jenkins tests? We can add a step to do this after cloning a
> git branch to Jenkins machine. But it seems easier to change the local
> gradle.properties for local development purpose..?
>
>
>
> On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:
>
>> Can this be solved by enabling the license magic for the Jenkins jobs?
>>
>> I also think the default should be off for better local development
>> experience.
>>
>> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
>> wrote:
>>
>>> We need to make sure the release images contains licenses/notices in
>>> order to avoid potential legal issues. The purpose of setting the default
>>> to pull is checking PRs which introduce new dependencies include thier
>>> licenses as well, in a way of auto pulling or tool pulling, to make sure
>>> all licenses are included when we create release images. If setting default
>>> to skip, we only check licenses when we create release images and may see
>>> many issues with the licenses, and the release would be delayed, maybe a
>>> lot.
>>>
>>>
>>> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver  wrote:
>>>
 The different modes make sense. One disagreement: I think the default
 should be skip. I imagine few users would want to put licenses in their own
 images.

 On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
 wrote:

> The above 1,2,3,4 are merged to master.
> Here I would like to change behaviors with *docker-pull-licenses* tag
> and would like to collect feedbacks from community before finalizing my 
> PR.
>
> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>
>- *docker-pull-licenses=add* pulls licenses/notices and the files
>are added to docker images. This tag is used to create release docker
>images.
>- *docker-pull-licenses=pull* pulls licenses/notices but do not
>add the files to docker images so that create lightweight images. This 
> is
>the default mode, and docker images created by Jenkins test use this 
> mode.
>This will make sure the files are all pullable and detect issues with 
> the
>tool, except the ADD part at DockerFile, which is rarely likely to be 
> an
>issue. It is better than the checking URLs approach mentioned above,
>because it is a more similar process like creating release images and 
> code
>can be simplified so that make it easier to maintain.
>- *docker-pull-licenses=skip* skips all license pulling related
>tasks. This tag is provided for users who customize their images with
>DockerFile provided by us and would not like to deal with license 
> stuff.
>Without this tag, the users should change the .gradle file to get rid 
> of
>it. It also can be used for local development when developers want to 
> solve
>license related issues later.
>
> I would like to see this tag is adopted by all docker images released
> by Beam, including Flink and Spark job server images, and share the same
> default mode.
> Does this sound good? Are there any concerns?
>
> Please let me know what you think.
> Hannah
>
>
>
> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
> wrote:
>
>> Ok, so the fianl approach is
>> 1. Using multi threading with 16 threads.
>> 2. Introduce docker-pull-license tag, by default it is disabled.
>> 3. By default, only check if urls return 200, not actually pull the
>> files to reduce image size. Licenses add 85MB of image size as of today.
>> 4. Add retries to avoid flakiness. Initially it was added to download
>> part only, assuming urlib is stable when validates urls, but it is not 
>> true.
>>
>> For caching, it is definitely useful. It will mitigate flakiness
>> further, like when github is down (many urls point to github). It also
>> reduces unnecessary internet traffic.
>> Let's reconsider it after the curre

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Hannah Jiang
Do you mean set the default mode to skip and overwrite it to pull mode with
Jenkins test? To overwrite, we need to pass the tag for ALL Jenkins tests
that create docker images, which is quite a bunch of work.
Does anyone know if there is an easy way to pass the tag to many tests? or
overwrite the default mode, which will be defined at gradle.properties, to
pull ONLY with Jenkins tests? We can add a step to do this after cloning a
git branch to Jenkins machine. But it seems easier to change the local
gradle.properties for local development purpose..?



On Tue, Apr 28, 2020 at 3:09 PM Thomas Weise  wrote:

> Can this be solved by enabling the license magic for the Jenkins jobs?
>
> I also think the default should be off for better local development
> experience.
>
> On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang 
> wrote:
>
>> We need to make sure the release images contains licenses/notices in
>> order to avoid potential legal issues. The purpose of setting the default
>> to pull is checking PRs which introduce new dependencies include thier
>> licenses as well, in a way of auto pulling or tool pulling, to make sure
>> all licenses are included when we create release images. If setting default
>> to skip, we only check licenses when we create release images and may see
>> many issues with the licenses, and the release would be delayed, maybe a
>> lot.
>>
>>
>> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver  wrote:
>>
>>> The different modes make sense. One disagreement: I think the default
>>> should be skip. I imagine few users would want to put licenses in their own
>>> images.
>>>
>>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
>>> wrote:
>>>
 The above 1,2,3,4 are merged to master.
 Here I would like to change behaviors with *docker-pull-licenses* tag
 and would like to collect feedbacks from community before finalizing my PR.

 docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].

- *docker-pull-licenses=add* pulls licenses/notices and the files
are added to docker images. This tag is used to create release docker
images.
- *docker-pull-licenses=pull* pulls licenses/notices but do not add
the files to docker images so that create lightweight images. This is 
 the
default mode, and docker images created by Jenkins test use this mode. 
 This
will make sure the files are all pullable and detect issues with the 
 tool,
except the ADD part at DockerFile, which is rarely likely to be an 
 issue.
It is better than the checking URLs approach mentioned above, because 
 it is
a more similar process like creating release images and code can be
simplified so that make it easier to maintain.
- *docker-pull-licenses=skip* skips all license pulling related
tasks. This tag is provided for users who customize their images with
DockerFile provided by us and would not like to deal with license stuff.
Without this tag, the users should change the .gradle file to get rid of
it. It also can be used for local development when developers want to 
 solve
license related issues later.

 I would like to see this tag is adopted by all docker images released
 by Beam, including Flink and Spark job server images, and share the same
 default mode.
 Does this sound good? Are there any concerns?

 Please let me know what you think.
 Hannah



 On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
 wrote:

> Ok, so the fianl approach is
> 1. Using multi threading with 16 threads.
> 2. Introduce docker-pull-license tag, by default it is disabled.
> 3. By default, only check if urls return 200, not actually pull the
> files to reduce image size. Licenses add 85MB of image size as of today.
> 4. Add retries to avoid flakiness. Initially it was added to download
> part only, assuming urlib is stable when validates urls, but it is not 
> true.
>
> For caching, it is definitely useful. It will mitigate flakiness
> further, like when github is down (many urls point to github). It also
> reduces unnecessary internet traffic.
> Let's reconsider it after the current work is merged to 2.21.0.
>
> Thank you all for providing feedback and ideas.
>
>
> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver 
> wrote:
>
>> > I tried with multi processing and it improved the performance a lot.
>>
>> Great! Though it won't help with flakes, so as you said we should
>> still look into caching as well.
>>
>> > You could try using a threadpool
>>
>> +1
>>
>> > However, we would like to include this work as part of 2.21.0
>>
>> I had marked the jira as a blocker for 2.21.0 because I was afraid
>> something was broken, but now it looks like the failures were just 
>> flakes.
>> So BEAM-9764

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Thomas Weise
Can this be solved by enabling the license magic for the Jenkins jobs?

I also think the default should be off for better local development
experience.

On Tue, Apr 28, 2020 at 3:05 PM Hannah Jiang  wrote:

> We need to make sure the release images contains licenses/notices in order
> to avoid potential legal issues. The purpose of setting the default to pull
> is checking PRs which introduce new dependencies include thier licenses as
> well, in a way of auto pulling or tool pulling, to make sure all licenses
> are included when we create release images. If setting default to skip, we
> only check licenses when we create release images and may see many issues
> with the licenses, and the release would be delayed, maybe a lot.
>
>
> On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver  wrote:
>
>> The different modes make sense. One disagreement: I think the default
>> should be skip. I imagine few users would want to put licenses in their own
>> images.
>>
>> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
>> wrote:
>>
>>> The above 1,2,3,4 are merged to master.
>>> Here I would like to change behaviors with *docker-pull-licenses* tag
>>> and would like to collect feedbacks from community before finalizing my PR.
>>>
>>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>>
>>>- *docker-pull-licenses=add* pulls licenses/notices and the files
>>>are added to docker images. This tag is used to create release docker
>>>images.
>>>- *docker-pull-licenses=pull* pulls licenses/notices but do not add
>>>the files to docker images so that create lightweight images. This is the
>>>default mode, and docker images created by Jenkins test use this mode. 
>>> This
>>>will make sure the files are all pullable and detect issues with the 
>>> tool,
>>>except the ADD part at DockerFile, which is rarely likely to be an issue.
>>>It is better than the checking URLs approach mentioned above, because it 
>>> is
>>>a more similar process like creating release images and code can be
>>>simplified so that make it easier to maintain.
>>>- *docker-pull-licenses=skip* skips all license pulling related
>>>tasks. This tag is provided for users who customize their images with
>>>DockerFile provided by us and would not like to deal with license stuff.
>>>Without this tag, the users should change the .gradle file to get rid of
>>>it. It also can be used for local development when developers want to 
>>> solve
>>>license related issues later.
>>>
>>> I would like to see this tag is adopted by all docker images released by
>>> Beam, including Flink and Spark job server images, and share the same
>>> default mode.
>>> Does this sound good? Are there any concerns?
>>>
>>> Please let me know what you think.
>>> Hannah
>>>
>>>
>>>
>>> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
>>> wrote:
>>>
 Ok, so the fianl approach is
 1. Using multi threading with 16 threads.
 2. Introduce docker-pull-license tag, by default it is disabled.
 3. By default, only check if urls return 200, not actually pull the
 files to reduce image size. Licenses add 85MB of image size as of today.
 4. Add retries to avoid flakiness. Initially it was added to download
 part only, assuming urlib is stable when validates urls, but it is not 
 true.

 For caching, it is definitely useful. It will mitigate flakiness
 further, like when github is down (many urls point to github). It also
 reduces unnecessary internet traffic.
 Let's reconsider it after the current work is merged to 2.21.0.

 Thank you all for providing feedback and ideas.


 On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver 
 wrote:

> > I tried with multi processing and it improved the performance a lot.
>
> Great! Though it won't help with flakes, so as you said we should
> still look into caching as well.
>
> > You could try using a threadpool
>
> +1
>
> > However, we would like to include this work as part of 2.21.0
>
> I had marked the jira as a blocker for 2.21.0 because I was afraid
> something was broken, but now it looks like the failures were just flakes.
> So BEAM-9764  should
> not be a release blocker.
>
> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:
>
>> Hi Hannah,
>>
>> Thanks for investigating!
>>
>> I think it would be great to eliminate the overhead for local builds
>> (by default turn off the license assembly) and make it as lightweight
>> as possible in the frequent CI path.
>>
>> Thomas
>>
>>
>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
>> wrote:
>>
>>> I tried to check if urls are valid instead of pulling the files and
>>> it reduced only 1 min of running time. So, it's not an option here.
>>>
>>> I tried with multi processing and it impro

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Hannah Jiang
We need to make sure the release images contains licenses/notices in order
to avoid potential legal issues. The purpose of setting the default to pull
is checking PRs which introduce new dependencies include thier licenses as
well, in a way of auto pulling or tool pulling, to make sure all licenses
are included when we create release images. If setting default to skip, we
only check licenses when we create release images and may see many issues
with the licenses, and the release would be delayed, maybe a lot.


On Tue, Apr 28, 2020 at 2:50 PM Kyle Weaver  wrote:

> The different modes make sense. One disagreement: I think the default
> should be skip. I imagine few users would want to put licenses in their own
> images.
>
> On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang 
> wrote:
>
>> The above 1,2,3,4 are merged to master.
>> Here I would like to change behaviors with *docker-pull-licenses* tag
>> and would like to collect feedbacks from community before finalizing my PR.
>>
>> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>>
>>- *docker-pull-licenses=add* pulls licenses/notices and the files are
>>added to docker images. This tag is used to create release docker images.
>>- *docker-pull-licenses=pull* pulls licenses/notices but do not add
>>the files to docker images so that create lightweight images. This is the
>>default mode, and docker images created by Jenkins test use this mode. 
>> This
>>will make sure the files are all pullable and detect issues with the tool,
>>except the ADD part at DockerFile, which is rarely likely to be an issue.
>>It is better than the checking URLs approach mentioned above, because it 
>> is
>>a more similar process like creating release images and code can be
>>simplified so that make it easier to maintain.
>>- *docker-pull-licenses=skip* skips all license pulling related
>>tasks. This tag is provided for users who customize their images with
>>DockerFile provided by us and would not like to deal with license stuff.
>>Without this tag, the users should change the .gradle file to get rid of
>>it. It also can be used for local development when developers want to 
>> solve
>>license related issues later.
>>
>> I would like to see this tag is adopted by all docker images released by
>> Beam, including Flink and Spark job server images, and share the same
>> default mode.
>> Does this sound good? Are there any concerns?
>>
>> Please let me know what you think.
>> Hannah
>>
>>
>>
>> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
>> wrote:
>>
>>> Ok, so the fianl approach is
>>> 1. Using multi threading with 16 threads.
>>> 2. Introduce docker-pull-license tag, by default it is disabled.
>>> 3. By default, only check if urls return 200, not actually pull the
>>> files to reduce image size. Licenses add 85MB of image size as of today.
>>> 4. Add retries to avoid flakiness. Initially it was added to download
>>> part only, assuming urlib is stable when validates urls, but it is not true.
>>>
>>> For caching, it is definitely useful. It will mitigate flakiness
>>> further, like when github is down (many urls point to github). It also
>>> reduces unnecessary internet traffic.
>>> Let's reconsider it after the current work is merged to 2.21.0.
>>>
>>> Thank you all for providing feedback and ideas.
>>>
>>>
>>> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver 
>>> wrote:
>>>
 > I tried with multi processing and it improved the performance a lot.

 Great! Though it won't help with flakes, so as you said we should still
 look into caching as well.

 > You could try using a threadpool

 +1

 > However, we would like to include this work as part of 2.21.0

 I had marked the jira as a blocker for 2.21.0 because I was afraid
 something was broken, but now it looks like the failures were just flakes.
 So BEAM-9764  should
 not be a release blocker.

 On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:

> Hi Hannah,
>
> Thanks for investigating!
>
> I think it would be great to eliminate the overhead for local builds
> (by default turn off the license assembly) and make it as lightweight
> as possible in the frequent CI path.
>
> Thomas
>
>
> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
> wrote:
>
>> I tried to check if urls are valid instead of pulling the files and
>> it reduced only 1 min of running time. So, it's not an option here.
>>
>> I tried with multi processing and it improved the performance a lot.
>> With 12 subprocesses, the running time reduced to 49 seconds, and
>> with 16 cores, it reduced to 18 seconds.
>> The number of subprocesses is defined by the number of cores, and
>> Jenkins machine has 16 cores.
>> FYI: with my local machine (12 cores) and home network, it takes 1min
>> 40 

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Kyle Weaver
The different modes make sense. One disagreement: I think the default
should be skip. I imagine few users would want to put licenses in their own
images.

On Tue, Apr 28, 2020 at 5:36 PM Hannah Jiang  wrote:

> The above 1,2,3,4 are merged to master.
> Here I would like to change behaviors with *docker-pull-licenses* tag and
> would like to collect feedbacks from community before finalizing my PR.
>
> docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].
>
>- *docker-pull-licenses=add* pulls licenses/notices and the files are
>added to docker images. This tag is used to create release docker images.
>- *docker-pull-licenses=pull* pulls licenses/notices but do not add
>the files to docker images so that create lightweight images. This is the
>default mode, and docker images created by Jenkins test use this mode. This
>will make sure the files are all pullable and detect issues with the tool,
>except the ADD part at DockerFile, which is rarely likely to be an issue.
>It is better than the checking URLs approach mentioned above, because it is
>a more similar process like creating release images and code can be
>simplified so that make it easier to maintain.
>- *docker-pull-licenses=skip* skips all license pulling related tasks.
>This tag is provided for users who customize their images with DockerFile
>provided by us and would not like to deal with license stuff. Without this
>tag, the users should change the .gradle file to get rid of it. It also can
>be used for local development when developers want to solve license related
>issues later.
>
> I would like to see this tag is adopted by all docker images released by
> Beam, including Flink and Spark job server images, and share the same
> default mode.
> Does this sound good? Are there any concerns?
>
> Please let me know what you think.
> Hannah
>
>
>
> On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
> wrote:
>
>> Ok, so the fianl approach is
>> 1. Using multi threading with 16 threads.
>> 2. Introduce docker-pull-license tag, by default it is disabled.
>> 3. By default, only check if urls return 200, not actually pull the files
>> to reduce image size. Licenses add 85MB of image size as of today.
>> 4. Add retries to avoid flakiness. Initially it was added to download
>> part only, assuming urlib is stable when validates urls, but it is not true.
>>
>> For caching, it is definitely useful. It will mitigate flakiness further,
>> like when github is down (many urls point to github). It also reduces
>> unnecessary internet traffic.
>> Let's reconsider it after the current work is merged to 2.21.0.
>>
>> Thank you all for providing feedback and ideas.
>>
>>
>> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver  wrote:
>>
>>> > I tried with multi processing and it improved the performance a lot.
>>>
>>> Great! Though it won't help with flakes, so as you said we should still
>>> look into caching as well.
>>>
>>> > You could try using a threadpool
>>>
>>> +1
>>>
>>> > However, we would like to include this work as part of 2.21.0
>>>
>>> I had marked the jira as a blocker for 2.21.0 because I was afraid
>>> something was broken, but now it looks like the failures were just flakes.
>>> So BEAM-9764  should
>>> not be a release blocker.
>>>
>>> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:
>>>
 Hi Hannah,

 Thanks for investigating!

 I think it would be great to eliminate the overhead for local builds
 (by default turn off the license assembly) and make it as lightweight
 as possible in the frequent CI path.

 Thomas


 On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
 wrote:

> I tried to check if urls are valid instead of pulling the files and it
> reduced only 1 min of running time. So, it's not an option here.
>
> I tried with multi processing and it improved the performance a lot.
> With 12 subprocesses, the running time reduced to 49 seconds, and with
> 16 cores, it reduced to 18 seconds.
> The number of subprocesses is defined by the number of cores, and
> Jenkins machine has 16 cores.
> FYI: with my local machine (12 cores) and home network, it takes 1min
> 40 secs to create a Java docker image.
>
> The caching approach mentioned by Robert brings many benefits, not
> only to this use case.
> However, we would like to include this work as part of 2.21.0, so I
> will move with the multi processing approach this time.
>
> Please let me know if you have objections.
>
>
> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
> wrote:
>
>> Is the cost primarily in pulling these remote licenses/sources? I'd
>> guess that 99.9% of the URLs remain the same from run to run. Would a
>> simple cache, or caching proxy, be sufficient?
>>
>> Otherwise, a tag to check that licenses can be pulled

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-28 Thread Hannah Jiang
The above 1,2,3,4 are merged to master.
Here I would like to change behaviors with *docker-pull-licenses* tag and
would like to collect feedbacks from community before finalizing my PR.

docker-pull-licenses can be set to one of ['add', 'pull', 'skip'].

   - *docker-pull-licenses=add* pulls licenses/notices and the files are
   added to docker images. This tag is used to create release docker images.
   - *docker-pull-licenses=pull* pulls licenses/notices but do not add the
   files to docker images so that create lightweight images. This is the
   default mode, and docker images created by Jenkins test use this mode. This
   will make sure the files are all pullable and detect issues with the tool,
   except the ADD part at DockerFile, which is rarely likely to be an issue.
   It is better than the checking URLs approach mentioned above, because it is
   a more similar process like creating release images and code can be
   simplified so that make it easier to maintain.
   - *docker-pull-licenses=skip* skips all license pulling related tasks.
   This tag is provided for users who customize their images with DockerFile
   provided by us and would not like to deal with license stuff. Without this
   tag, the users should change the .gradle file to get rid of it. It also can
   be used for local development when developers want to solve license related
   issues later.

I would like to see this tag is adopted by all docker images released by
Beam, including Flink and Spark job server images, and share the same
default mode.
Does this sound good? Are there any concerns?

Please let me know what you think.
Hannah



On Thu, Apr 16, 2020 at 11:46 AM Hannah Jiang 
wrote:

> Ok, so the fianl approach is
> 1. Using multi threading with 16 threads.
> 2. Introduce docker-pull-license tag, by default it is disabled.
> 3. By default, only check if urls return 200, not actually pull the files
> to reduce image size. Licenses add 85MB of image size as of today.
> 4. Add retries to avoid flakiness. Initially it was added to download part
> only, assuming urlib is stable when validates urls, but it is not true.
>
> For caching, it is definitely useful. It will mitigate flakiness further,
> like when github is down (many urls point to github). It also reduces
> unnecessary internet traffic.
> Let's reconsider it after the current work is merged to 2.21.0.
>
> Thank you all for providing feedback and ideas.
>
>
> On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver  wrote:
>
>> > I tried with multi processing and it improved the performance a lot.
>>
>> Great! Though it won't help with flakes, so as you said we should still
>> look into caching as well.
>>
>> > You could try using a threadpool
>>
>> +1
>>
>> > However, we would like to include this work as part of 2.21.0
>>
>> I had marked the jira as a blocker for 2.21.0 because I was afraid
>> something was broken, but now it looks like the failures were just flakes.
>> So BEAM-9764  should
>> not be a release blocker.
>>
>> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:
>>
>>> Hi Hannah,
>>>
>>> Thanks for investigating!
>>>
>>> I think it would be great to eliminate the overhead for local builds (by
>>> default turn off the license assembly) and make it as lightweight
>>> as possible in the frequent CI path.
>>>
>>> Thomas
>>>
>>>
>>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
>>> wrote:
>>>
 I tried to check if urls are valid instead of pulling the files and it
 reduced only 1 min of running time. So, it's not an option here.

 I tried with multi processing and it improved the performance a lot.
 With 12 subprocesses, the running time reduced to 49 seconds, and with
 16 cores, it reduced to 18 seconds.
 The number of subprocesses is defined by the number of cores, and
 Jenkins machine has 16 cores.
 FYI: with my local machine (12 cores) and home network, it takes 1min
 40 secs to create a Java docker image.

 The caching approach mentioned by Robert brings many benefits, not only
 to this use case.
 However, we would like to include this work as part of 2.21.0, so I
 will move with the multi processing approach this time.

 Please let me know if you have objections.


 On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
 wrote:

> Is the cost primarily in pulling these remote licenses/sources? I'd
> guess that 99.9% of the URLs remain the same from run to run. Would a
> simple cache, or caching proxy, be sufficient?
>
> Otherwise, a tag to check that licenses can be pulled, but not really
> pull them, might be sufficient. (Making sure the default is cheap but we
> don't accidentally omit them when it matters is the tricky bit I see 
> here.)
>
>
> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
> wrote:
>
>> Thanks for providing feedback.
>>
>> Here is what happending now an

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-16 Thread Hannah Jiang
Ok, so the fianl approach is
1. Using multi threading with 16 threads.
2. Introduce docker-pull-license tag, by default it is disabled.
3. By default, only check if urls return 200, not actually pull the files
to reduce image size. Licenses add 85MB of image size as of today.
4. Add retries to avoid flakiness. Initially it was added to download part
only, assuming urlib is stable when validates urls, but it is not true.

For caching, it is definitely useful. It will mitigate flakiness further,
like when github is down (many urls point to github). It also reduces
unnecessary internet traffic.
Let's reconsider it after the current work is merged to 2.21.0.

Thank you all for providing feedback and ideas.


On Thu, Apr 16, 2020 at 10:24 AM Kyle Weaver  wrote:

> > I tried with multi processing and it improved the performance a lot.
>
> Great! Though it won't help with flakes, so as you said we should still
> look into caching as well.
>
> > You could try using a threadpool
>
> +1
>
> > However, we would like to include this work as part of 2.21.0
>
> I had marked the jira as a blocker for 2.21.0 because I was afraid
> something was broken, but now it looks like the failures were just flakes.
> So BEAM-9764  should not
> be a release blocker.
>
> On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:
>
>> Hi Hannah,
>>
>> Thanks for investigating!
>>
>> I think it would be great to eliminate the overhead for local builds (by
>> default turn off the license assembly) and make it as lightweight
>> as possible in the frequent CI path.
>>
>> Thomas
>>
>>
>> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
>> wrote:
>>
>>> I tried to check if urls are valid instead of pulling the files and it
>>> reduced only 1 min of running time. So, it's not an option here.
>>>
>>> I tried with multi processing and it improved the performance a lot.
>>> With 12 subprocesses, the running time reduced to 49 seconds, and with
>>> 16 cores, it reduced to 18 seconds.
>>> The number of subprocesses is defined by the number of cores, and
>>> Jenkins machine has 16 cores.
>>> FYI: with my local machine (12 cores) and home network, it takes 1min 40
>>> secs to create a Java docker image.
>>>
>>> The caching approach mentioned by Robert brings many benefits, not only
>>> to this use case.
>>> However, we would like to include this work as part of 2.21.0, so I will
>>> move with the multi processing approach this time.
>>>
>>> Please let me know if you have objections.
>>>
>>>
>>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
>>> wrote:
>>>
 Is the cost primarily in pulling these remote licenses/sources? I'd
 guess that 99.9% of the URLs remain the same from run to run. Would a
 simple cache, or caching proxy, be sufficient?

 Otherwise, a tag to check that licenses can be pulled, but not really
 pull them, might be sufficient. (Making sure the default is cheap but we
 don't accidentally omit them when it matters is the tricky bit I see here.)


 On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
 wrote:

> Thanks for providing feedback.
>
> Here is what happending now and I would discuss when to run the job.
>
> *Why it takes 7-8 mins for Java?*
> When we list dependencies both in runtime and compile environment,
> there are almost 1400 third party dependencies and we need to pull
> licenses/notices for all of them.
> In addition, we need to pull source code if license is CDDL, MPL, GLP
> or LGPL. 69 of the dependencies need to pull the source code as of
> 4/14/2020.
> Getting dependency list + pulling licenses/notices/source code takes
> 7-8 minutes.
>
> Now I see there are *two patterns of failures*.
> 1. In valid URLs. In fact, the urls are not invalid, but occationally,
> it returns URLError. This can be resolved by adding retries. However, it
> will add runtime to the job.
> 2. No artifacts available. Sometimes, when a new version of package is
> released  and the plugin still looks for staging location. For example, 
> new
> zetasql packages were released on 4/14, and today I saw several failures
> with looking for staging repo. The behavior is not consistent, sometimes 
> it
> scans correct location, sometimes not. This can be resolved by running the
> job again.
>
> *When the job is running?*
> generateThirdPartyLicenses is added to :sdks:java:container and it is
> an upstream of the docker task. As such, whenever a docker is created, the
> job is triggered.
> :sdks:java:container:docker is added to Java PreSubmit job.
>
> *How to improve it?*
> According to some ideas provided above, how about doing this?
> Introduce a tag (ie: pull-licenses) to docker job to decide if pull
> the files. Default tag is NOT setting pull-licenses.
> When pull-licenses is not set, it checks if the
> licenses/

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-16 Thread Kyle Weaver
> I tried with multi processing and it improved the performance a lot.

Great! Though it won't help with flakes, so as you said we should still
look into caching as well.

> You could try using a threadpool

+1

> However, we would like to include this work as part of 2.21.0

I had marked the jira as a blocker for 2.21.0 because I was afraid
something was broken, but now it looks like the failures were just flakes.
So BEAM-9764  should not
be a release blocker.

On Thu, Apr 16, 2020 at 1:00 PM Thomas Weise  wrote:

> Hi Hannah,
>
> Thanks for investigating!
>
> I think it would be great to eliminate the overhead for local builds (by
> default turn off the license assembly) and make it as lightweight
> as possible in the frequent CI path.
>
> Thomas
>
>
> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
> wrote:
>
>> I tried to check if urls are valid instead of pulling the files and it
>> reduced only 1 min of running time. So, it's not an option here.
>>
>> I tried with multi processing and it improved the performance a lot.
>> With 12 subprocesses, the running time reduced to 49 seconds, and with 16
>> cores, it reduced to 18 seconds.
>> The number of subprocesses is defined by the number of cores, and Jenkins
>> machine has 16 cores.
>> FYI: with my local machine (12 cores) and home network, it takes 1min 40
>> secs to create a Java docker image.
>>
>> The caching approach mentioned by Robert brings many benefits, not only
>> to this use case.
>> However, we would like to include this work as part of 2.21.0, so I will
>> move with the multi processing approach this time.
>>
>> Please let me know if you have objections.
>>
>>
>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
>> wrote:
>>
>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>> simple cache, or caching proxy, be sufficient?
>>>
>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>>
>>>
>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
>>> wrote:
>>>
 Thanks for providing feedback.

 Here is what happending now and I would discuss when to run the job.

 *Why it takes 7-8 mins for Java?*
 When we list dependencies both in runtime and compile environment,
 there are almost 1400 third party dependencies and we need to pull
 licenses/notices for all of them.
 In addition, we need to pull source code if license is CDDL, MPL, GLP
 or LGPL. 69 of the dependencies need to pull the source code as of
 4/14/2020.
 Getting dependency list + pulling licenses/notices/source code takes
 7-8 minutes.

 Now I see there are *two patterns of failures*.
 1. In valid URLs. In fact, the urls are not invalid, but occationally,
 it returns URLError. This can be resolved by adding retries. However, it
 will add runtime to the job.
 2. No artifacts available. Sometimes, when a new version of package is
 released  and the plugin still looks for staging location. For example, new
 zetasql packages were released on 4/14, and today I saw several failures
 with looking for staging repo. The behavior is not consistent, sometimes it
 scans correct location, sometimes not. This can be resolved by running the
 job again.

 *When the job is running?*
 generateThirdPartyLicenses is added to :sdks:java:container and it is
 an upstream of the docker task. As such, whenever a docker is created, the
 job is triggered.
 :sdks:java:container:docker is added to Java PreSubmit job.

 *How to improve it?*
 According to some ideas provided above, how about doing this?
 Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
 files. Default tag is NOT setting pull-licenses.
 When pull-licenses is not set, it checks if the licenses/notices/source
 code can be pull automaticall or they have urls to pull from, but don't
 really pull.
 When pull-license is set, files are pulled.

 For each PR (Presubmit): applying default option. The test would fail
 if the files cannot be pulled, so committers still need to fix dependency
 errors. I believe it would reduce the running time.
 For release: set the tag and pull the files and source code. Since it
 is checked for each PR, pulling should finish without problems.

 Please let me know what you think and if there are other things can be
 improved.

 Hannah



 On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver 
 wrote:

> Looks like the same error as this Jira:
> https://issues.apache.org/jira/browse/BEAM-9764
>
> Even if/when we are able to fix this particular issue, I agree it is
> 

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-16 Thread Robert Bradshaw
That sounds very promising.

You could try using a threadpool rather than the (more heavyweight)
multiprocessing, as this is almost certainly an IO-bound task. You won't
need to be bound by the number of cores.

On Thu, Apr 16, 2020 at 10:00 AM Thomas Weise  wrote:

> Hi Hannah,
>
> Thanks for investigating!
>
> I think it would be great to eliminate the overhead for local builds (by
> default turn off the license assembly) and make it as lightweight
> as possible in the frequent CI path.
>
> Thomas
>
>
> On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang 
> wrote:
>
>> I tried to check if urls are valid instead of pulling the files and it
>> reduced only 1 min of running time. So, it's not an option here.
>>
>> I tried with multi processing and it improved the performance a lot.
>> With 12 subprocesses, the running time reduced to 49 seconds, and with 16
>> cores, it reduced to 18 seconds.
>> The number of subprocesses is defined by the number of cores, and Jenkins
>> machine has 16 cores.
>> FYI: with my local machine (12 cores) and home network, it takes 1min 40
>> secs to create a Java docker image.
>>
>> The caching approach mentioned by Robert brings many benefits, not only
>> to this use case.
>> However, we would like to include this work as part of 2.21.0, so I will
>> move with the multi processing approach this time.
>>
>> Please let me know if you have objections.
>>
>>
>> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
>> wrote:
>>
>>> Is the cost primarily in pulling these remote licenses/sources? I'd
>>> guess that 99.9% of the URLs remain the same from run to run. Would a
>>> simple cache, or caching proxy, be sufficient?
>>>
>>> Otherwise, a tag to check that licenses can be pulled, but not really
>>> pull them, might be sufficient. (Making sure the default is cheap but we
>>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>>
>>>
>>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
>>> wrote:
>>>
 Thanks for providing feedback.

 Here is what happending now and I would discuss when to run the job.

 *Why it takes 7-8 mins for Java?*
 When we list dependencies both in runtime and compile environment,
 there are almost 1400 third party dependencies and we need to pull
 licenses/notices for all of them.
 In addition, we need to pull source code if license is CDDL, MPL, GLP
 or LGPL. 69 of the dependencies need to pull the source code as of
 4/14/2020.
 Getting dependency list + pulling licenses/notices/source code takes
 7-8 minutes.

 Now I see there are *two patterns of failures*.
 1. In valid URLs. In fact, the urls are not invalid, but occationally,
 it returns URLError. This can be resolved by adding retries. However, it
 will add runtime to the job.
 2. No artifacts available. Sometimes, when a new version of package is
 released  and the plugin still looks for staging location. For example, new
 zetasql packages were released on 4/14, and today I saw several failures
 with looking for staging repo. The behavior is not consistent, sometimes it
 scans correct location, sometimes not. This can be resolved by running the
 job again.

 *When the job is running?*
 generateThirdPartyLicenses is added to :sdks:java:container and it is
 an upstream of the docker task. As such, whenever a docker is created, the
 job is triggered.
 :sdks:java:container:docker is added to Java PreSubmit job.

 *How to improve it?*
 According to some ideas provided above, how about doing this?
 Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
 files. Default tag is NOT setting pull-licenses.
 When pull-licenses is not set, it checks if the licenses/notices/source
 code can be pull automaticall or they have urls to pull from, but don't
 really pull.
 When pull-license is set, files are pulled.

 For each PR (Presubmit): applying default option. The test would fail
 if the files cannot be pulled, so committers still need to fix dependency
 errors. I believe it would reduce the running time.
 For release: set the tag and pull the files and source code. Since it
 is checked for each PR, pulling should finish without problems.

 Please let me know what you think and if there are other things can be
 improved.

 Hannah



 On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver 
 wrote:

> Looks like the same error as this Jira:
> https://issues.apache.org/jira/browse/BEAM-9764
>
> Even if/when we are able to fix this particular issue, I agree it is
> best not to run this job except for releases because of the inherent
> network cost and possible reliability issues. +Hannah Jiang
>  What do you think?
>
> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>
>> The new feature to assemble licenses is very useful but appears 

Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-16 Thread Thomas Weise
Hi Hannah,

Thanks for investigating!

I think it would be great to eliminate the overhead for local builds (by
default turn off the license assembly) and make it as lightweight
as possible in the frequent CI path.

Thomas


On Thu, Apr 16, 2020 at 1:37 AM Hannah Jiang  wrote:

> I tried to check if urls are valid instead of pulling the files and it
> reduced only 1 min of running time. So, it's not an option here.
>
> I tried with multi processing and it improved the performance a lot.
> With 12 subprocesses, the running time reduced to 49 seconds, and with 16
> cores, it reduced to 18 seconds.
> The number of subprocesses is defined by the number of cores, and Jenkins
> machine has 16 cores.
> FYI: with my local machine (12 cores) and home network, it takes 1min 40
> secs to create a Java docker image.
>
> The caching approach mentioned by Robert brings many benefits, not only to
> this use case.
> However, we would like to include this work as part of 2.21.0, so I will
> move with the multi processing approach this time.
>
> Please let me know if you have objections.
>
>
> On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw 
> wrote:
>
>> Is the cost primarily in pulling these remote licenses/sources? I'd guess
>> that 99.9% of the URLs remain the same from run to run. Would a simple
>> cache, or caching proxy, be sufficient?
>>
>> Otherwise, a tag to check that licenses can be pulled, but not really
>> pull them, might be sufficient. (Making sure the default is cheap but we
>> don't accidentally omit them when it matters is the tricky bit I see here.)
>>
>>
>> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
>> wrote:
>>
>>> Thanks for providing feedback.
>>>
>>> Here is what happending now and I would discuss when to run the job.
>>>
>>> *Why it takes 7-8 mins for Java?*
>>> When we list dependencies both in runtime and compile environment, there
>>> are almost 1400 third party dependencies and we need to pull
>>> licenses/notices for all of them.
>>> In addition, we need to pull source code if license is CDDL, MPL, GLP or
>>> LGPL. 69 of the dependencies need to pull the source code as of 4/14/2020.
>>> Getting dependency list + pulling licenses/notices/source code takes 7-8
>>> minutes.
>>>
>>> Now I see there are *two patterns of failures*.
>>> 1. In valid URLs. In fact, the urls are not invalid, but occationally,
>>> it returns URLError. This can be resolved by adding retries. However, it
>>> will add runtime to the job.
>>> 2. No artifacts available. Sometimes, when a new version of package is
>>> released  and the plugin still looks for staging location. For example, new
>>> zetasql packages were released on 4/14, and today I saw several failures
>>> with looking for staging repo. The behavior is not consistent, sometimes it
>>> scans correct location, sometimes not. This can be resolved by running the
>>> job again.
>>>
>>> *When the job is running?*
>>> generateThirdPartyLicenses is added to :sdks:java:container and it is an
>>> upstream of the docker task. As such, whenever a docker is created, the job
>>> is triggered.
>>> :sdks:java:container:docker is added to Java PreSubmit job.
>>>
>>> *How to improve it?*
>>> According to some ideas provided above, how about doing this?
>>> Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
>>> files. Default tag is NOT setting pull-licenses.
>>> When pull-licenses is not set, it checks if the licenses/notices/source
>>> code can be pull automaticall or they have urls to pull from, but don't
>>> really pull.
>>> When pull-license is set, files are pulled.
>>>
>>> For each PR (Presubmit): applying default option. The test would fail if
>>> the files cannot be pulled, so committers still need to fix dependency
>>> errors. I believe it would reduce the running time.
>>> For release: set the tag and pull the files and source code. Since it is
>>> checked for each PR, pulling should finish without problems.
>>>
>>> Please let me know what you think and if there are other things can be
>>> improved.
>>>
>>> Hannah
>>>
>>>
>>>
>>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:
>>>
 Looks like the same error as this Jira:
 https://issues.apache.org/jira/browse/BEAM-9764

 Even if/when we are able to fix this particular issue, I agree it is
 best not to run this job except for releases because of the inherent
 network cost and possible reliability issues. +Hannah Jiang
  What do you think?

 On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:

> The new feature to assemble licenses is very useful but appears to add
> several minutes (7-8?)  build time to jobs that need to build a container.
>
> Does it also seem to cause occasional build failures?
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>
> Would it be possible to perform this task only during release builds?
>
> Thanks,
> Thomas
>
>


Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-16 Thread Hannah Jiang
I tried to check if urls are valid instead of pulling the files and it
reduced only 1 min of running time. So, it's not an option here.

I tried with multi processing and it improved the performance a lot.
With 12 subprocesses, the running time reduced to 49 seconds, and with 16
cores, it reduced to 18 seconds.
The number of subprocesses is defined by the number of cores, and Jenkins
machine has 16 cores.
FYI: with my local machine (12 cores) and home network, it takes 1min 40
secs to create a Java docker image.

The caching approach mentioned by Robert brings many benefits, not only to
this use case.
However, we would like to include this work as part of 2.21.0, so I will
move with the multi processing approach this time.

Please let me know if you have objections.


On Wed, Apr 15, 2020 at 4:01 PM Robert Bradshaw  wrote:

> Is the cost primarily in pulling these remote licenses/sources? I'd guess
> that 99.9% of the URLs remain the same from run to run. Would a simple
> cache, or caching proxy, be sufficient?
>
> Otherwise, a tag to check that licenses can be pulled, but not really pull
> them, might be sufficient. (Making sure the default is cheap but we don't
> accidentally omit them when it matters is the tricky bit I see here.)
>
>
> On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang 
> wrote:
>
>> Thanks for providing feedback.
>>
>> Here is what happending now and I would discuss when to run the job.
>>
>> *Why it takes 7-8 mins for Java?*
>> When we list dependencies both in runtime and compile environment, there
>> are almost 1400 third party dependencies and we need to pull
>> licenses/notices for all of them.
>> In addition, we need to pull source code if license is CDDL, MPL, GLP or
>> LGPL. 69 of the dependencies need to pull the source code as of 4/14/2020.
>> Getting dependency list + pulling licenses/notices/source code takes 7-8
>> minutes.
>>
>> Now I see there are *two patterns of failures*.
>> 1. In valid URLs. In fact, the urls are not invalid, but occationally, it
>> returns URLError. This can be resolved by adding retries. However, it will
>> add runtime to the job.
>> 2. No artifacts available. Sometimes, when a new version of package is
>> released  and the plugin still looks for staging location. For example, new
>> zetasql packages were released on 4/14, and today I saw several failures
>> with looking for staging repo. The behavior is not consistent, sometimes it
>> scans correct location, sometimes not. This can be resolved by running the
>> job again.
>>
>> *When the job is running?*
>> generateThirdPartyLicenses is added to :sdks:java:container and it is an
>> upstream of the docker task. As such, whenever a docker is created, the job
>> is triggered.
>> :sdks:java:container:docker is added to Java PreSubmit job.
>>
>> *How to improve it?*
>> According to some ideas provided above, how about doing this?
>> Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
>> files. Default tag is NOT setting pull-licenses.
>> When pull-licenses is not set, it checks if the licenses/notices/source
>> code can be pull automaticall or they have urls to pull from, but don't
>> really pull.
>> When pull-license is set, files are pulled.
>>
>> For each PR (Presubmit): applying default option. The test would fail if
>> the files cannot be pulled, so committers still need to fix dependency
>> errors. I believe it would reduce the running time.
>> For release: set the tag and pull the files and source code. Since it is
>> checked for each PR, pulling should finish without problems.
>>
>> Please let me know what you think and if there are other things can be
>> improved.
>>
>> Hannah
>>
>>
>>
>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:
>>
>>> Looks like the same error as this Jira:
>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>
>>> Even if/when we are able to fix this particular issue, I agree it is
>>> best not to run this job except for releases because of the inherent
>>> network cost and possible reliability issues. +Hannah Jiang
>>>  What do you think?
>>>
>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>>>
 The new feature to assemble licenses is very useful but appears to add
 several minutes (7-8?)  build time to jobs that need to build a container.

 Does it also seem to cause occasional build failures?


 https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/

 Would it be possible to perform this task only during release builds?

 Thanks,
 Thomas




Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Robert Bradshaw
Is the cost primarily in pulling these remote licenses/sources? I'd guess
that 99.9% of the URLs remain the same from run to run. Would a simple
cache, or caching proxy, be sufficient?

Otherwise, a tag to check that licenses can be pulled, but not really pull
them, might be sufficient. (Making sure the default is cheap but we don't
accidentally omit them when it matters is the tricky bit I see here.)


On Wed, Apr 15, 2020 at 3:38 PM Hannah Jiang  wrote:

> Thanks for providing feedback.
>
> Here is what happending now and I would discuss when to run the job.
>
> *Why it takes 7-8 mins for Java?*
> When we list dependencies both in runtime and compile environment, there
> are almost 1400 third party dependencies and we need to pull
> licenses/notices for all of them.
> In addition, we need to pull source code if license is CDDL, MPL, GLP or
> LGPL. 69 of the dependencies need to pull the source code as of 4/14/2020.
> Getting dependency list + pulling licenses/notices/source code takes 7-8
> minutes.
>
> Now I see there are *two patterns of failures*.
> 1. In valid URLs. In fact, the urls are not invalid, but occationally, it
> returns URLError. This can be resolved by adding retries. However, it will
> add runtime to the job.
> 2. No artifacts available. Sometimes, when a new version of package is
> released  and the plugin still looks for staging location. For example, new
> zetasql packages were released on 4/14, and today I saw several failures
> with looking for staging repo. The behavior is not consistent, sometimes it
> scans correct location, sometimes not. This can be resolved by running the
> job again.
>
> *When the job is running?*
> generateThirdPartyLicenses is added to :sdks:java:container and it is an
> upstream of the docker task. As such, whenever a docker is created, the job
> is triggered.
> :sdks:java:container:docker is added to Java PreSubmit job.
>
> *How to improve it?*
> According to some ideas provided above, how about doing this?
> Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
> files. Default tag is NOT setting pull-licenses.
> When pull-licenses is not set, it checks if the licenses/notices/source
> code can be pull automaticall or they have urls to pull from, but don't
> really pull.
> When pull-license is set, files are pulled.
>
> For each PR (Presubmit): applying default option. The test would fail if
> the files cannot be pulled, so committers still need to fix dependency
> errors. I believe it would reduce the running time.
> For release: set the tag and pull the files and source code. Since it is
> checked for each PR, pulling should finish without problems.
>
> Please let me know what you think and if there are other things can be
> improved.
>
> Hannah
>
>
>
> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:
>
>> Looks like the same error as this Jira:
>> https://issues.apache.org/jira/browse/BEAM-9764
>>
>> Even if/when we are able to fix this particular issue, I agree it is best
>> not to run this job except for releases because of the inherent network
>> cost and possible reliability issues. +Hannah Jiang
>>  What do you think?
>>
>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>>
>>> The new feature to assemble licenses is very useful but appears to add
>>> several minutes (7-8?)  build time to jobs that need to build a container.
>>>
>>> Does it also seem to cause occasional build failures?
>>>
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>
>>> Would it be possible to perform this task only during release builds?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>


Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Hannah Jiang
Thanks for providing feedback.

Here is what happending now and I would discuss when to run the job.

*Why it takes 7-8 mins for Java?*
When we list dependencies both in runtime and compile environment, there
are almost 1400 third party dependencies and we need to pull
licenses/notices for all of them.
In addition, we need to pull source code if license is CDDL, MPL, GLP or
LGPL. 69 of the dependencies need to pull the source code as of 4/14/2020.
Getting dependency list + pulling licenses/notices/source code takes 7-8
minutes.

Now I see there are *two patterns of failures*.
1. In valid URLs. In fact, the urls are not invalid, but occationally, it
returns URLError. This can be resolved by adding retries. However, it will
add runtime to the job.
2. No artifacts available. Sometimes, when a new version of package is
released  and the plugin still looks for staging location. For example, new
zetasql packages were released on 4/14, and today I saw several failures
with looking for staging repo. The behavior is not consistent, sometimes it
scans correct location, sometimes not. This can be resolved by running the
job again.

*When the job is running?*
generateThirdPartyLicenses is added to :sdks:java:container and it is an
upstream of the docker task. As such, whenever a docker is created, the job
is triggered.
:sdks:java:container:docker is added to Java PreSubmit job.

*How to improve it?*
According to some ideas provided above, how about doing this?
Introduce a tag (ie: pull-licenses) to docker job to decide if pull the
files. Default tag is NOT setting pull-licenses.
When pull-licenses is not set, it checks if the licenses/notices/source
code can be pull automaticall or they have urls to pull from, but don't
really pull.
When pull-license is set, files are pulled.

For each PR (Presubmit): applying default option. The test would fail if
the files cannot be pulled, so committers still need to fix dependency
errors. I believe it would reduce the running time.
For release: set the tag and pull the files and source code. Since it is
checked for each PR, pulling should finish without problems.

Please let me know what you think and if there are other things can be
improved.

Hannah



On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:

> Looks like the same error as this Jira:
> https://issues.apache.org/jira/browse/BEAM-9764
>
> Even if/when we are able to fix this particular issue, I agree it is best
> not to run this job except for releases because of the inherent network
> cost and possible reliability issues. +Hannah Jiang
>  What do you think?
>
> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>
>> The new feature to assemble licenses is very useful but appears to add
>> several minutes (7-8?)  build time to jobs that need to build a container.
>>
>> Does it also seem to cause occasional build failures?
>>
>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>
>> Would it be possible to perform this task only during release builds?
>>
>> Thanks,
>> Thomas
>>
>>


Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Robert Bradshaw
In terms of pre-commit, 7-8 minutes seems worth not having to debug
dependencies that broke this in post-commit. We should look at caching
(IIRC we've long wanted to do this for pip and maven packages anyway). We
could also consider whether, for development purposes, we could build
"lite" containers that these could build on, if caching is not sufficient...

On Wed, Apr 15, 2020 at 3:21 PM Kyle Weaver  wrote:

> Iirc we chose to put it in precommit so that it would block adding new
> dependencies without a license. Maybe we can look into caching as a way to
> mitigate run time and flakes?
>
> On Wed, Apr 15, 2020 at 6:16 PM Udi Meiri  wrote:
>
>> If this process is used in releases we would benefit from running it
>> regularly to ensure it isn't broken and thus delay releases (and add work
>> for the release manager).
>> Does it make sense to put it in postcommit?
>>
>> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:
>>
>>> Looks like the same error as this Jira:
>>> https://issues.apache.org/jira/browse/BEAM-9764
>>>
>>> Even if/when we are able to fix this particular issue, I agree it is
>>> best not to run this job except for releases because of the inherent
>>> network cost and possible reliability issues. +Hannah Jiang
>>>  What do you think?
>>>
>>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>>>
 The new feature to assemble licenses is very useful but appears to add
 several minutes (7-8?)  build time to jobs that need to build a container.

 Does it also seem to cause occasional build failures?


 https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/

 Would it be possible to perform this task only during release builds?

 Thanks,
 Thomas




Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Kyle Weaver
Iirc we chose to put it in precommit so that it would block adding new
dependencies without a license. Maybe we can look into caching as a way to
mitigate run time and flakes?

On Wed, Apr 15, 2020 at 6:16 PM Udi Meiri  wrote:

> If this process is used in releases we would benefit from running it
> regularly to ensure it isn't broken and thus delay releases (and add work
> for the release manager).
> Does it make sense to put it in postcommit?
>
> On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:
>
>> Looks like the same error as this Jira:
>> https://issues.apache.org/jira/browse/BEAM-9764
>>
>> Even if/when we are able to fix this particular issue, I agree it is best
>> not to run this job except for releases because of the inherent network
>> cost and possible reliability issues. +Hannah Jiang
>>  What do you think?
>>
>> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>>
>>> The new feature to assemble licenses is very useful but appears to add
>>> several minutes (7-8?)  build time to jobs that need to build a container.
>>>
>>> Does it also seem to cause occasional build failures?
>>>
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>>
>>> Would it be possible to perform this task only during release builds?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>


Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Udi Meiri
If this process is used in releases we would benefit from running it
regularly to ensure it isn't broken and thus delay releases (and add work
for the release manager).
Does it make sense to put it in postcommit?

On Wed, Apr 15, 2020 at 2:30 PM Kyle Weaver  wrote:

> Looks like the same error as this Jira:
> https://issues.apache.org/jira/browse/BEAM-9764
>
> Even if/when we are able to fix this particular issue, I agree it is best
> not to run this job except for releases because of the inherent network
> cost and possible reliability issues. +Hannah Jiang
>  What do you think?
>
> On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:
>
>> The new feature to assemble licenses is very useful but appears to add
>> several minutes (7-8?)  build time to jobs that need to build a container.
>>
>> Does it also seem to cause occasional build failures?
>>
>> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>>
>> Would it be possible to perform this task only during release builds?
>>
>> Thanks,
>> Thomas
>>
>>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Kyle Weaver
Looks like the same error as this Jira:
https://issues.apache.org/jira/browse/BEAM-9764

Even if/when we are able to fix this particular issue, I agree it is best
not to run this job except for releases because of the inherent network
cost and possible reliability issues. +Hannah Jiang
 What
do you think?

On Wed, Apr 15, 2020 at 5:20 PM Thomas Weise  wrote:

> The new feature to assemble licenses is very useful but appears to add
> several minutes (7-8?)  build time to jobs that need to build a container.
>
> Does it also seem to cause occasional build failures?
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/
>
> Would it be possible to perform this task only during release builds?
>
> Thanks,
> Thomas
>
>


sdks:java:container:generateThirdPartyLicenses effect on build time / stability

2020-04-15 Thread Thomas Weise
The new feature to assemble licenses is very useful but appears to add
several minutes (7-8?)  build time to jobs that need to build a container.

Does it also seem to cause occasional build failures?

https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/131/

Would it be possible to perform this task only during release builds?

Thanks,
Thomas