Re: Credentials Rotation Failure on IO-Datastores cluster

2023-11-01 Thread Danny McCormick via dev
My guess is that this is due to running this both on GitHub Actions and
Jenkins. The Actions run succeeded, so I don't think we need to worry about
this - https://github.com/apache/beam/actions/runs/6714783844

It seems like the opposite happened for the metrics job - the Actions run
failed, triggering an email, and the Jenkins job succeeded -
https://ci-beam.apache.org/job/Rotate%20Metrics%20Cluster%20Credentials/

I put up https://github.com/apache/beam/pull/29243 to remove the Jenkins
jobs in favor of the Actions jobs, which should fix the issue.

Thanks,
Danny

On Tue, Oct 31, 2023 at 10:41 PM Svetak Sundhar via dev 
wrote:

> I took a quick look -- the error is the following:
>
> *22:17:26* ERROR: (gcloud.container.clusters.update) ResponseError: code=400, 
> message=Operation 
> operation-1698804621818-e9c8fe33-d4a2-44cd-86aa-9c4e09dea259 is currently 
> upgrading cluster io-datastores. Please wait and try again once it is done.
>
>
>
>
> This is different than the last time this error happened 
> (https://lists.apache.org/thread/xw2hx8yycpfmhf64w0vyt96r0d8zwnyg)
>
>
> I noticed node pool pool-1 was still updating when this error was sent, so I 
> think it should succeed now.
>
>
> Should we retrigger the seed job manually?
>
>
>
> Svetak Sundhar
>
>   Data Engineer
> svetaksund...@google.com
>
>
>
> On Tue, Oct 31, 2023 at 10:17 PM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> Something went wrong during the automatic credentials rotation for
>> IO-Datastores Cluster, performed at Wed Nov 01 00:52:45 UTC 2023. It may be
>> necessary to check the state of the cluster certificates. For further
>> details refer to the following links:
>>  * Failing job:
>> https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/
>>  * Job configuration:
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy
>>  * Cluster URL:
>> https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing
>
>


Re: Credentials Rotation Failure on IO-Datastores cluster

2023-10-31 Thread Svetak Sundhar via dev
I took a quick look -- the error is the following:

*22:17:26* ERROR: (gcloud.container.clusters.update) ResponseError:
code=400, message=Operation
operation-1698804621818-e9c8fe33-d4a2-44cd-86aa-9c4e09dea259 is
currently upgrading cluster io-datastores. Please wait and try again
once it is done.




This is different than the last time this error happened
(https://lists.apache.org/thread/xw2hx8yycpfmhf64w0vyt96r0d8zwnyg)


I noticed node pool pool-1 was still updating when this error was
sent, so I think it should succeed now.
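
A minimal sketch of how a rotation step could wait out an in-progress cluster operation instead of failing on the first 400 response. It is not the logic the Beam job uses. The `run` callable, the retry count, and the delay are hypothetical, and the pattern is taken only from the error text quoted above; that GKE always words the message this way is an assumption.

```python
import re
import time
from typing import Callable, Tuple

# Matches the "cluster busy" error quoted in this thread. \s+ tolerates the
# line wrapping seen in the console output.
BUSY_PATTERN = re.compile(
    r"Operation\s+(\S+)\s+is\s+currently\s+upgrading\s+cluster"
)


def run_with_retry(
    run: Callable[[], Tuple[int, str]],
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call run() until it succeeds, fails for another reason, or attempts run out.

    run() is a hypothetical callable returning (exit_code, stderr_text), for
    example a thin wrapper around subprocess.run for the gcloud update command.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code, stderr = run()
    for _ in range(attempts - 1):
        # Stop on success, or on any error that is not "cluster busy".
        if code == 0 or not BUSY_PATTERN.search(stderr):
            break
        sleep(delay_seconds)
        code, stderr = run()
    return code, stderr
```

Only the "currently upgrading cluster" error is retried here; any other failure is returned immediately so that it still triggers the failure email.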


Should we retrigger the seed job manually?



Svetak Sundhar

  Data Engineer
svetaksund...@google.com



On Tue, Oct 31, 2023 at 10:17 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Something went wrong during the automatic credentials rotation for
> IO-Datastores Cluster, performed at Wed Nov 01 00:52:45 UTC 2023. It may be
> necessary to check the state of the cluster certificates. For further
> details refer to the following links:
>  * Failing job:
> https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/
>  * Job configuration:
> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy
>  * Cluster URL:
> https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing


Credentials Rotation Failure on IO-Datastores cluster

2023-10-31 Thread Apache Jenkins Server
Something went wrong during the automatic credentials rotation for 
IO-Datastores Cluster, performed at Wed Nov 01 00:52:45 UTC 2023. It may be 
necessary to check the state of the cluster certificates. For further details 
refer to the following links:
 * Failing job: 
https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/ 
 * Job configuration: 
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy
 * Cluster URL: 
https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing

Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Danny McCormick via dev
Update - the job has been successfully run and the permanent fix is merged.
I'll follow up with a PR to fix the links in the failure email.
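
A rough sketch of one way a failure email could point at the failing build rather than at a fixed job page. It is not the change made in the follow-up PR. It assumes the script runs inside a Jenkins build, where Jenkins sets the `BUILD_URL` and `JOB_URL` environment variables; the function name and the fallback text are hypothetical.

```python
import os
from typing import Mapping


def failure_links(env: Mapping[str, str] = os.environ) -> str:
    """Build the 'Failing job' line for the failure email.

    Prefers BUILD_URL (the specific build), then JOB_URL (the job page).
    The fallback string is a placeholder chosen for this sketch.
    """
    job_url = env.get("BUILD_URL") or env.get("JOB_URL") or "(job URL unavailable)"
    return f" * Failing job: {job_url}"
```

Passing the environment as a parameter keeps the function easy to check outside Jenkins.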

On Thu, Dec 1, 2022 at 2:00 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Something went wrong during the automatic credentials rotation for
> IO-Datastores Cluster, performed at Thu Dec 01 18:58:27 UTC 2022. It may be
> necessary to check the state of the cluster certificates. For further
> details refer to the following links:
>  * https://ci-beam.apache.org/job/beam_SeedJob/
>  * https://ci-beam.apache.org/.


Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Apache Jenkins Server
Something went wrong during the automatic credentials rotation for 
IO-Datastores Cluster, performed at Thu Dec 01 18:58:27 UTC 2022. It may be 
necessary to check the state of the cluster certificates. For further details 
refer to the following links:
 * https://ci-beam.apache.org/job/beam_SeedJob/ 
 * https://ci-beam.apache.org/.

Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Danny McCormick via dev
Does that have the potential to break other things? We could presumably also
update
https://github.com/apache/beam/blob/4718cdff87fed4f92636e94dbf3a04c2315d6a95/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy#L38
to use pool-1 instead.

I put up https://github.com/apache/beam/pull/24466 in case that is
preferable.

On Thu, Dec 1, 2022 at 1:29 PM Yi Hu  wrote:

> Thanks for reporting. I have bumped the pool size of io-datastores as we
> have more tests being added and the default-pool frequently becomes
> unschedulable due to memory constraints. A simple fix is to just rename
> 'pool1' back to 'default-pool'.
>
> On Thu, Dec 1, 2022 at 1:26 PM Danny McCormick 
> wrote:
>
>> Yes, I was just starting to look into this. Looks like this is the result
>> of this job failing -
>> https://github.com/apache/beam/blob/ec2a07b38c1f640c62e7c3b96966f18b334a7ce9/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy#L49
>>
>> The error is:
>>
>> ```
>>
>> *21:25:58* + gcloud container clusters upgrade io-datastores --node-pool=default-pool --zone=us-central1-a --quiet
>> *21:25:59* ERROR: (gcloud.container.clusters.upgrade) No node pool found matching the name [default-pool].
>>
>> ```
>>
>>
>> from 
>> https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/6/console
>>
>>
>> It looks like there's been some change to the cluster that is causing the
>> job to fail. If we don't fix this and rerun, the cluster's creds will
>> expire (probably in about a month). I'm not sure what the impact of that
>> would be, but I think probably broken IO integration tests.
>>
>> @John Casey or @Yi Hu might know more about this. I think the cluster in
>> question is
>> https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing
>>
>> Next steps are:
>> 1) figuring out why there's no longer a default-pool
>> 2) Either recreating it or modifying the cred rotation logic
>> 3) (Minor) Fixing the url in the Jenkins job so it actually points to the
>> failing job when we get emails like this
>>
>> On Thu, Dec 1, 2022 at 1:18 PM Byron Ellis via dev 
>> wrote:
>>
>>> Is there something we need to do here?
>>>
>>> On Thu, Dec 1, 2022 at 10:10 AM Apache Jenkins Server <
>>> jenk...@builds.apache.org> wrote:
>>>
>>>> Something went wrong during the automatic credentials rotation for
>>>> IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be
>>>> necessary to check the state of the cluster certificates. For further
>>>> details refer to the following links:
>>>>  * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/
>>>>  * https://ci-beam.apache.org/.
>>>
>>>


Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Yi Hu via dev
Thanks for reporting. I have bumped the pool size of io-datastores as we
have more tests being added and the default-pool frequently becomes
unschedulable due to memory constraints. A simple fix is to just rename
'pool1' back to 'default-pool'.

On Thu, Dec 1, 2022 at 1:26 PM Danny McCormick 
wrote:

> Yes, I was just starting to look into this. Looks like this is the result
> of this job failing -
> https://github.com/apache/beam/blob/ec2a07b38c1f640c62e7c3b96966f18b334a7ce9/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy#L49
>
> The error is:
>
> ```
>
> *21:25:58* + gcloud container clusters upgrade io-datastores --node-pool=default-pool --zone=us-central1-a --quiet
> *21:25:59* ERROR: (gcloud.container.clusters.upgrade) No node pool found matching the name [default-pool].
>
> ```
>
>
> from 
> https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/6/console
>
>
> It looks like there's been some change to the cluster that is causing the
> job to fail. If we don't fix this and rerun, the cluster's creds will
> expire (probably in about a month). I'm not sure what the impact of that
> would be, but I think probably broken IO integration tests.
>
> @John Casey or @Yi Hu might know more about this. I think the cluster in
> question is
> https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing
>
> Next steps are:
> 1) figuring out why there's no longer a default-pool
> 2) Either recreating it or modifying the cred rotation logic
> 3) (Minor) Fixing the url in the Jenkins job so it actually points to the
> failing job when we get emails like this
>
> On Thu, Dec 1, 2022 at 1:18 PM Byron Ellis via dev 
> wrote:
>
>> Is there something we need to do here?
>>
>> On Thu, Dec 1, 2022 at 10:10 AM Apache Jenkins Server <
>> jenk...@builds.apache.org> wrote:
>>
>>> Something went wrong during the automatic credentials rotation for
>>> IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be
>>> necessary to check the state of the cluster certificates. For further
>>> details refer to the following links:
>>>  * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/
>>>  * https://ci-beam.apache.org/.
>>
>>


Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Danny McCormick via dev
Yes, I was just starting to look into this. Looks like this is the result
of this job failing -
https://github.com/apache/beam/blob/ec2a07b38c1f640c62e7c3b96966f18b334a7ce9/.test-infra/jenkins/job_IODatastoresCredentialsRotation.groovy#L49

The error is:

```

*21:25:58* + gcloud container clusters upgrade io-datastores --node-pool=default-pool --zone=us-central1-a --quiet
*21:25:59* ERROR: (gcloud.container.clusters.upgrade) No node pool found matching the name [default-pool].

```


from 
https://ci-beam.apache.org/job/Rotate%20IO-Datastores%20Cluster%20Credentials/6/console


It looks like there's been some change to the cluster that is causing the
job to fail. If we don't fix this and rerun, the cluster's creds will
expire (probably in about a month). I'm not sure what the impact of that
would be, but I think probably broken IO integration tests.

@John Casey or @Yi Hu might know more about this. I think the cluster in
question is
https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/io-datastores/details?mods=dataflow_dev=apache-beam-testing

Next steps are:
1) Figuring out why there's no longer a default-pool
2) Either recreating it or modifying the cred rotation logic
3) (Minor) Fixing the URL in the Jenkins job so it actually points to the
failing job when we get emails like this
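
For step 2, one possible shape for "modifying the cred rotation logic" is to look up which node pools exist before upgrading, rather than hard-coding `default-pool`. The sketch below is illustrative only and is not what the Groovy job does. It assumes the input is the JSON printed by `gcloud container node-pools list --cluster=io-datastores --zone=us-central1-a --format=json`, taken here to be an array of objects with a `name` field; the function name and the preference list are hypothetical.

```python
import json
from typing import List, Optional


def pick_node_pool(pools_json: str, preferred: List[str]) -> Optional[str]:
    """Return the first preferred pool name that exists in the cluster, else None.

    pools_json is assumed to be a JSON array of node pool objects, each
    carrying a "name" field.
    """
    existing = {pool["name"] for pool in json.loads(pools_json)}
    for name in preferred:
        if name in existing:
            return name
    return None
```

With a preference list such as `["default-pool", "pool-1"]`, the caller can pass the result to `--node-pool=` and fail with a clear message when it is `None`.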

On Thu, Dec 1, 2022 at 1:18 PM Byron Ellis via dev 
wrote:

> Is there something we need to do here?
>
> On Thu, Dec 1, 2022 at 10:10 AM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> Something went wrong during the automatic credentials rotation for
>> IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be
>> necessary to check the state of the cluster certificates. For further
>> details refer to the following links:
>>  * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/
>>  * https://ci-beam.apache.org/.
>
>


Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Byron Ellis via dev
Is there something we need to do here?

On Thu, Dec 1, 2022 at 10:10 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Something went wrong during the automatic credentials rotation for
> IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be
> necessary to check the state of the cluster certificates. For further
> details refer to the following links:
>  * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/
>  * https://ci-beam.apache.org/.


Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Apache Jenkins Server
Something went wrong during the automatic credentials rotation for 
IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be 
necessary to check the state of the cluster certificates. For further details 
refer to the following links:
 * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/ 
 * https://ci-beam.apache.org/.

Credentials Rotation Failure on IO-Datastores cluster

2022-11-30 Thread Apache Jenkins Server
Something went wrong during the automatic credentials rotation for 
IO-Datastores Cluster, performed at Thu Dec 01 00:53:08 UTC 2022. It may be 
necessary to check the state of the cluster certificates. For further details 
refer to the following links:
 * https://ci-beam.apache.org/job/beam_SeedJob/ 
 * https://ci-beam.apache.org/.