Daniel Oliveira created BEAM-5973:
-------------------------------------

             Summary: [Flake] Various ValidatesRunner Post-commits flaking due to quota issues.
                 Key: BEAM-5973
                 URL: https://issues.apache.org/jira/browse/BEAM-5973
             Project: Beam
          Issue Type: Bug
          Components: test-failures
            Reporter: Daniel Oliveira


Multiple post-commits all seem to have failed at the same time due to extremely 
similar GCP errors:

beam_PostCommit_Java_GradleBuild: 
[https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1822/]

Several tests fail with one of the following two errors:
{noformat}
Nov 04, 2018 6:40:14 PM org.apache.beam.runners.dataflow.TestDataflowRunner$ErrorMonitorMessagesHandler process
INFO: Dataflow job 2018-11-04_10_37_12-7420261977214120411 threw exception. Failure message was: Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers. QUOTA_EXCEEDED: Quota 'DISKS_TOTAL_GB' exceeded. Limit: 200000.0 in region us-central1.{noformat}
{noformat}
Nov 04, 2018 6:39:14 PM org.apache.beam.runners.dataflow.TestDataflowRunner$ErrorMonitorMessagesHandler process
INFO: Dataflow job 2018-11-04_10_37_11-14433481609734431843 threw exception. Failure message was: Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers. QUOTA_EXCEEDED: Quota 'CPUS' exceeded. Limit: 750.0 in region us-central1.
{noformat}
beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_Gradle: 
[https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_Gradle/31/]

Test failures include the errors pasted above, plus one new error:

{noformat}
Nov 04, 2018 6:38:13 PM org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
SEVERE: 2018-11-04T18:38:04.612Z: Workflow failed. Causes: Project apache-beam-testing has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/7192 instances, 1/202 CPUs, 250/121 disk GB, 0/4046 SSD disk GB, 1/267 instance groups, 1/267 managed instance groups, 1/242 instance templates, 1/446 in-use IP addresses.{noformat}
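Read as required/available pairs, the quota summary shows that the job needed only 1 instance and 1 CPU but 250 disk GB against 121 available, so persistent disk is the quota that blocked it. A minimal Python sketch that parses the summary as pasted above and reports the exceeded quotas (the regex and the `insufficient_quotas` helper are hypothetical, not part of Beam or Dataflow):

```python
import re

# Quota summary copied from the SEVERE log line above.
QUOTA_SUMMARY = (
    "1/7192 instances, 1/202 CPUs, 250/121 disk GB, 0/4046 SSD disk GB, "
    "1/267 instance groups, 1/267 managed instance groups, "
    "1/242 instance templates, 1/446 in-use IP addresses"
)

# Each entry has the form "<required>/<available> <resource name>".
ENTRY = re.compile(r"(\d+)/(\d+) ([^,]+)")


def insufficient_quotas(summary: str) -> list[tuple[str, int, int]]:
    """Return (resource, required, available) for every quota the job would exceed."""
    shortfalls = []
    for required, available, resource in ENTRY.findall(summary):
        required, available = int(required), int(available)
        if required > available:
            shortfalls.append((resource.strip(), required, available))
    return shortfalls


if __name__ == "__main__":
    for resource, required, available in insufficient_quotas(QUOTA_SUMMARY):
        print(f"{resource}: required {required}, available {available}")
```

Run against this log, it reports only `disk GB: required 250, available 121`.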

beam_PostCommit_Java_PVR_Flink: 
[https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink/214/]

The error looks different, but it is caused by a lack of memory, so it may be related to the quota issues above.

{noformat}
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000003acd80000, 6654787584, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 6654787584 bytes for committing reserved memory.{noformat}

beam_PostCommit_Java_ValidatesRunner_Flink_Gradle: 
[https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/2101/]

I couldn't find a visible error for the failure in this job, but I'm grouping it with the other failures because it flaked at the same time as the other Flink VR post-commit.

I may be grouping these failures a bit too aggressively. If anyone believes these failures have different causes, please split this into multiple bugs.

One possibility is that these errors are caused by running all of our post-commits at the same time, which uses up resources in bursts. Staggering the post-commits might avoid some of these quota issues.
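To illustrate what staggering could mean, here is a minimal Python sketch that spreads the start times of the jobs named in this issue evenly across a window instead of starting them together. The two-hour window and the `staggered_offsets` helper are hypothetical; in practice the offsets would have to live in each job's Jenkins trigger schedule.

```python
from datetime import timedelta

# Job names taken from this issue; not the full set of Beam post-commits.
POST_COMMITS = [
    "beam_PostCommit_Java_GradleBuild",
    "beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_Gradle",
    "beam_PostCommit_Java_PVR_Flink",
    "beam_PostCommit_Java_ValidatesRunner_Flink_Gradle",
]


def staggered_offsets(jobs: list[str], window: timedelta) -> dict[str, timedelta]:
    """Assign each job an evenly spaced start offset within `window`."""
    if not jobs:
        return {}
    step = window / len(jobs)
    return {job: step * i for i, job in enumerate(jobs)}


if __name__ == "__main__":
    # Assumed 2-hour window; a suitable value depends on how long each suite runs.
    for job, offset in staggered_offsets(POST_COMMITS, timedelta(hours=2)).items():
        print(f"{job}: start +{int(offset.total_seconds() // 60)} min")
```

With four jobs and a two-hour window, the jobs would start 30 minutes apart, so peak demand on the shared CPU and disk quotas would come from roughly one suite at a time rather than all four.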



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
