Daniel Oliveira created BEAM-5973:
-------------------------------------

    Summary: [Flake] Various ValidatesRunner Post-commits flaking due to quota issues.
    Key: BEAM-5973
    URL: https://issues.apache.org/jira/browse/BEAM-5973
    Project: Beam
    Issue Type: Bug
    Components: test-failures
    Reporter: Daniel Oliveira

Multiple post-commits all seem to have failed at the same time due to extremely similar GCP errors:

beam_PostCommit_Java_GradleBuild: [https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1822/]

Several tests fail with one of the two following errors:

{noformat}
Nov 04, 2018 6:40:14 PM org.apache.beam.runners.dataflow.TestDataflowRunner$ErrorMonitorMessagesHandler process
INFO: Dataflow job 2018-11-04_10_37_12-7420261977214120411 threw exception. Failure message was:
Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers.
QUOTA_EXCEEDED: Quota 'DISKS_TOTAL_GB' exceeded. Limit: 200000.0 in region us-central1.
{noformat}

{noformat}
Nov 04, 2018 6:39:14 PM org.apache.beam.runners.dataflow.TestDataflowRunner$ErrorMonitorMessagesHandler process
INFO: Dataflow job 2018-11-04_10_37_11-14433481609734431843 threw exception. Failure message was:
Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 1 workers.
QUOTA_EXCEEDED: Quota 'CPUS' exceeded. Limit: 750.0 in region us-central1.
{noformat}

beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_Gradle: [https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_Gradle/31/]

Test failures include the errors pasted above, plus one new one:

{noformat}
Nov 04, 2018 6:38:13 PM org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler process
SEVERE: 2018-11-04T18:38:04.612Z: Workflow failed. Causes: Project apache-beam-testing has insufficient quota(s) to execute this workflow with 1 instances in region us-central1.
Quota summary (required/available): 1/7192 instances, 1/202 CPUs, 250/121 disk GB, 0/4046 SSD disk GB, 1/267 instance groups, 1/267 managed instance groups, 1/242 instance templates, 1/446 in-use IP addresses.
{noformat}

beam_PostCommit_Java_PVR_Flink: [https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink/214/]

The error appears differently but is caused by a lack of memory, so it seems related to the quota issues above.

{noformat}
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000003acd80000, 6654787584, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 6654787584 bytes for committing reserved memory.
{noformat}

Project beam_PostCommit_Java_ValidatesRunner_Flink_Gradle: [https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/2101/]

I couldn't find a visible error with the failure in this job, but I'm grouping it together with the other failures due to it flaking at the same time as the other Flink VR Post-commit.

I may be grouping these failures a bit too aggressively. If anyone believes that the failures are caused by different reasons please split this into multiple bugs.

A possibility is that these errors are caused by us running all our post-commits at the same time, causing resources to be used up in bursts. Maybe if we stagger our post-commits some of these quota issues could be avoided.
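
For triage, it may help to pick out which entry in the "Quota summary (required/available)" line from the PortabilityApi failure actually falls short. The following is a minimal Python sketch, not part of any Beam or Dataflow tooling; the helper name find_shortfalls is hypothetical, and the parsing assumes the summary keeps the comma-separated "required/available name" layout seen in the log above.

```python
import re

# Quota summary copied verbatim from the Dataflow error in this report.
QUOTA_SUMMARY = (
    "1/7192 instances, 1/202 CPUs, 250/121 disk GB, 0/4046 SSD disk GB, "
    "1/267 instance groups, 1/267 managed instance groups, "
    "1/242 instance templates, 1/446 in-use IP addresses."
)


def find_shortfalls(summary):
    """Return {resource: (required, available)} where required > available.

    Hypothetical helper: assumes each comma-separated entry looks like
    "<required>/<available> <resource name>".
    """
    shortfalls = {}
    for entry in summary.rstrip(".").split(","):
        match = re.match(r"\s*(\d+)/(\d+)\s+(.+)", entry)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        required = int(match.group(1))
        available = int(match.group(2))
        if required > available:
            shortfalls[match.group(3).strip()] = (required, available)
    return shortfalls


if __name__ == "__main__":
    print(find_shortfalls(QUOTA_SUMMARY))  # {'disk GB': (250, 121)}
```

On the summary above, only the disk GB entry is short (250 required against 121 available), which lines up with the DISKS_TOTAL_GB error seen in beam_PostCommit_Java_GradleBuild.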