[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957432#comment-16957432 ]

Udi Meiri commented on BEAM-8196:
---------------------------------

This is still happening, and now very frequently for the 3.7 postcommits.
I investigated 6 seemingly long-running jobs in the apache-beam-testing project. All of them were running "Apache Beam Python 3.7 SDK 2.17.0.dev" and all showed the error "ModuleNotFoundError: No module named 'endpoints_pb2'".
One of these is still running after 16 hours. The rest failed after 1-2 hours.
Perhaps the 16 hour one did not abort because it is a streaming job?

> Python 3.5 post commit timed out at 100 minutes
> -----------------------------------------------
>
> Key: BEAM-8196
> URL: https://issues.apache.org/jira/browse/BEAM-8196
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Ahmet Altay
> Assignee: Udi Meiri
> Priority: Critical
> Time Spent: 40m
> Remaining Estimate: 0h
>
> https://builds.apache.org/job/beam_PostCommit_Python35/435/
> This post commit took 100 minutes and timed out. Should we increase the
> timeout? We can also look into why this postcommit was slow. A later post
> commit (https://builds.apache.org/job/beam_PostCommit_Python35/437/)
> completed in 66 minutes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928752#comment-16928752 ]

Udi Meiri commented on BEAM-8196:
---------------------------------

#9547 was merged ~18 hours ago, but this postcommit timed out about 4 hours ago:
https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/console
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927886#comment-16927886 ]

Robert Bradshaw commented on BEAM-8196:
---------------------------------------

Yes, I think we could make this an error if we're sure end users won't have issues (i.e. all the files are already there with the right timestamps).
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927877#comment-16927877 ]

Ahmet Altay commented on BEAM-8196:
-----------------------------------

Should we treat problems with generating the proto files as an error here:
https://github.com/apache/beam/blob/master/sdks/python/setup.py#L177

/cc [~robertwb]
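To illustrate what "making this an error" could look like, here is a minimal sketch of a post-generation check that raises instead of warning when an expected generated module is absent. It is not the code in Beam's setup.py; `ProtoGenerationError`, `check_generated_protos`, and the directory layout are hypothetical names introduced only for this example.

```python
import os


class ProtoGenerationError(RuntimeError):
    """Raised when an expected generated module is missing (hypothetical name)."""


def check_generated_protos(package_dir, expected_modules):
    """Raise if any expected <module>.py file is absent from package_dir.

    package_dir and expected_modules are assumptions for this sketch; a real
    build would derive them from the list of .proto sources.
    """
    missing = [
        name for name in expected_modules
        if not os.path.isfile(os.path.join(package_dir, name + ".py"))
    ]
    if missing:
        # Fail the build loudly rather than logging a warning and continuing.
        raise ProtoGenerationError(
            "missing generated proto modules: " + ", ".join(sorted(missing)))
    return True
```

A check like this would turn a silently incomplete build into an immediate failure, which is the trade-off discussed in this thread: safer for CI, but only acceptable if end users always have the generated files in place.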
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927874#comment-16927874 ]

Udi Meiri commented on BEAM-8196:
---------------------------------

I downloaded the tar file, and it did contain the file.
I also opened https://issues.apache.org/jira/browse/BEAM-8211 for setting a default timeout.
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927873#comment-16927873 ]

Ahmet Altay commented on BEAM-8196:
-----------------------------------

Setting `wait_until_finish_duration` sounds good.

For the failing job, do you have a way to check what was staged as the SDK tarball, and whether it contains the endpoints_pb2 file? I suspect we have an issue with creating the tarball.
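The tarball check asked about here can be sketched with the standard library alone. This is only an illustration under assumptions: `tarball_contains_module` is a hypothetical helper, and it assumes the staged SDK tarball has already been downloaded to a local path.

```python
import tarfile


def tarball_contains_module(tarball_path, module_name):
    """Return True if any regular file in the tarball is named <module_name>.py.

    Hypothetical helper: it matches on the file's base name only, so the
    module may sit at any depth inside the archive.
    """
    target = module_name + ".py"
    # "r:*" lets tarfile detect the compression (gzip, bz2, ...) itself.
    with tarfile.open(tarball_path, "r:*") as archive:
        return any(
            member.isfile() and member.name.split("/")[-1] == target
            for member in archive.getmembers()
        )
```

For example, `tarball_contains_module("apache-beam.tar.gz", "endpoints_pb2")` would report whether the generated module made it into a locally downloaded copy of the archive.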
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927866#comment-16927866 ]

Udi Meiri commented on BEAM-8196:
---------------------------------

This looks like another case of https://issues.apache.org/jira/browse/BEAM-7527 ([~markflyhigh]), except that BigQueryQueryToTableIT does not set a timeout for its pipeline (wait_until_finish_duration), so Dataflow tries to launch workers for a full hour.

I'll send a PR to set wait_until_finish_duration.
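The effect of a bounded wait can be sketched as follows. This is not Beam's implementation of wait_until_finish_duration; `wait_until_finish`, the state strings, and the `get_state` and `cancel` callbacks are hypothetical stand-ins for whatever the runner provides, and the millisecond unit is a choice made for this example.

```python
import time


def wait_until_finish(get_state, cancel, duration_ms, poll_interval_s=0.01,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it is terminal or duration_ms has elapsed.

    On timeout, call cancel() and return "TIMED_OUT". All names and state
    strings here are hypothetical and chosen for illustration only.
    """
    terminal = {"DONE", "FAILED", "CANCELLED"}
    deadline = clock() + duration_ms / 1000.0
    while True:
        state = get_state()
        if state in terminal:
            return state
        if clock() >= deadline:
            # Give up instead of letting the job sit in a non-terminal state.
            cancel()
            return "TIMED_OUT"
        sleep(poll_interval_s)
```

With a deadline like this, a job whose workers never start is cancelled after the configured duration rather than holding the whole postcommit until the Jenkins timeout.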
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927851#comment-16927851 ]

Ahmet Altay commented on BEAM-8196:
-----------------------------------

Great, thank you [~udim]. That sounds like a promising start.

[~alanmyrvold] [~yifanzou], a question: for jobs that run on Dataflow, could we introduce a timeout parameter to convert these Jenkins-level job timeouts into Dataflow-level job timeouts? Would that be a good or a bad idea? Would it increase flakiness?
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927832#comment-16927832 ] Udi Meiri commented on BEAM-8196: - Current theory is that test_big_query_standard_sql_kms_key_native is failing (though this involves a lot of guesswork so I'm 30% sure). In any case, I found this error in the worker-startup logs: {code} I 2019-09-11T13:07:04.210788Z Executing: /usr/local/bin/python -m dataflow_worker.start -Djob_id=2019-09-11_05_19_06-12909785153999113879 -Dproject_id=apache-beam-testing -Dreporting_enabled=True -Droot_url=https://dataflow.googleapis.com -Dservice_path=https://dataflow.googleapis.com/ -Dtemp_gcs_directory=gs://unused -Dworker_id=beamapp-jenkins-091112185-09110519-wcy3-harness-k5j8 -Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log -Dlocal_staging_directory=/var/opt/google/dataflow -Dsdk_pipeline_options={"display_data":[{"key":"runner","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"project","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"job_name","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"beamapp-jenkins-0911121857-115043"},{"key":"staging_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-0911121857-115043.1568204337.115228"},{"key":"temp_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-it/beamapp-jenkins-0911121857-115043.1568204337.115228"},{"key":"region","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"us-central1"},{"key":"dataflow_kms_key","namespace":"apache_beam.options.pipeline_option
s.PipelineOptions","type":"STRING","value":"projects/apache-beam-testing/locations/global/keyRings/beam-it/cryptoKeys/test"},{"key":"num_workers","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"dataflow_worker_jar","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.16.0-SNAPSHOT.jar"},{"key":"experiments","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"['use_fastavro']"},{"key":"requirements_file","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"beam_plugins","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"['apache_beam.io.filesystem.FileSystem', 'apache_beam.io.hadoopfilesystem.HadoopFileSystem', 'apache_beam.io.localfilesystem.LocalFileSystem', 'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem', 'apache_beam.io.filesystem_test.TestingFileSystem', 'apache_beam.runners.interactive.display.pipeline_graph_renderer.PipelineGraphRenderer', 'apache_beam.runners.interactive.display.pipeline_graph_renderer.MuteRenderer', 'apache_beam.runners.interactive.display.pipeline_graph_renderer.TextRenderer', 
'apache_beam.runners.interactive.display.pipeline_graph_renderer.PydotRenderer']"},{"key":"sdk_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/sdks/python/build/apache-beam.tar.gz"}],"options":{"autoscalingAlgorithm":"NONE","beam_plugins":["apache_beam.io.filesystem.FileSystem","apache_beam.io.hadoopfilesystem.HadoopFileSystem","apache_beam.io.localfilesystem.LocalFileSystem","apache_beam.io.gcp.gcsfilesystem.GCSFileSystem","apache_beam.io.filesystem_test.TestingFileSystem","apache_beam.runners.interactive.display.pipeline_graph_renderer.PipelineGraphRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.MuteRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.TextRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.PydotRenderer"],"dataflowJobId":"2019-09-11_05_19_06-12909785153999113879","dataflow_endpoint":"https://dataflow.googleapis.com","dataflow_kms_key":"projects/apache-beam-testing/locations/global/keyRings/beam-it/cryptoKeys/test","dataflow_worker_jar":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.16.0-SNAPSHOT.jar","direct_num_workers":1,"direct_runner_bundle_repeat":0,"direct_runner_use_stacked_bundle":true,"dry_run":false,"enable_streaming_eng
[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes
[ https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927801#comment-16927801 ]

Udi Meiri commented on BEAM-8196:
---------------------------------

No idea why this happens. There isn't an easy way to tell what's going on.
Increasing the timeout wouldn't help, as the running time seems pretty stable at around 60-65 minutes:
https://builds.apache.org/job/beam_PostCommit_Python35/buildTimeTrend

Another timeout:
https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/console