[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-10-22 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957432#comment-16957432
 ] 

Udi Meiri commented on BEAM-8196:
-

This is still happening, and very frequently now for the 3.7 postcommits.
I investigated 6 seemingly long-running jobs on the apache-beam-testing 
project. They were all running "Apache Beam Python 3.7 SDK 2.17.0.dev" and 
were all showing the error
"ModuleNotFoundError: No module named 'endpoints_pb2'".
One of these is still running after 16 hours. The rest failed after 1-2 hours. 
Perhaps the 16-hour one did not abort because it is a streaming job?


> Python 3.5 post commit timed out at 100 minutes
> ---
>
> Key: BEAM-8196
> URL: https://issues.apache.org/jira/browse/BEAM-8196
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Critical
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> https://builds.apache.org/job/beam_PostCommit_Python35/435/
> This post commit took 100 minutes and timed out. Should we increase the 
> timeout? We can also look into why this postcommit was slow. A later post 
> commit (https://builds.apache.org/job/beam_PostCommit_Python35/437/) 
> completed in 66 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-12 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928752#comment-16928752
 ] 

Udi Meiri commented on BEAM-8196:
-

#9547 was merged ~18 hours ago, but this postcommit timed out about 4 hours 
ago:
https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/console



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Robert Bradshaw (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927886#comment-16927886
 ] 

Robert Bradshaw commented on BEAM-8196:
---

Yes, I think we could make this an error if we're sure end users won't have 
issues (i.e. all the files are already there with the right timestamps). 



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Ahmet Altay (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927877#comment-16927877
 ] 

Ahmet Altay commented on BEAM-8196:
---

Should we make failures to generate the proto files an error here?
https://github.com/apache/beam/blob/master/sdks/python/setup.py#L177

/cc [~robertwb]
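
A minimal sketch of the idea, not the actual setup.py code: the helper name `run_generation_step` and its `strict` flag are hypothetical. It only illustrates the difference between downgrading a failed generation step to a warning and letting the failure stop the build.

```python
import warnings


def run_generation_step(generate, strict=True):
    """Run a code-generation callable; fail loudly when strict is set.

    `generate` stands in for whatever produces the *_pb2 files. Both this
    helper and the `strict` flag are hypothetical, not part of Beam.
    """
    try:
        generate()
    except Exception as exc:
        if strict:
            # Abort so that a tarball without generated files is never built.
            raise RuntimeError("proto generation failed: %s" % exc) from exc
        # Lenient mode: the build continues and the problem surfaces later.
        warnings.warn("proto generation failed, continuing: %s" % exc)
        return False
    return True
```

With `strict=True` a missing generator surfaces at build time instead of as a ModuleNotFoundError on the worker. Whether that is safe for end users depends on the generated files already being present, which this sketch does not check.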



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927874#comment-16927874
 ] 

Udi Meiri commented on BEAM-8196:
-

I downloaded the staged SDK tarball, and it does contain the endpoints_pb2 
file.
I also opened https://issues.apache.org/jira/browse/BEAM-8211 for setting a 
default timeout.



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Ahmet Altay (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927873#comment-16927873
 ] 

Ahmet Altay commented on BEAM-8196:
---

Setting `wait_until_finish_duration` sounds good.

For the failing job, is there a way to check what was staged as the SDK 
tarball, and whether it contains the endpoints_pb2 file? I suspect we have an 
issue with creating the tarball.



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927866#comment-16927866
 ] 

Udi Meiri commented on BEAM-8196:
-

This looks like another case of https://issues.apache.org/jira/browse/BEAM-7527 
([~markflyhigh]), except that BigQueryQueryToTableIT does not set a timeout for 
its pipeline (wait_until_finish_duration) so Dataflow tries to launch workers 
for a full hour.
I'll send a PR to set wait_until_finish_duration.
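
A rough sketch of what setting the option could look like when pipeline arguments are built as a list of flags. The helper `add_wait_timeout` is hypothetical, and it assumes `wait_until_finish_duration` is expressed in milliseconds.

```python
def add_wait_timeout(pipeline_args, minutes):
    """Return a copy of pipeline_args with --wait_until_finish_duration set.

    Assumes the option takes milliseconds. Any earlier value of the flag
    is dropped so that the new one is the only one present.
    """
    flag = "--wait_until_finish_duration"
    kept = [arg for arg in pipeline_args if not arg.startswith(flag + "=")]
    return kept + ["%s=%d" % (flag, minutes * 60 * 1000)]
```

With a bound like this the test stops waiting on the job after the given time, rather than waiting for as long as Dataflow keeps trying to launch workers.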



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Ahmet Altay (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927851#comment-16927851
 ] 

Ahmet Altay commented on BEAM-8196:
---

Great, thank you [~udim]. That sounds like a promising start.

[~alanmyrvold][~yifanzou] -- Question: for jobs that run on Dataflow, could we 
introduce a timeout parameter to convert these Jenkins-level job timeouts to 
Dataflow-level job timeouts? Would that be a good or bad idea? Would it 
increase flakiness?



[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927832#comment-16927832
 ] 

Udi Meiri commented on BEAM-8196:
-

The current theory is that test_big_query_standard_sql_kms_key_native is 
failing (though this involves a lot of guesswork, so I'm only about 30% sure).
In any case, I found this error in the worker-startup logs:
{code}
I 2019-09-11T13:07:04.210788Z Executing: /usr/local/bin/python -m 
dataflow_worker.start -Djob_id=2019-09-11_05_19_06-12909785153999113879 
-Dproject_id=apache-beam-testing -Dreporting_enabled=True 
-Droot_url=https://dataflow.googleapis.com 
-Dservice_path=https://dataflow.googleapis.com/ 
-Dtemp_gcs_directory=gs://unused 
-Dworker_id=beamapp-jenkins-091112185-09110519-wcy3-harness-k5j8 
-Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log 
-Dlocal_staging_directory=/var/opt/google/dataflow 
-Dsdk_pipeline_options={"display_data":[{"key":"runner","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"project","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"job_name","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"beamapp-jenkins-0911121857-115043"},{"key":"staging_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-0911121857-115043.1568204337.115228"},{"key":"temp_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-it/beamapp-jenkins-0911121857-115043.1568204337.115228"},{"key":"region","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"us-central1"},{"key":"dataflow_kms_key","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"projects/apache-beam-testing/locations/global/keyRings/beam-it/cryptoKeys/test"},{"key":"num_workers","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"dataflow_worker_jar","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.16.0-SNAPSHOT.jar"},{"key":"experiments","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"['use_fastavro']"},{"key":"requirements_file","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"beam_plugins","namespace":"apache_beam.options.pipeline_options.PipelineO
ptions","type":"STRING","value":"['apache_beam.io.filesystem.FileSystem',
 'apache_beam.io.hadoopfilesystem.HadoopFileSystem', 
'apache_beam.io.localfilesystem.LocalFileSystem', 
'apache_beam.io.gcp.gcsfilesystem.GCSFileSystem', 
'apache_beam.io.filesystem_test.TestingFileSystem', 
'apache_beam.runners.interactive.display.pipeline_graph_renderer.PipelineGraphRenderer',
 
'apache_beam.runners.interactive.display.pipeline_graph_renderer.MuteRenderer', 
'apache_beam.runners.interactive.display.pipeline_graph_renderer.TextRenderer', 
'apache_beam.runners.interactive.display.pipeline_graph_renderer.PydotRenderer']"},{"key":"sdk_location","namespace":"apache_beam.options.pipeline_options.PipelineOptions","type":"STRING","value":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/sdks/python/build/apache-beam.tar.gz"}],"options":{"autoscalingAlgorithm":"NONE","beam_plugins":["apache_beam.io.filesystem.FileSystem","apache_beam.io.hadoopfilesystem.HadoopFileSystem","apache_beam.io.localfilesystem.LocalFileSystem","apache_beam.io.gcp.gcsfilesystem.GCSFileSystem","apache_beam.io.filesystem_test.TestingFileSystem","apache_beam.runners.interactive.display.pipeline_graph_renderer.PipelineGraphRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.MuteRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.TextRenderer","apache_beam.runners.interactive.display.pipeline_graph_renderer.PydotRenderer"],"dataflowJobId":"2019-09-11_05_19_06-12909785153999113879","dataflow_endpoint":"https://dataflow.googleapis.com","dataflow_kms_key":"projects/apache-beam-testing/locations/global/keyRings/beam-it/cryptoKeys/test","dataflow_worker_jar":"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37/src/runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.16.0-SNAPSHOT.jar","direct_num_workers":1,"direct_runner_bundle_repeat":0,"direct_runner_use_stacked_bundle":true,"dry_run":false,"enable_streaming_eng

[jira] [Commented] (BEAM-8196) Python 3.5 post commit timed out at 100 minutes

2019-09-11 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927801#comment-16927801
 ] 

Udi Meiri commented on BEAM-8196:
-

No idea why this happens. There isn't an easy way to tell what's going on.
Increasing the timeout wouldn't help as the running time seems pretty stable at 
around 60-65m: 
https://builds.apache.org/job/beam_PostCommit_Python35/buildTimeTrend

Another timeout:
https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/console

