rmetzger commented on a change in pull request #12268:
URL: https://github.com/apache/flink/pull/12268#discussion_r429974586



##########
File path: tools/ci/watchdog.sh
##########
@@ -0,0 +1,122 @@
+#!/usr/bin/env bash
+################################################################################
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+################################################################################
+
+#
+# This file contains a watchdog tool to run, monitor, and potentially kill tasks
+# not producing any output for n seconds.
+#
+
+# Number of seconds w/o output before printing a stack trace and killing the watched process
+MAX_NO_OUTPUT=${MAX_NO_OUTPUT:-900}
+
+# Number of seconds to sleep before checking the output again
+SLEEP_TIME=${SLEEP_TIME:-20}
+
+CMD_OUT=${CMD_OUT:-"/tmp/watchdog.out"}
+CMD_PID=${CMD_PID:-"/tmp/watchdog.pid"}
+CMD_EXIT=${CMD_EXIT:-"/tmp/watchdog.exit"}
+
+
+# =============================================
+# Utility functions
+# ============================================= 
+
+mod_time () {
+       echo `stat -c "%Y" $CMD_OUT`
+}
+
+the_time() {
+       echo `date +%s`
+}
+
+# watchdog process
+
+watchdog () {
+       touch $CMD_OUT
+
+       while true; do
+               sleep $SLEEP_TIME
+
+               time_diff=$((`the_time` - `mod_time`))
+
+               if [ $time_diff -ge $MAX_NO_OUTPUT ]; then
+                       echo "=============================================================================="
+                       echo "Process produced no output for ${MAX_NO_OUTPUT} seconds."
+                       echo "=============================================================================="
+
+                       # run timeout callback
+                       $WATCHDOG_CALLBACK_ON_TIMEOUT
+
+                       echo "Killing process with pid=$(<$CMD_PID) and all descendants"
+                       pkill -P $(<$CMD_PID) # kill descendants
+                       kill $(<$CMD_PID) # kill process itself
+
+                       exit 1
+               fi
+       done
+}
+
+assume_available () {
+       VAR=$1
+       if [ -z "$VAR" ] ; then
+               echo "ERROR: Environment variable '$VAR' is not set but expected by watchdog.sh"
+               exit 1
+       fi
+}
+
+# =============================================
+# main function
+# =============================================
+
+# entrypoint
+function run_with_watchdog() {
+       local cmd="$1"
+
+       # check preconditions
+       assume_available CMD_OUT # used for writing the process output (to check for activity)
+       assume_available CMD_PID # location of file to write process id to
+       assume_available CMD_EXIT # location of file to write exit code to
+       assume_available WATCHDOG_CALLBACK_ON_TIMEOUT # bash function to call on timeout
+
+       watchdog &

Review comment:
       `CMD_PID` is **not** a variable containing the command's process id. It is 
the name of the file used to store and retrieve the command's process id. In 
that sense, it should be considered internal configuration rather than an 
argument to be passed around.
   
   There is probably no logical argument for why either of the two approaches 
we are discussing is better or worse, because both work.
   
   But in my opinion, readability is a bit better when the names of helper 
files are defined globally, instead of defining them globally for the main 
method but passing them to helper methods as arguments. Something that is 
passed as an argument to a method is usually considered part of the actual work 
being done.
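
The distinction can be illustrated with a minimal sketch. It assumes the default path from the diff above; the helper names `start_watched` and `watched_pid` are hypothetical and do not appear in the pull request. `CMD_PID` is defined once, globally, as a file name, and the functions read it instead of receiving it as an argument.

```shell
#!/usr/bin/env bash
# Sketch only: CMD_PID is a file NAME (global configuration), not a process id.
CMD_PID=${CMD_PID:-"/tmp/watchdog.pid"}

# Hypothetical helper: start a command in the background and store its pid in the file.
start_watched () {
    "$@" &
    echo $! > "$CMD_PID"
}

# Hypothetical helper: read the pid back from the file whenever it is needed.
watched_pid () {
    cat "$CMD_PID"
}

start_watched sleep 30
echo "Watched process has pid $(watched_pid)"
kill "$(watched_pid)"   # clean up the demo process
```

Only the command itself is passed as an argument here; the file name stays in one place, which is the readability argument made above.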

##########
File path: tools/azure-pipelines/jobs-template.yml
##########
@@ -121,15 +128,34 @@ jobs:
     continueOnError: true # continue the build even if the cache fails.
     condition: not(eq('${{parameters.test_pool_definition.name}}', 'Default'))
     displayName: Cache Maven local repo
+
   - script: |
       echo "##vso[task.setvariable variable=JAVA_HOME]$JAVA_HOME_11_X64"
       echo "##vso[task.setvariable variable=PATH]$JAVA_HOME_11_X64/bin:$PATH"
     displayName: "Set to jdk11"
     condition: eq('${{parameters.jdk}}', 'jdk11')  
+
   - script: sudo sysctl -w kernel.core_pattern=core.%p
     displayName: Set coredump pattern
+
   # Test
-  - script: STAGE=test ${{parameters.environment}} ./tools/azure-pipelines/azure_controller.sh $(module)
+  - script: |
+      ./tools/azure-pipelines/unpack_build_artifact.sh
+      export DEBUG_FILES="$AGENT_TEMPDIRECTORY/debug_files"

Review comment:
       Thanks for your feedback. I will drastically shorten the `- script: |` 
sections.
   

##########
File path: tools/azure-pipelines/build-python-wheels.yml
##########
@@ -24,7 +24,6 @@ jobs:
       # Compile
       - script: |
           ${{parameters.environment}} ./tools/ci/compile.sh
-          ./tools/azure-pipelines/create_build_artifact.sh

Review comment:
       Ah, I see. I will look into this :) 

##########
File path: flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java
##########
@@ -1068,8 +1067,8 @@ public static void teardown() throws Exception {
 
        }
 
-       public static boolean isOnTravis() {
-               return System.getenv("TRAVIS") != null && System.getenv("TRAVIS").equals("true");
+       public static boolean isOnCI() {
+               return System.getenv("IS_CI") != null && System.getenv("IS_CI").equals("true");

Review comment:
       Cool, thx. Will do.

##########
File path: tools/ci/maven-utils.sh
##########
@@ -73,7 +73,7 @@ function collect_coredumps {
        echo "Searching for .dump, .dumpstream and related files in '$SEARCHDIR'"
        for file in `find $SEARCHDIR -type f -regextype posix-extended -iregex '.*\.hprof|.*\.dump|.*\.dumpstream|.*hs.*\.log|.*/core(.[0-9]+)?$'`; do
                echo "Moving '$file' to target directory ('$TARGET_DIR')"
-               mv $file $TARGET_DIR/
+               mv $file $TARGET_DIR/$(echo $file | tr "/" "-")

Review comment:
       I just checked: these dumpstream files contain, for example, the 
following:
   ```
   # Created at 2020-05-25T10:24:38.843
   Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
   ```
   
   I guess the surefire plugin creates a dumpstream file if it receives any 
output, but only fails if it receives unexpected output?
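
For context, the changed `mv` line in the hunk above builds the target file name by replacing every `/` in the source path with `-`, presumably so that files with the same base name from different directories do not overwrite each other. A minimal sketch of that renaming step follows; the function name `flatten_path` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: turn a path into a flat file name by replacing "/" with "-",
# the same transformation as $(echo $file | tr "/" "-") in the diff.
flatten_path () {
    echo "$1" | tr "/" "-"
}

# Hypothetical example path.
flatten_path "flink-core/target/a.dumpstream"
# prints: flink-core-target-a.dumpstream
```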
   

##########
File path: tools/azure-pipelines/build-python-wheels.yml
##########
@@ -24,7 +24,6 @@ jobs:
       # Compile
       - script: |
           ${{parameters.environment}} ./tools/ci/compile.sh
-          ./tools/azure-pipelines/create_build_artifact.sh

Review comment:
       Initial tests suggest that we can really remove all exclusions :) 

##########
File path: tools/ci/watchdog.sh
##########
@@ -0,0 +1,122 @@

Review comment:
       Thank you for your clarifications! I now understand your concerns. 
   I agree that the use of `assume_available` for something internal is weird.
   
   I will push an updated version of the script.
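
As a side note on `assume_available` as quoted in the diff: it tests `$VAR`, which holds the variable's name rather than its value, so the check can never fail for a non-empty name. A minimal sketch of a value-based check follows, assuming bash indirect expansion (`${!name}`). It is an illustration only, not the updated script; it uses `return` instead of `exit` so the demonstration can continue, and the callback name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check that the variable whose NAME is passed in has a non-empty value.
assume_available () {
    local name="$1"
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
        echo "ERROR: Environment variable '$name' is not set but expected by watchdog.sh"
        return 1   # the script in the diff exits here; return keeps this demo running
    fi
}

WATCHDOG_CALLBACK_ON_TIMEOUT="print_stacktraces"   # hypothetical callback name
assume_available WATCHDOG_CALLBACK_ON_TIMEOUT && echo "callback configured"

unset WATCHDOG_CALLBACK_ON_TIMEOUT
assume_available WATCHDOG_CALLBACK_ON_TIMEOUT || echo "callback missing"
```

Variables with built-in defaults, such as `CMD_PID`, would always pass a check like this, which fits the point that checking them is unnecessary.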

##########
File path: tools/azure-pipelines/jobs-template.yml
##########
@@ -64,14 +65,16 @@ jobs:
     displayName: "Set to jdk11"
     condition: eq('${{parameters.jdk}}', 'jdk11')
   # Compile
-  - script: STAGE=compile ${{parameters.environment}} ./tools/azure_controller.sh compile
-    displayName: Build
+  - script: |
+      ${{parameters.environment}} ./tools/ci/compile.sh || exit $?
+      ./tools/azure-pipelines/create_build_artifact.sh
+    displayName: Compile
 
   # upload artifacts for next stage
   - task: PublishPipelineArtifact@1
     inputs:
-      path: $(CACHE_FLINK_DIR)

Review comment:
       The documentation does not mention the parameter anymore. Since it still 
works, I assume it's deprecated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

