[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
[ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422356#comment-17422356 ]

Ryan Scudellari commented on FLINK-24156:
-----------------------------------------

[~chesnay] - would it be possible to backport this into the `1.14.x` line? I see that `1.14.0` was just released. What are the policies around including these sorts of fixes in a patch release?

> BlobServer crashes due to SocketTimeoutException in Java 11
> -----------------------------------------------------------
>
>                 Key: FLINK-24156
>                 URL: https://issues.apache.org/jira/browse/FLINK-24156
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.12.4, 1.13.2
>         Environment: Java 11
> CentOS 7.6
>            Reporter: Ryan Scudellari
>            Assignee: Ryan Scudellari
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> h3. Overview
> We have seen the BlobServer crash due to a *SocketTimeoutException* while running on JRE 11.
> This is likely caused by a [JDK bug present in JDK 11|https://bugs.openjdk.java.net/browse/JDK-8237858]
> (fixed in version 16) that erroneously throws _SocketTimeoutException_ when
> _ServerSocket.accept()_ is interrupted by any UNIX signal. The BlobServer calls _accept()_
> when establishing connections with clients and is expected to block indefinitely.
> [The BlobServer currently shuts down when it catches a Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267].
> We do not see this behavior when running the same steps in JRE 8.
> h3. Reproducing the issue
> To reproduce, send a _SIGPIPE_ to the BlobServer _PID_. You will need to be running a Flink
> cluster on JRE 11 and have tools _jps_ and _jstack_ available to find the relevant pid.
> One-liner:
> {code:bash}
> kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint | cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' | xargs printf "%d")
> {code}
>
> # Run
> {code:bash}
> jstack [PID] | grep BLOB
> {code}
> where *PID* is the process ID of the job manager.
> # Find the *nid=[HEX]* value and convert the HEX to decimal.
> # Run
> {code:bash}
> kill -SIGPIPE [DNID]
> {code}
> where *DNID* is the converted decimal value of *HEX nid* from the previous step.
> # Observe the following error in the job manager logs:
> {noformat}
> 2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR org.apache.flink.runtime.blob.BlobServer - BLOB server stopped working. Shutting down
> at java.base/java.net.PlainSocketImpl.socketAccept
> at java.base/java.net.AbstractPlainSocketImpl.accept
> at java.base/java.net.ServerSocket.implAccept
> at java.base/java.net.ServerSocket.accept
> at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
> 2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
> {noformat}
> h3. Proposed Fix
> To protect ourselves from this JDK bug, we propose the workaround of catching
> _SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call indefinitely.
>
> Thanks to [~bsanders-wf] for helping track this down.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
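The manual reproduction steps quoted above (pick the BLOB thread out of the _jstack_ output, then convert its *nid* from hex to decimal) can be sketched in Java roughly as follows. This is only an illustration: the class name, the helper name, and the sample thread line are made up for the example and are not part of Flink or the JDK tools.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NidToDecimal {

    // Matches the hex native thread ID as it appears in a thread header, e.g. "nid=0x1a2b".
    private static final Pattern NID = Pattern.compile("nid=0x([0-9a-fA-F]+)");

    /** Hypothetical helper: returns the decimal native thread ID found in one thread line, if any. */
    static OptionalLong nidToDecimal(String jstackLine) {
        Matcher m = NID.matcher(jstackLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1), 16))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Made-up sample line in the general shape of a jstack thread header; values are illustrative.
        String line = "\"BLOB Server listener at 6124\" #45 daemon prio=5 os_prio=0 "
                + "tid=0x00007f3c4c0a1000 nid=0x1a2b runnable [0x00007f3c1d5fe000]";
        nidToDecimal(line).ifPresent(System.out::println); // prints 6699
    }
}
```

The printed decimal value is what would be passed to `kill -SIGPIPE` as *DNID* in the last reproduction step.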
[jira] [Comment Edited] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
[ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415574#comment-17415574 ]

Ryan Scudellari edited comment on FLINK-24156 at 9/16/21, 12:47 PM:
--------------------------------------------------------------------

[~chesnay] [https://github.com/apache/flink/pull/17227] should be ready for review. I believe I have addressed all comments. Let me know if you have questions!

was (Author: scudellari):
[https://github.com/apache/flink/pull/17227] should be ready for review. I believe I have addressed all comments. Let me know if you have questions!
[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
[ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415574#comment-17415574 ]

Ryan Scudellari commented on FLINK-24156:
-----------------------------------------

[https://github.com/apache/flink/pull/17227] should be ready for review. I believe I have addressed all comments. Let me know if you have questions!
[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
[ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410753#comment-17410753 ]

Ryan Scudellari commented on FLINK-24156:
-----------------------------------------

(y)
[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
[ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410678#comment-17410678 ]

Ryan Scudellari commented on FLINK-24156:
-----------------------------------------

Yes, I can provide a fix. A utility method in {{NetUtils}} sounds good. Should I migrate the {{ServerSocket#accept()}} calls in the integration tests, or can we assume they are only ever run using Java 8?
[jira] [Created] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11
Ryan Scudellari created FLINK-24156:
---------------------------------------

             Summary: BlobServer crashes due to SocketTimeoutException in Java 11
                 Key: FLINK-24156
                 URL: https://issues.apache.org/jira/browse/FLINK-24156
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.13.2, 1.12.4
         Environment: Java 11
CentOS 7.6
            Reporter: Ryan Scudellari


h3. Overview
We have seen the BlobServer crash due to a *SocketTimeoutException* while running on JRE 11. This is likely caused by a [JDK bug present in JDK 11|https://bugs.openjdk.java.net/browse/JDK-8237858] (fixed in version 16) that erroneously throws _SocketTimeoutException_ when _ServerSocket.accept()_ is interrupted by any UNIX signal. The BlobServer calls _accept()_ when establishing connections with clients and is expected to block indefinitely. [The BlobServer currently shuts down when it catches a Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267]. We do not see this behavior when running the same steps in JRE 8.
h3. Reproducing the issue
To reproduce, send a _SIGPIPE_ to the BlobServer _PID_. You will need to be running a Flink cluster on JRE 11 and have tools _jps_ and _jstack_ available to find the relevant pid.
One-liner:
{code:bash}
kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint | cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' | xargs printf "%d")
{code}

# Run
{code:bash}
jstack [PID] | grep BLOB
{code}
where *PID* is the process ID of the job manager.
# Find the *nid=[HEX]* value and convert the HEX to decimal.
# Run
{code:bash}
kill -SIGPIPE [DNID]
{code}
where *DNID* is the converted decimal value of *HEX nid* from the previous step.
# Observe the following error in the job manager logs:
{noformat}
2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR org.apache.flink.runtime.blob.BlobServer - BLOB server stopped working. Shutting down
at java.base/java.net.PlainSocketImpl.socketAccept
at java.base/java.net.AbstractPlainSocketImpl.accept
at java.base/java.net.ServerSocket.implAccept
at java.base/java.net.ServerSocket.accept
at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
{noformat}
h3. Proposed Fix
To protect ourselves from this JDK bug, we propose the workaround of catching _SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call indefinitely.

Thanks to [~bsanders-wf] for helping track this down.
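For illustration, the proposed workaround could look roughly like the sketch below. It is not the code from the pull request; the class name and the helper name `acceptRetryingOnTimeout` are hypothetical. The sketch assumes the server socket was created without a timeout (SO_TIMEOUT of 0), so that any _SocketTimeoutException_ from _accept()_ can be treated as spurious and retried.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptRetrySketch {

    /** Hypothetical helper: blocks until a client connects, retrying spurious timeouts. */
    static Socket acceptRetryingOnTimeout(ServerSocket serverSocket) throws IOException {
        // Retrying is only safe when no timeout was configured (SO_TIMEOUT == 0);
        // a configured timeout is expected to surface to the caller.
        if (serverSocket.getSoTimeout() != 0) {
            throw new IllegalArgumentException("expected a server socket without SO_TIMEOUT");
        }
        while (true) {
            try {
                return serverSocket.accept();
            } catch (SocketTimeoutException e) {
                // With SO_TIMEOUT == 0 no timeout should ever fire, so treat it as
                // spurious (for example, caused by a signal) and call accept() again.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             // Connect first so the pending connection is already queued when accept() runs.
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = acceptRetryingOnTimeout(server)) {
            System.out.println("accepted connection: " + accepted.isConnected());
        }
    }
}
```

Other _IOException_ subtypes still propagate to the caller, so genuine socket failures are not swallowed by the retry loop.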