[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-29 Thread Ryan Scudellari (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422356#comment-17422356
 ] 

Ryan Scudellari commented on FLINK-24156:
-

[~chesnay] - would it be possible to backport this into the `1.14.x` line? I 
see that `1.14.0` was just released. What are the policies around including 
these sorts of fixes in a patch release?

> BlobServer crashes due to SocketTimeoutException in Java 11
> ---
>
> Key: FLINK-24156
> URL: https://issues.apache.org/jira/browse/FLINK-24156
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.12.4, 1.13.2
> Environment: Java 11
> CentOS 7.6
>Reporter: Ryan Scudellari
>Assignee: Ryan Scudellari
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> h3. Overview
> We have seen the BlobServer crash due to a *SocketTimeoutException* while 
> running on JRE 11. This is likely caused by a [JDK bug present in JDK 
> 11|https://bugs.openjdk.java.net/browse/JDK-8237858] (fixed in version 16) 
> that erroneously throws _SocketTimeoutException_ when _ServerSocket.accept()_ 
> is interrupted by any UNIX signal. The BlobServer calls _accept()_ when 
> establishing connections with clients and is expected to block indefinitely. 
> [The BlobServer currently shuts down when it catches a 
> Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267].
>  We do not see this behavior when running the same steps in JRE 8.
> h3. Reproducing the issue
> To reproduce, send a _SIGPIPE_ to the BlobServer listener thread, addressed 
> by its native thread ID (_nid_). You will need to be running a Flink cluster 
> on JRE 11 and have the tools _jps_ and _jstack_ available to find the 
> relevant IDs.
> One-liner:
> {code:bash}
> kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint | cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' | xargs printf "%d")
> {code}
>  
>  # Run
> {code:bash}
> jstack [PID] | grep BLOB
> {code}
> where *PID* is the process ID of the job manager.
>  # Find the *nid=[HEX]* value and convert the HEX to decimal.
>  # Run
> {code:bash}
> kill -SIGPIPE [DNID]
> {code}
> where *DNID* is the converted decimal value of *HEX nid* from the previous 
> step.
>  # Observe the following error in the job manager logs:
> {noformat}
> 2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR 
> org.apache.flink.runtime.blob.BlobServer  - BLOB server stopped working. 
> Shutting down
>   at java.base/java.net.PlainSocketImpl.socketAccept
>   at java.base/java.net.AbstractPlainSocketImpl.accept
>   at java.base/java.net.ServerSocket.implAccept
>   at java.base/java.net.ServerSocket.accept
>   at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
> 2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO  
> org.apache.flink.runtime.blob.BlobServer  - Stopped BLOB server at 
> 0.0.0.0:6124
> {noformat}
> h3. Proposed Fix
> To protect ourselves from this JDK bug, we propose the workaround of catching 
> _SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call 
> indefinitely.
>  
> Thanks to [~bsanders-wf] for helping track this down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-16 Thread Ryan Scudellari (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415574#comment-17415574
 ] 

Ryan Scudellari edited comment on FLINK-24156 at 9/16/21, 12:47 PM:


[~chesnay] [https://github.com/apache/flink/pull/17227] should be ready for 
review. I believe I have addressed all comments. Let me know if you have 
questions!


was (Author: scudellari):
[https://github.com/apache/flink/pull/17227] should be ready for review. I 
believe I have addressed all comments. Let me know if you have questions!



[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-15 Thread Ryan Scudellari (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415574#comment-17415574
 ] 

Ryan Scudellari commented on FLINK-24156:
-

[https://github.com/apache/flink/pull/17227] should be ready for review. I 
believe I have addressed all comments. Let me know if you have questions!



[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-06 Thread Ryan Scudellari (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410753#comment-17410753
 ] 

Ryan Scudellari commented on FLINK-24156:
-

(y)



[jira] [Commented] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-06 Thread Ryan Scudellari (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410678#comment-17410678
 ] 

Ryan Scudellari commented on FLINK-24156:
-

Yes, I can provide a fix.

A utility method in {{NetUtils}} sounds good. Should I migrate the 
{{ServerSocket#accept()}} calls in the integration tests or can we assume they 
are only ever run using Java 8?



[jira] [Created] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

2021-09-03 Thread Ryan Scudellari (Jira)
Ryan Scudellari created FLINK-24156:
---

 Summary: BlobServer crashes due to SocketTimeoutException in Java 11
 Key: FLINK-24156
 URL: https://issues.apache.org/jira/browse/FLINK-24156
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.13.2, 1.12.4
 Environment: Java 11

CentOS 7.6
Reporter: Ryan Scudellari


h3. Overview

We have seen the BlobServer crash due to a *SocketTimeoutException* while 
running on JRE 11. This is likely caused by a [JDK bug present in JDK 
11|https://bugs.openjdk.java.net/browse/JDK-8237858] (fixed in version 16) that 
erroneously throws _SocketTimeoutException_ when _ServerSocket.accept()_ is 
interrupted by any UNIX signal. The BlobServer calls _accept()_ when 
establishing connections with clients and is expected to block indefinitely. 
[The BlobServer currently shuts down when it catches a 
Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267].
 We do not see this behavior when running the same steps in JRE 8.
h3. Reproducing the issue

To reproduce, send a _SIGPIPE_ to the BlobServer listener thread, addressed by 
its native thread ID (_nid_). You will need to be running a Flink cluster on 
JRE 11 and have the tools _jps_ and _jstack_ available to find the relevant 
IDs.

One-liner:
{code:bash}
kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint | 
cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' | 
xargs printf "%d")
{code}
 
 # Run
{code:bash}
jstack [PID] | grep BLOB
{code}
where *PID* is the process ID of the job manager.

 # Find the *nid=[HEX]* value and convert the HEX to decimal.
 # Run
{code:bash}
kill -SIGPIPE [DNID]
{code}
where *DNID* is the converted decimal value of *HEX nid* from the previous step.

 # Observe the following error in the job manager logs:
{noformat}
2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR 
org.apache.flink.runtime.blob.BlobServer  - BLOB server stopped working. 
Shutting down
  at java.base/java.net.PlainSocketImpl.socketAccept
  at java.base/java.net.AbstractPlainSocketImpl.accept
  at java.base/java.net.ServerSocket.implAccept
  at java.base/java.net.ServerSocket.accept
  at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO  
org.apache.flink.runtime.blob.BlobServer  - Stopped BLOB server at 0.0.0.0:6124
{noformat}
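
For anyone who prefers not to convert the *nid* by hand, the hex-to-decimal step above can be sketched in Java as follows. This is only an illustration: the class name {{NidParser}} and the sample thread line are made up for the example, and the exact field layout of _jstack_ output may differ between JDK builds.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for this example; not part of Flink or the JDK. */
final class NidParser {

    // Matches the "nid=0x..." field that jstack prints for each thread.
    private static final Pattern NID = Pattern.compile("\\bnid=0x([0-9a-fA-F]+)\\b");

    /** Returns the decimal native thread ID found in a jstack thread line, if any. */
    static OptionalLong parseNid(String jstackLine) {
        Matcher matcher = NID.matcher(jstackLine);
        if (!matcher.find()) {
            return OptionalLong.empty();
        }
        // Radix 16 turns the hex digits into the decimal value that kill expects.
        return OptionalLong.of(Long.parseLong(matcher.group(1), 16));
    }

    public static void main(String[] args) {
        // Made-up sample line; real jstack output carries more fields.
        String line = "\"BLOB Server listener at 6124\" #45 daemon prio=5 os_prio=0 "
                + "tid=0x00007f0000000000 nid=0x1a2b runnable";
        System.out.println(parseNid(line).getAsLong()); // prints 6699
    }
}
```

The printed value is what would be passed to _kill -SIGPIPE_ in the third step.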

h3. Proposed Fix

To protect ourselves from this JDK bug, we propose the workaround of catching 
_SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call 
indefinitely.
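
A minimal sketch of what such a workaround could look like is shown below. The class and method names ({{AcceptRetry}}, {{acceptWithoutTimeout}}) are hypothetical and chosen for this example; the code that ends up in Flink may be structured differently. The sketch assumes the server socket has no _SO_TIMEOUT_ configured, since only then can a _SocketTimeoutException_ safely be treated as spurious.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical helper for this example; not Flink's actual API. */
final class AcceptRetry {

    /**
     * Blocks until a client connects, retrying accept() whenever it throws a
     * SocketTimeoutException even though no timeout was configured.
     */
    static Socket acceptWithoutTimeout(ServerSocket serverSocket) throws IOException {
        // With a real timeout set, the exception would be legitimate and
        // must not be swallowed, so this sketch refuses such sockets.
        if (serverSocket.getSoTimeout() != 0) {
            throw new IllegalArgumentException(
                    "serverSocket must not have a timeout configured");
        }
        while (true) {
            try {
                return serverSocket.accept();
            } catch (SocketTimeoutException e) {
                // SO_TIMEOUT is 0, so this is assumed to be the spurious
                // exception described above: ignore it and accept again.
            }
        }
    }
}
```

Any other _IOException_ still propagates to the caller, so a genuinely broken or closed server socket is handled by the existing error path rather than being retried forever.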

 

Thanks to [~bsanders-wf] for helping track this down.


