[ 
https://issues.apache.org/jira/browse/FLINK-21416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302508#comment-17302508
 ] 

Piotr Nowojski edited comment on FLINK-21416 at 3/16/21, 1:15 PM:
------------------------------------------------------------------

The number of these failures is a bit suspicious. Has something changed 
recently, either in the test setup or in the blocking partition, that could be 
related?

I have a suspicion, especially after:
{noformat}
Caused by: org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2054)
        ... 8 more
{noformat}
that this might be caused by us doing blocking IO in the Netty threads when 
using the blocking partition. That would easily explain the handshake timeout. 
Also, in the 5 reported failures that I checked, "connection reset by peer" 
happens around 60s into the test. Maybe this is the same problem: the server 
freezes for x seconds, causing the client to time out (the error got lost? 
maybe it's in the logs?), and the server side then detects this and fails with 
"connection reset by peer"?

Another pointer: it seems that all of the failures happened with SSL enabled.
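
To illustrate the suspected mechanism, here is a minimal sketch in plain Java (not Flink or Netty code). A single-threaded executor stands in for one Netty event loop thread, a sleeping task stands in for a blocking file read of the blocking partition, and a second task stands in for SSL handshake processing queued on the same thread. The class name `EventLoopStall` and the method `handshakeDelayMillis` are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class EventLoopStall {

    /**
     * Returns how long (in ms) a "handshake" task waits in the queue of a
     * single-threaded executor that is busy with a blocking task.
     */
    static long handshakeDelayMillis(long blockingMillis) throws Exception {
        // Stand-in for one event loop thread (assumption: one thread serves both tasks).
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch blockingStarted = new CountDownLatch(1);

            // Stand-in for blocking disk IO executed on the event loop thread.
            eventLoop.submit(() -> {
                blockingStarted.countDown();
                try {
                    Thread.sleep(blockingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Queue the "handshake" only once the thread is already blocked.
            blockingStarted.await();
            long submitted = System.nanoTime();
            Future<Long> handshake = eventLoop.submit(() -> System.nanoTime() - submitted);

            // The handshake task cannot run until the blocking task returns.
            return TimeUnit.NANOSECONDS.toMillis(handshake.get());
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long delay = handshakeDelayMillis(500);
        System.out.println("handshake task waited ~" + delay + " ms behind the blocking task");
    }
}
```

If the blocking work on such a thread lasted longer than the 10000ms from the exception above, a handshake queued behind it would miss its deadline. This only demonstrates the queuing effect; whether it is what actually happens in the failing test is still a hypothesis.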

I don't know why it has started to fail so frequently now. Maybe something 
changed in the test setup or in the environment. I was always afraid that doing 
blocking IO in the Netty threads could cause problems, but that's not something 
we can easily change (assuming this is what causes those issues). Maybe we can 
speed up the test? Make it lighter? Or maybe we can increase some timeouts 
(related to SSL?) either in Netty or in the TCP stack? What worries me is that 
this issue is apparently happening not only in the ITCase here, but also in 
the (cluster?) benchmarks?


> FileBufferReaderITCase.testSequentialReading fails on azure
> -----------------------------------------------------------
>
>                 Key: FLINK-21416
>                 URL: https://issues.apache.org/jira/browse/FLINK-21416
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Guo Weijie
>            Priority: Critical
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13473&view=logs&j=59c257d0-c525-593b-261d-e96a86f1926b&t=b93980e3-753f-5433-6a19-13747adae66a
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>       at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
>       at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:811)
>       at org.apache.flink.runtime.io.network.partition.FileBufferReaderITCase.testSequentialReading(FileBufferReaderITCase.java:128)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runners.Suite.runChild(Suite.java:128)
>       at org.junit.runners.Suite.runChild(Suite.java:27)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>       at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>       at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>       at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>       at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
>       at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:117)
>       at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:79)
>       at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:221)
>       at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:212)
>       at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:203)
>       at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:650)
>       at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:81)
>       at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:435)
>       at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>       at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
