[ 
https://issues.apache.org/jira/browse/FLINK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474404#comment-17474404
 ] 

Yufei Zhang commented on FLINK-23944:
-------------------------------------

I could not find a way to reproduce this failure reliably. Some observations:
 # The mismatch begins in the middle of the result (not necessarily at the 
start).
 # It only happens in the second send and write, after the failover call.
 # The first send and write always succeed (at least in the reported 
pipelines).

Since we cannot reproduce this reliably, my suggestion is to first determine 
whether it is caused by data loss or by wrong ordering. (I think data loss is 
most likely, but I can't confirm.) So I would suggest:
 # Assert on the size of the result set first, and then compare ordering, so 
we can tell whether the mismatch is caused by data loss, wrong ordering, or 
both.
 # Eliminate nondeterministic test data. Currently generateStringTestData() 
generates random strings, and it is better to avoid nondeterminism in tests 
(except in random testing). We can also add more debugging info to the test 
strings so that the issue is easier to debug the next time it occurs.
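
The two suggestions above could look roughly like the sketch below. This is 
only an illustration under assumptions: the class ResultDiagnosis and the 
methods generateDeterministicTestData() and diagnose() are hypothetical names 
and are not part of the Flink test suite or of SourceTestSuiteBase. The record 
format "split-<n>-record-<i>" is likewise just one possible way to embed 
debugging info in each test string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not existing Flink code.
class ResultDiagnosis {

    /** Deterministic test data: every record names its split and its sequence number. */
    static List<String> generateDeterministicTestData(int splitIndex, int count) {
        List<String> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            records.add("split-" + splitIndex + "-record-" + i);
        }
        return records;
    }

    /** Checks for lost or extra records first, and only then compares the ordering. */
    static String diagnose(List<String> expected, List<String> actual) {
        // Remove one occurrence of each consumed record; whatever is left was never consumed.
        List<String> missing = new ArrayList<>(expected);
        for (String record : actual) {
            missing.remove(record);
        }
        if (expected.size() != actual.size() || !missing.isEmpty()) {
            return "DATA_MISMATCH: expected " + expected.size() + " records but got "
                    + actual.size() + ", missing " + missing;
        }
        // Same records on both sides, so any remaining difference is an ordering problem.
        for (int i = 0; i < expected.size(); i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return "ORDER_MISMATCH at position " + i + ": expected '"
                        + expected.get(i) + "' but was '" + actual.get(i) + "'";
            }
        }
        return "OK";
    }

    public static void main(String[] args) {
        List<String> expected = generateDeterministicTestData(0, 5);
        // Simulate a run in which one record was lost after the failover.
        List<String> actual = new ArrayList<>(expected);
        actual.remove(3);
        System.out.println(diagnose(expected, actual));
    }
}
```

With records like these, a failure message already tells us which split and 
which sequence number went missing or arrived out of order, instead of showing 
two unrelated random strings.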

What do you think, [~syhily]?

> PulsarSourceITCase.testTaskManagerFailure is instable
> -----------------------------------------------------
>
>                 Key: FLINK-23944
>                 URL: https://issues.apache.org/jira/browse/FLINK-23944
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Pulsar
>    Affects Versions: 1.14.0
>            Reporter: Dian Fu
>            Assignee: Yufan Sheng
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.15.0
>
>
> [https://dev.azure.com/dianfu/Flink/_build/results?buildId=430&view=logs&j=f3dc9b18-b77a-55c1-591e-264c46fe44d1&t=2d3cd81e-1c37-5c31-0ee4-f5d5cdb9324d]
> It's from my personal azure pipeline, however, I'm pretty sure that I have 
> not touched any code related to this. 
> {code:java}
> Aug 24 10:44:13 [ERROR] testTaskManagerFailure{TestEnvironment, ExternalContext, ClusterControllable}[1] Time elapsed: 258.397 s <<< FAILURE!
> Aug 24 10:44:13 java.lang.AssertionError:
> Aug 24 10:44:13
> Aug 24 10:44:13 Expected: Records consumed by Flink should be identical to test data and preserve the order in split
> Aug 24 10:44:13 but: Mismatched record at position 7: Expected '0W6SzacX7MNL4xLL3BZ8C3ljho4iCydbvxIl' but was 'wVi5JaJpNvgkDEOBRC775qHgw0LyRW2HBxwLmfONeEmr'
> Aug 24 10:44:13 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
> Aug 24 10:44:13 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
> Aug 24 10:44:13 at org.apache.flink.connectors.test.common.testsuites.SourceTestSuiteBase.testTaskManagerFailure(SourceTestSuiteBase.java:271)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
