[ https://issues.apache.org/jira/browse/FLINK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474404#comment-17474404 ]
Yufei Zhang commented on FLINK-23944:
-------------------------------------

I could not find a way to trigger this failure case 100% of the time. Some observations:
# The mismatch begins in the middle of the records (not necessarily at the start).
# It only happens in the second send and write after the failover call.
# The first send and write always succeeds (at least in the reported pipelines).

Since we cannot reproduce this reliably, my suggestion is to first determine whether it is caused by data loss or by wrong ordering. (I think data loss is most likely, but I can't confirm it.) So I would suggest:
# Assert on the size of the result set first, and then compare ordering, so we know whether the mismatch is caused by data loss, wrong ordering, or both.
# Eliminate nondeterministic test data. Currently generateStringTestData() generates random strings, and it is better to avoid nondeterminism in tests (except in random testing). We can add more debugging info to the test strings so the issue is easier to debug the next time it occurs.

What do you think [~syhily]

> PulsarSourceITCase.testTaskManagerFailure is instable
> -----------------------------------------------------
>
>                 Key: FLINK-23944
>                 URL: https://issues.apache.org/jira/browse/FLINK-23944
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Pulsar
>    Affects Versions: 1.14.0
>            Reporter: Dian Fu
>            Assignee: Yufan Sheng
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.15.0
>
>
> [https://dev.azure.com/dianfu/Flink/_build/results?buildId=430&view=logs&j=f3dc9b18-b77a-55c1-591e-264c46fe44d1&t=2d3cd81e-1c37-5c31-0ee4-f5d5cdb9324d]
> It's from my personal azure pipeline, however, I'm pretty sure that I have
> not touched any code related to this.
> {code:java}
> Aug 24 10:44:13 [ERROR] testTaskManagerFailure{TestEnvironment, ExternalContext, ClusterControllable}[1] Time elapsed: 258.397 s <<< FAILURE!
> Aug 24 10:44:13 java.lang.AssertionError:
> Aug 24 10:44:13
> Aug 24 10:44:13 Expected: Records consumed by Flink should be identical to test data and preserve the order in split
> Aug 24 10:44:13      but: Mismatched record at position 7: Expected '0W6SzacX7MNL4xLL3BZ8C3ljho4iCydbvxIl' but was 'wVi5JaJpNvgkDEOBRC775qHgw0LyRW2HBxwLmfONeEmr'
> Aug 24 10:44:13 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
> Aug 24 10:44:13 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
> Aug 24 10:44:13 	at org.apache.flink.connectors.test.common.testsuites.SourceTestSuiteBase.testTaskManagerFailure(SourceTestSuiteBase.java:271)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
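
For illustration only, the two suggestions in the comment (check the result size before the ordering, and replace random test strings with deterministic, self-describing ones) could be sketched roughly as below. This is not the code in SourceTestSuiteBase or the Pulsar connector tests: the class name MismatchDiagnostics, both method names, and the record format are hypothetical and chosen only to make the idea concrete.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Flink's test utilities.
final class MismatchDiagnostics {

    // Deterministic stand-in for random test strings: every record carries
    // the split name and its index, so a mismatch shows what went missing.
    public static List<String> generateDeterministicTestData(String splitName, int count) {
        List<String> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            records.add(splitName + "-record-" + i);
        }
        return records;
    }

    // Checks the size first, then the order, and reports which check failed.
    public static String diagnose(List<String> expected, List<String> actual) {
        if (expected.size() != actual.size()) {
            return "SIZE_MISMATCH: expected " + expected.size()
                    + " records but got " + actual.size();
        }
        for (int i = 0; i < expected.size(); i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                // Same size: decide whether the records were only reordered
                // or whether some records were replaced by others.
                List<String> sortedExpected = new ArrayList<>(expected);
                List<String> sortedActual = new ArrayList<>(actual);
                Collections.sort(sortedExpected);
                Collections.sort(sortedActual);
                String kind = sortedExpected.equals(sortedActual)
                        ? "ORDER_MISMATCH" : "CONTENT_MISMATCH";
                return kind + " at position " + i + ": expected '"
                        + expected.get(i) + "' but was '" + actual.get(i) + "'";
            }
        }
        return "OK";
    }
}
```

With records such as "split-0-record-7", a failure message like the one quoted above would directly show which split and index were expected and which actually arrived, instead of two opaque random strings.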