[ https://issues.apache.org/jira/browse/BEAM-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lukasz Gajowy updated BEAM-8207: -------------------------------- Status: Open (was: Triage Needed) > KafkaIOITs generate different hashes each run, sometimes dropping records > ------------------------------------------------------------------------- > > Key: BEAM-8207 > URL: https://issues.apache.org/jira/browse/BEAM-8207 > Project: Beam > Issue Type: Bug > Components: io-java-kafka, testing > Reporter: Michal Walenia > Priority: Major > > While working to adapt Java's KafkaIOIT to work with a large dataset > generated by a SyntheticSource I encountered a problem. I want to push 100M > records through a Kafka topic, verify data correctness and at the same time > check the performance of KafkaIO.Write and KafkaIO.Read. > > To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam > repo > ([here|https://github.com/apache/beam/tree/master/.test-infra/kubernetes/kafka-cluster]). > > The expected result would be that first the records are generated in a > deterministic way (using hashes of list positions as Random seeds), next they > are written to Kafka - this concludes the write pipeline. > As for reading and correctness checking - first, the data is read from the > topic and after being decoded into String representations, a hashcode of the > whole PCollection is calculated (For details, check KafkaIOIT.java). > > During the testing I ran into several problems: > 1. When all the records are read from the Kafka topic, the hash is different > each time. > 2. Sometimes not all the records are read and the Dataflow task waits for the > input indefinitely, occasionally throwing exceptions. > > I believe there are two possible causes of this behavior: > > either there is something wrong with the Kafka cluster configuration > or KafkaIO behaves erratically on high data volumes, duplicating and/or > dropping records. > Second option seems troubling and I would be grateful for help with the first. -- This message was sent by Atlassian Jira (v8.3.4#803005)