Abacn commented on PR #24633: URL: https://github.com/apache/beam/pull/24633#issuecomment-1364132421
Multiple Issues of SDF read tests in both Java and Python xlang when bumping up the number of records: * Java Read -> (without ReShuffle) Count element have duplicates because a single kafka read can fail intermittently. Result in count > expected number (https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3484/consoleFull). <img width="1088" alt="image" src="https://user-images.githubusercontent.com/8010435/209372205-16f4d2c7-32a2-4b0a-bc69-a6abc932992e.png"> This is flaky rather than perma-red (streaming test succeeded in https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3482/consoleFull) This was not observed previously because the test was run on small dataset. Note that run 3482 failed because of another flake in batch test. The write had intermittent fail and retried, and then cause read hash check fail. * Java Read -> ReShuffle -> Count. The performance degrades significantly with a reshuffle inserted. Previously throughput ~ 100k/s (see above screenshot) add a ReShuffle it becomes ~20k/s (https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3488/console) <img width="1085" alt="image" src="https://user-images.githubusercontent.com/8010435/209373106-5d8b2469-9c64-45e8-9650-f1d571779a86.png"> * Python xlang Read -> (without ReShuffle) The pipeline does not scale, only one worker active (https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/5/console) and throughput is ~50k/s <img width="812" alt="image" src="https://user-images.githubusercontent.com/8010435/209373443-5a2927e3-4eb2-436c-a4d2-7b7c40a15127.png"> <img width="784" alt="image" src="https://user-images.githubusercontent.com/8010435/209373341-0c168c80-9560-43bc-b598-5a56c0669332.png"> * Python xlang Read -> ReShuffle -> Count. The performance has further reduced to only 1k/s throughput (https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/7/console) <img width="986" alt="image" src="https://user-images.githubusercontent.com/8010435/209373894-095653f6-b3e0-4294-a833-78c9fa1aa613.png"> (note to myself: still need to cancel the streaming pipeline after waituntil reached) * There is still a case that Python read can work, that is do not set `--streaming` pipeline option, then the job runs on batch mode (which is wierd to me either, ReadFromKafka transform itself does not induce a streaming pipeline). The succeeded job is https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/3/console with a pretty high throughput. --------- Based on these attempts, Summary of change to make test work after bumping num_records * Prune kafka topic after test done. Otherwise the second test fail with kafka server run out of source. * For write pipeline, add a ReShuffle after generate records. Otherwise intermittent fail in kafka write will cause retry generate records and the hash value will change. * For Python pipeline, do not add `--streaming` for now. Issues revealed and need investigation: * For Java, investigate the performance degration when ReShuffle used downstream of Kafka read. Note that Java ReShuffle is marked as "deprecated". * For Python xlang, why CountMetrics transform downstream causing no parallelism when bundle is fused (this does not happen in Java pipeline). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
