MartijnVisser opened a new pull request, #28629:
URL: https://github.com/apache/flink/pull/28629
## What is the purpose of the change
FLINK-40048 removed the flink-sql-connector-kafka dependency and its
maven-dependency-plugin copy from flink-sql-client-test, so no Kafka sql-jar is
staged into target/sql-jars anymore. test_pyflink.sh still executed
`KAFKA_SQL_JAR=$(find "$SQL_JARS_DIR" | grep "kafka")` under `set -Eeuo
pipefail`, which now exits 1 and aborts the "PyFlink end-to-end test" nightly
leg (e2e_4_ci) before any test runs (first seen in build 76685).
The jar and all Kafka scaffolding in the script only served the "Test
PyFlink DataStream job" case, disabled since FLINK-36185 because
data_stream_job.py used the legacy FlinkKafkaConsumer/FlinkKafkaProducer. This
PR rewrites that job to the in-repo filesystem connector and re-enables the
case, fixing the nightly leg and restoring PyFlink DataStream coverage
(KeyedProcessFunction + event-time timers) dormant since September 2024. No
active Kafka e2e coverage is lost; PyFlink-Kafka e2e coverage belongs in the
externalized flink-connector-kafka repository.
## Brief change log
- Rewrite data_stream_job.py onto the bounded FileSource and unified
FileSink; JSON lines are parsed in a map function
- Disable periodic watermarks so all timers fire at MIN_VALUE + 1500 on
the end-of-input MAX_WATERMARK, making the output deterministic (the same
output the Kafka-based test asserted)
- Use blocking env.execute(); the committer commits all pending part files
at end of input, so no checkpointing is needed
- Re-enable the DataStream case in test_pyflink.sh with a sorted diff over
the committed part files, and remove the Kafka scaffolding (kafka_sql_common.sh
stays; test_confluent_schema_registry.sh still sources it)
## Verifying this change
This change added tests and can be verified as follows:
- Ran the rewritten job twice standalone on a local minicluster
(source-built PyFlink, Python 3.12); both runs produced exactly the 16 expected
lines, byte-identical, with a cleanly committed part file and no in-progress
leftovers
- The re-enabled case fails loudly on missing or wrong output (empty
part-file set diffs against the expected lines and exits 1)
- The nightly e2e_4 leg runs the full script in CI
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes (Claude Code (Fable 5))
Generated-by: Claude Code (Fable 5)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]