MartijnVisser opened a new pull request, #28629:
URL: https://github.com/apache/flink/pull/28629

   ## What is the purpose of the change
   
   FLINK-40048 removed the flink-sql-connector-kafka dependency and its
maven-dependency-plugin copy step from flink-sql-client-test, so no Kafka
sql-jar is staged into target/sql-jars anymore. test_pyflink.sh still executed
`KAFKA_SQL_JAR=$(find "$SQL_JARS_DIR" | grep "kafka")` under
`set -Eeuo pipefail`. With no matching jar, `grep` exits 1, the assignment
fails, and the script aborts the "PyFlink end-to-end test" nightly leg
(e2e_4_ci) before any test runs (first seen in build 76685).
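
The failure mode can be reproduced outside CI. The sketch below drives `bash` from Python, only to stay in the same language as the job script. The temporary directory and the jar file names are hypothetical stand-ins for target/sql-jars, and it assumes `bash`, `find`, and `grep` are on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path

# Same shape as the failing line in test_pyflink.sh; the trailing echo is
# only reached if the assignment succeeds.
SCRIPT = '''
set -Eeuo pipefail
KAFKA_SQL_JAR=$(find "$1" | grep "kafka")
echo "reached: $KAFKA_SQL_JAR"
'''


def run_lookup(jar_names):
    """Run the lookup against a temp dir holding the given (hypothetical) jars."""
    with tempfile.TemporaryDirectory(prefix="sqljars-") as jars_dir:
        for name in jar_names:
            (Path(jars_dir) / name).touch()
        # "bash -c SCRIPT name arg" binds arg to $1 inside the script.
        return subprocess.run(
            ["bash", "-c", SCRIPT, "lookup", jars_dir],
            capture_output=True,
            text=True,
        )


missing = run_lookup(["other-connector.jar"])
present = run_lookup(["flink-sql-connector-kafka.jar"])
print(missing.returncode, repr(missing.stdout))
print(present.returncode, repr(present.stdout))
```

When no file name matches, `grep` returns a non-zero status, the command substitution passes that status to the assignment, and errexit stops the script before the `echo`.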
   
   The jar and all Kafka scaffolding in the script only served the "Test
PyFlink DataStream job" case, which has been disabled since FLINK-36185 because
data_stream_job.py used the legacy FlinkKafkaConsumer/FlinkKafkaProducer. This
PR rewrites that job to use the in-repo filesystem connector and re-enables the
case. That fixes the nightly leg and restores PyFlink DataStream coverage
(KeyedProcessFunction + event-time timers) that has been dormant since
September 2024. No active Kafka e2e coverage is lost; PyFlink-Kafka e2e coverage
belongs in the externalized flink-connector-kafka repository.
   
   ## Brief change log
   
     - Rewrite data_stream_job.py onto the bounded FileSource and unified 
FileSink; JSON lines are parsed in a map function
     - Disable periodic watermarks so all timers fire at MIN_VALUE + 1500 on 
the end-of-input MAX_WATERMARK, making the output deterministic (the same 
output the Kafka-based test asserted)
     - Use blocking env.execute(); the committer commits all pending part files 
at end of input, so no checkpointing is needed
     - Re-enable the DataStream case in test_pyflink.sh with a sorted diff over 
the committed part files, and remove the Kafka scaffolding (kafka_sql_common.sh 
stays; test_confluent_schema_registry.sh still sources it)
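
A pure-Python sketch of two points from the list above: parsing one JSON line per record (the kind of function that could be handed to a DataStream `map`), and why every timer carries the same timestamp when the watermark never advances before end of input. The field names `createTime`, `orderId`, and `payAmount`, and the idea that timers are registered at the current watermark plus 1500, are assumptions for illustration and are not taken from this PR's diff.

```python
import json

# Java's Long.MIN_VALUE: the watermark an operator sees before any
# watermark has been emitted.
LONG_MIN = -(2 ** 63)
TIMER_DELAY_MS = 1500


def parse_line(line):
    """Turn one JSON line into a (createTime, orderId, payAmount) tuple.

    The field names are hypothetical placeholders.
    """
    record = json.loads(line)
    return record["createTime"], record["orderId"], record["payAmount"]


def timer_timestamp(current_watermark=LONG_MIN):
    """Timestamp of a timer registered relative to the current watermark."""
    return current_watermark + TIMER_DELAY_MS


sample = '{"createTime": "2020-10-26 10:30:13", "orderId": 1, "payAmount": 9.5}'
print(parse_line(sample))
print(timer_timestamp())
```

With periodic watermarks disabled, the watermark stays at Long.MIN_VALUE until the final MAX_WATERMARK, so every record would register the same timer timestamp and the output does not depend on wall-clock timing.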
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Ran the rewritten job twice standalone on a local minicluster 
(source-built PyFlink, Python 3.12); both runs produced exactly the 16 expected 
lines, byte-identical, with a cleanly committed part file and no in-progress 
leftovers
     - The re-enabled case fails loudly on missing or wrong output: an empty
part-file set is still diffed against the expected lines, so the check exits 1
     - The nightly e2e_4 leg runs the full script in CI
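
The "fails loudly" property of the sorted comparison can be sketched in Python as follows. This is not the shell code in test_pyflink.sh; the helper names, the `part-` file name pattern, and the `.inprogress.` marker for uncommitted files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def committed_lines(output_dir):
    """Collect lines from finished part files, skipping in-progress ones."""
    lines = []
    for path in sorted(Path(output_dir).rglob("*part-*")):
        # Assumption: uncommitted files carry ".inprogress." in their name.
        if path.is_file() and ".inprogress." not in path.name:
            lines.extend(path.read_text().splitlines())
    return lines


def check_output(output_dir, expected):
    """Return 0 if the sorted output matches the sorted expectation, else 1."""
    return 0 if sorted(committed_lines(output_dir)) == sorted(expected) else 1


expected = ["line a", "line b"]
with tempfile.TemporaryDirectory() as out:
    # No part files yet: the empty result is still compared and mismatches.
    empty_status = check_output(out, expected)
    (Path(out) / "part-0-0").write_text("line b\nline a\n")
    (Path(out) / ".part-0-1.inprogress.tmp").write_text("partial\n")
    match_status = check_output(out, expected)
print(empty_status, match_status)
```

Sorting both sides makes the check independent of the order in which part files are listed or records are written, while an empty output directory still produces a mismatch instead of passing silently.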
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Code (Fable 5))
   
   Generated-by: Claude Code (Fable 5)
   

