lincoln-lil commented on code in PR #800: URL: https://github.com/apache/flink-web/pull/800#discussion_r2227626623
########## docs/content/posts/2025-07-31-release-2.1.0.md: ##########
@@ -0,0 +1,458 @@
+---
+authors:
+  - reswqa:
+      name: "Ron Liu"
+      twitter: "Ron999"
+
+date: "2025-07-31T08:00:00Z"
+subtitle: ""
+title: Ushers in a New Era of Unified Real-Time Data + AI with Comprehensive Upgrades
+aliases:
+  - /news/2025/07/31/release-2.1.0.html
+---
+
+The Apache Flink PMC is proud to announce the release of Apache Flink 2.1.0. This marks a significant milestone
+in the evolution of the real-time data processing engine into a unified Data + AI platform. This release brings
+together 116 global contributors, implements 15 FLIPs (Flink Improvement Proposals), and resolves over 220 issues,
+with a strong focus on deepening the integration of real-time AI and intelligent stream processing:
+
+1. **Breakthroughs in Real-Time AI**:
+   - Introduces AI Model DDL, enabling flexible management of AI models through Flink SQL and the Table API.
+
+   - Extends the `ML_PREDICT` Table-Valued Function (TVF), empowering real-time invocation of AI models within Flink SQL,
+     laying the foundation for building end-to-end real-time AI workflows.
+
+2. **Enhanced Real-Time Data Processing**:
+   - Adds the `VARIANT` data type for efficient handling of semi-structured data like JSON. Combined with the `PARSE_JSON` function
+     and lakehouse formats (e.g., Apache Paimon), it enables dynamic schema data analysis.
+
+   - Significantly optimizes streaming joins with the innovative introduction of `DeltaJoin` and `MultiJoin` strategies,
+     eliminating state bottlenecks and improving resource utilization and job stability.
+
+Flink 2.1.0 seamlessly integrates real-time data processing with AI models, empowering enterprises to advance from
+real-time analytics to real-time intelligent decision-making, meeting the evolving demands of modern data applications.
+We extend our gratitude to all contributors for their invaluable support!
+
+Let's dive into the highlights.
+
+# Flink SQL Improvements
+
+## Model DDLs using Table API
+Since Flink 2.0, we have introduced dedicated syntax for AI models, enabling users to define models
+as easily as creating catalog objects and to invoke them like standard functions or table functions in SQL statements.
+In Flink 2.1, we have also added Model DDL support to the Table API, enabling users to define and manage AI models programmatically
+in both Java and Python. This provides a flexible, code-driven alternative to SQL for model management and
+integration within Flink applications.
+
+Example: Defining a Model via Table API (Java)
+```java
+tableEnv.createModel(
+    "MyModel",
+    ModelDescriptor.forProvider("OPENAI")
+        .inputSchema(
+            Schema.newBuilder()
+                .column("f0", DataTypes.STRING())
+                .build())
+        .outputSchema(
+            Schema.newBuilder()
+                .column("label", DataTypes.STRING())
+                .build())
+        .option("task", "classification")
+        .option("type", "remote")
+        .option("provider", "openai")
+        .option("openai.endpoint", "remote")
+        .option("openai.api_key", "abcdefg")
+        .build(),
+    true);
+```
+
+**More Information**
+* [FLINK-37548](https://issues.apache.org/jira/browse/FLINK-37548)
+* [FLIP-507](https://cwiki.apache.org/confluence/display/FLINK/FLIP-507%3A+Add+Model+DDL+methods+in+TABLE+API)

Review Comment:
   link FLIP-437 as well

########## docs/content/posts/2025-07-31-release-2.1.0.md: ##########
@@ -0,0 +1,458 @@
+## Model DDLs using Table API

Review Comment:
   Should we either remove the restriction to the Table API here or explicitly include SQL, since the release testing [1] also describes direct usage of the SQL DDL?
   
   1. https://issues.apache.org/jira/browse/FLINK-38066

########## docs/content/posts/2025-07-31-release-2.1.0.md: ##########
@@ -0,0 +1,458 @@
+
+## Realtime AI Function
+
+Building on the AI model DDL, Flink 2.1 extends the `ML_PREDICT` table-valued function (TVF) to perform real-time model inference in SQL queries, seamlessly applying machine learning models to data streams.
+The implementation supports both a Flink built-in model provider (OpenAI) and interfaces for users to define custom model providers, accelerating Flink's evolution from a real-time
+data processing engine into a unified real-time AI platform. Looking ahead, we plan to introduce more AI functions such as `ML_EVALUATE` and `VECTOR_SEARCH` to unlock an end-to-end experience
+for real-time data processing, model training, and inference.
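As a rough mental model of what the OpenAI provider does per row (illustration only, not Flink internals — the function name and dict shapes below are assumptions for the sketch; only the `"model"` and `"system-prompt"` option names mirror the `CREATE MODEL` options):

```python
import json

# Illustration only, not Flink internals: conceptually, each input row handed to
# ML_PREDICT with the OpenAI provider becomes one chat-completion request built
# from the model's options plus the input column value.
def build_chat_request(options, text):
    """Build the JSON body of an OpenAI-style chat-completion call for one row."""
    return json.dumps({
        "model": options["model"],
        "messages": [
            {"role": "system", "content": options["system-prompt"]},
            {"role": "user", "content": text},
        ],
    })

body = build_chat_request(
    {"model": "gpt-4o", "system-prompt": "translate to Chinese"},
    "hello world",
)
print(body)
```

The `response` column of the TVF output then corresponds to the content the endpoint returns for each such request.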
+
+Take the following SQL statements as an example:
+```sql
+-- Declare an AI model
+CREATE MODEL `my_model`
+INPUT (text STRING)
+OUTPUT (response STRING)
+WITH (
+  'provider' = 'openai',
+  'endpoint' = 'https://api.openai.com/v1/llm/v1/chat',
+  'api-key' = 'abcdefg',
+  'system-prompt' = 'translate to Chinese',
+  'model' = 'gpt-4o'
+);
+
+-- Basic usage
+SELECT * FROM ML_PREDICT(
+  TABLE input_table,
+  MODEL my_model,
+  DESCRIPTOR(text)
+);
+
+-- With configuration options
+SELECT * FROM ML_PREDICT(
+  TABLE input_table,
+  MODEL my_model,
+  DESCRIPTOR(text),
+  MAP['async', 'true', 'timeout', '100s']
+);
+
+-- Using named parameters
+SELECT * FROM ML_PREDICT(
+  INPUT => TABLE input_table,
+  MODEL => MODEL my_model,
+  ARGS => DESCRIPTOR(text),
+  CONFIG => MAP['async', 'true']
+);
+```
+
+**More Information**
+* [FLINK-34992](https://issues.apache.org/jira/browse/FLINK-34992)
+* [FLINK-37777](https://issues.apache.org/jira/browse/FLINK-37777)
+* [FLIP-437](https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL)
+* [FLIP-525](https://cwiki.apache.org/confluence/display/FLINK/FLIP-525%3A+Model+ML_PREDICT%2C+ML_EVALUATE+Implementation+Design)
+* [Model Inference](https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/sql/queries/model-inference/)
+
+## Variant Type
+
+`VARIANT` is a new data type for semi-structured data (e.g., JSON). It can store any
+semi-structured value, including ARRAY, MAP (with STRING keys), and scalar types, while preserving
+field type information in a JSON-like structure. Unlike ROW and STRUCTURED types, VARIANT provides
+superior flexibility for handling deeply nested and evolving schemas.
+
+Users can use `PARSE_JSON` or `TRY_PARSE_JSON` to convert JSON-formatted VARCHAR data to VARIANT. In
+addition, table formats like Apache Paimon now support the VARIANT type, enabling
+users to efficiently process semi-structured data in the lakehouse using Flink SQL.
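The "preserves field type information" point is what separates VARIANT from carrying JSON around as a plain string; sketched in plain Python for intuition (illustration only — `parse_json` below is a stand-in, not Flink API):

```python
import json

# Illustration only: a VARIANT-like value is a typed tree, not raw text.
# parse_json stands in for Flink's PARSE_JSON, which yields a VARIANT value.
def parse_json(text):
    return json.loads(text)

doc = parse_json('{"user": {"name": "Bob", "age": 42}, "tags": ["a", "b"]}')

# Nested fields keep their scalar types without a schema declared up front,
# which is what makes deeply nested, evolving data easy to query.
assert isinstance(doc["user"]["age"], int)
assert doc["tags"] == ["a", "b"]
print(doc["user"]["name"])
```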
+
+Take the following SQL statements as an example:
+```sql
+CREATE TABLE t1 (
+  id INTEGER,
+  v STRING -- a JSON string
+) WITH (
+  'connector' = 'mysql-cdc',
+  ...
+);
+
+CREATE TABLE t2 (
+  id INTEGER,
+  v VARIANT
+) WITH (
+  'connector' = 'paimon',
+  ...
+);
+
+-- write to t2 with the VARIANT type
+INSERT INTO t2 SELECT id, PARSE_JSON(v) FROM t1;
+```
+
+**More Information**
+* [FLINK-37922](https://issues.apache.org/jira/browse/FLINK-37922)
+* [FLIP-521](https://cwiki.apache.org/confluence/display/FLINK/FLIP-521%3A+Integrating+Variant+Type+into+Flink%3A+Enabling+Efficient+Semi-Structured+Data+Processing)
+* [Variant](https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/types/#other-data-types)
+
+## Structured Type Enhancements
+
+In Flink 2.1, we enabled declaring user-defined objects via STRUCTURED TYPE directly in `CREATE TABLE` DDL
+statements, resolving critical type equivalence issues and significantly improving API usability.
+
+Take the following SQL statements as an example:
+```sql
+CREATE TABLE MyTable (
+  uid BIGINT,
+  user STRUCTURED<'com.example.User', name STRING, age INT NOT NULL>
+);
+
+-- Casts a row type into a structured type
+INSERT INTO MyTable SELECT 1, CAST(('Bob', 42) AS STRUCTURED<'com.example.User', name STRING, age INT>);
+```
+
+**More Information**
+* [FLINK-37861](https://issues.apache.org/jira/browse/FLINK-37861)
+* [FLIP-520](https://cwiki.apache.org/confluence/display/FLINK/FLIP-520%3A+Simplify+StructuredType+handling)
+* [STRUCTURED](https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/types/#user-defined-data-types)
+
+## Delta Join
+
+Flink 2.1 introduces a new DeltaJoin operator for stream processing jobs, along with optimizations for simple
+streaming join pipelines. Compared to a traditional streaming join, a delta join requires significantly
+less state, effectively mitigating issues related to large state, including resource bottlenecks,
+slow checkpointing, and lengthy job recovery times.
This feature is enabled by default.
+
+**More Information**
+* [FLINK-37836](https://issues.apache.org/jira/browse/FLINK-37836)
+* [Delta Join](https://cwiki.apache.org/confluence/display/FLINK/FLIP-486%3A+Introduce+A+New+DeltaJoin)
+
+## Multiple Regular Joins
+
+Streaming Flink jobs with multiple cascaded streaming joins often experience operational
+instability and performance degradation due to large state sizes. This release introduces a
+multi-join operator (`StreamingMultiJoinOperator`) that drastically reduces state size
+by eliminating intermediate results. The operator achieves this by processing joins across all input
+streams simultaneously within a single operator instance, storing only raw input records instead of
+propagated join output.
+
+This "zero intermediate state" approach primarily targets state reduction, offering substantial
+benefits in resource consumption and operational stability. This feature is now available for
+pipelines with multiple INNER/LEFT joins that share at least one common join key, and can be enabled with
+`SET 'table.optimizer.multi-join.enabled' = 'true'`.
+
+**Benchmark**: we conducted a benchmark comparing the multi-join operator with the default binary joins; more details can be found in the
+[MultiJoin Benchmark](https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/tuning/#multiple-regular-joins).
+
+**More Information**
+* [FLINK-37859](https://issues.apache.org/jira/browse/FLINK-37859)
+* [MultiJoin](https://cwiki.apache.org/confluence/display/FLINK/FLIP-516%3A+Multi-Way+Join+Operator)
+
+## Async Lookup Join Enhancements
+
+Support handling records in order based on the upsert key (the unique key in the input stream deduced by

Review Comment:
   How about: "In previous versions of the async lookup join, even if users set `table.exec.async-lookup.output-mode` to `ALLOW_UNORDERED`, the engine would still forcibly fall back to ordered mode when processing update streams to ensure correctness.
   Starting from version 2.1, the engine allows parallel processing of unrelated update records while still ensuring correctness, thereby achieving higher throughput when handling update streams." ?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org