godfreyhe commented on a change in pull request #17651: URL: https://github.com/apache/flink/pull/17651#discussion_r744651757
########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} Review comment: NO "Batch" ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). Review comment: does window deduplication support outputing result per record ? ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. Review comment: `window Deduplication` -> `Window Deduplication` ? ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. Review comment: "keeping only the first one or the last one for each window and other partitioned keys. " need to improve ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). Review comment: only `Window Aggregation` is supported ? ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). +Otherwise, the optimizer won’t be able to translate the query. + +Flink uses `ROW_NUMBER()` to remove duplicates, just like the way of [Window Top-N query]({{< ref "docs/dev/table/sql/queries/window-topn" >}}). In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time. + +The following shows the syntax of the Window Deduplication statement: + +```sql +SELECT [column_list] +FROM ( + SELECT [column_list], + ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...] + ORDER BY time_attr [asc|desc]) AS rownum + FROM table_name) -- relation applied windowing TVF +WHERE rownum = 1 [AND conditions] +``` + +**Parameter Specification:** + +- `ROW_NUMBER()`: Assigns an unique, sequential number to each row, starting with one. +- `PARTITION BY window_start, window_end [, col_key1...]`: Specifies the partition columns which contain `window_start` and `window_end` columns i.e. the window deduplicate key. +- `ORDER BY time_attr [asc|desc]`: Specifies the ordering column, it must be a [time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}). Currently Flink supports [processing time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#processing-time) and [event time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#event-time). Ordering by ASC means keeping the first row, ordering by DESC means keeping the last row. +- `WHERE rownum = 1`: The `rownum = 1` is required for Flink to recognize this query is deduplication. + +{{< hint info >}} +Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query. +{{< /hint >}} + +## Example + +The following example shows how to keep last record for every tumbling 10 minutes window. + +```sql +Flink SQL> SELECT * + FROM ( + SELECT bidtime, price, item, supplier_id, window_start, window_end, + ROW_NUMBER() OVER (PARTITION BY window_start, window_end ORDER BY bidtime DESC) as rownum + FROM TABLE( + TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) + ) WHERE rownum <= 1; ++------------------+-------+------+-------------+------------------+------------------+--------+ +| bidtime | price | item | supplier_id | window_start | window_end | rownum | ++------------------+-------+------+-------------+------------------+------------------+--------+ +| 2020-04-15 08:09 | 5.00 | D | supplier4 | 2020-04-15 08:00 | 2020-04-15 08:10 | 1 | +| 2020-04-15 08:17 | 6.00 | F | supplier5 | 2020-04-15 08:10 | 2020-04-15 08:20 | 1 | ++------------------+-------+------+-------------+------------------+------------------+--------+ Review comment: it's better we can provide the source table ddl and source table data, which could help users understand it more easily ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). +Otherwise, the optimizer won’t be able to translate the query. + +Flink uses `ROW_NUMBER()` to remove duplicates, just like the way of [Window Top-N query]({{< ref "docs/dev/table/sql/queries/window-topn" >}}). In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time. + +The following shows the syntax of the Window Deduplication statement: + +```sql +SELECT [column_list] +FROM ( + SELECT [column_list], + ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...] + ORDER BY time_attr [asc|desc]) AS rownum + FROM table_name) -- relation applied windowing TVF +WHERE rownum = 1 [AND conditions] +``` + +**Parameter Specification:** + +- `ROW_NUMBER()`: Assigns an unique, sequential number to each row, starting with one. +- `PARTITION BY window_start, window_end [, col_key1...]`: Specifies the partition columns which contain `window_start` and `window_end` columns i.e. the window deduplicate key. +- `ORDER BY time_attr [asc|desc]`: Specifies the ordering column, it must be a [time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}). Currently Flink supports [processing time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#processing-time) and [event time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#event-time). Ordering by ASC means keeping the first row, ordering by DESC means keeping the last row. +- `WHERE rownum = 1`: The `rownum = 1` is required for Flink to recognize this query is deduplication. + +{{< hint info >}} +Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query. +{{< /hint >}} + +## Example + +The following example shows how to keep last record for every tumbling 10 minutes window. Review comment: 10 minutes tumbling window ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). +Otherwise, the optimizer won’t be able to translate the query. + +Flink uses `ROW_NUMBER()` to remove duplicates, just like the way of [Window Top-N query]({{< ref "docs/dev/table/sql/queries/window-topn" >}}). In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time. + +The following shows the syntax of the Window Deduplication statement: + +```sql +SELECT [column_list] +FROM ( + SELECT [column_list], + ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...] + ORDER BY time_attr [asc|desc]) AS rownum + FROM table_name) -- relation applied windowing TVF +WHERE rownum = 1 [AND conditions] +``` + +**Parameter Specification:** + +- `ROW_NUMBER()`: Assigns an unique, sequential number to each row, starting with one. +- `PARTITION BY window_start, window_end [, col_key1...]`: Specifies the partition columns which contain `window_start` and `window_end` columns i.e. the window deduplicate key. Review comment: `window_start` and `window_end` columns i.e. the window deduplicate key -> `window_start`, `window_end` and other partition keys ? ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). +Otherwise, the optimizer won’t be able to translate the query. + +Flink uses `ROW_NUMBER()` to remove duplicates, just like the way of [Window Top-N query]({{< ref "docs/dev/table/sql/queries/window-topn" >}}). In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time. + +The following shows the syntax of the Window Deduplication statement: + +```sql +SELECT [column_list] +FROM ( + SELECT [column_list], + ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...] + ORDER BY time_attr [asc|desc]) AS rownum + FROM table_name) -- relation applied windowing TVF +WHERE rownum = 1 [AND conditions] Review comment: actually, `rownum <= 1` and `rownum < 2` are all accpted ########## File path: docs/content.zh/docs/dev/table/sql/queries/window-deduplication.md ########## @@ -0,0 +1,93 @@ +--- +title: "窗口去重" +weight: 16 +type: docs +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +# Window Deduplication +{{< label Batch >}} {{< label Streaming >}} + +Window Deduplication is a special [Deduplication]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) which removes rows that duplicate over a set of columns, keeping only the first one or the last one for each window and other partitioned keys. + +For streaming queries, unlike regular Deduplicate on continuous tables, window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. +Therefore, window Deduplication queries have better performance if users don't need results updated per record. Usually, Window Deduplication is used with [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) directly. Besides, Window Deduplication could be used with other operations based on [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}), such as [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}), [Window TopN]({{< ref "docs/dev/table/sql/queries/window-topn">}}) and [Window Join]({{< ref "docs/dev/table/sql/queries/window-join">}}). + +Window Deduplication can be defined in the same syntax as regular Deduplication, see [Deduplication documentation]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) for more information. +Besides that, Window Deduplication requires the `PARTITION BY` clause contains `window_start` and `window_end` columns of the relation applied [Windowing TVF]({{< ref "docs/dev/table/sql/queries/window-tvf" >}}) or [Window Aggregation]({{< ref "docs/dev/table/sql/queries/window-agg" >}}). +Otherwise, the optimizer won’t be able to translate the query. + +Flink uses `ROW_NUMBER()` to remove duplicates, just like the way of [Window Top-N query]({{< ref "docs/dev/table/sql/queries/window-topn" >}}). In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time. + +The following shows the syntax of the Window Deduplication statement: + +```sql +SELECT [column_list] +FROM ( + SELECT [column_list], + ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...] + ORDER BY time_attr [asc|desc]) AS rownum + FROM table_name) -- relation applied windowing TVF +WHERE rownum = 1 [AND conditions] +``` + +**Parameter Specification:** + +- `ROW_NUMBER()`: Assigns an unique, sequential number to each row, starting with one. +- `PARTITION BY window_start, window_end [, col_key1...]`: Specifies the partition columns which contain `window_start` and `window_end` columns i.e. the window deduplicate key. +- `ORDER BY time_attr [asc|desc]`: Specifies the ordering column, it must be a [time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}). Currently Flink supports [processing time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#processing-time) and [event time attribute]({{< ref "docs/dev/table/concepts/time_attributes" >}}#event-time). Ordering by ASC means keeping the first row, ordering by DESC means keeping the last row. +- `WHERE rownum = 1`: The `rownum = 1` is required for Flink to recognize this query is deduplication. + +{{< hint info >}} +Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query. Review comment: `the optimizer won’t be able to translate the query` -> `the optimizer won’t be able to translate the query to window deduplication` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
