[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

GitBox Mon, 15 Aug 2022 18:08:36 -0700


swuferhong commented on code in PR #20513:
URL: https://github.com/apache/flink/pull/20513#discussion_r946254106



##########
docs/content/docs/dev/table/sql/queries/hints.md:
##########
@@ -84,4 +84,222 @@ insert into kafka_table1 /*+ 
OPTIONS('sink.partitioner'='round-robin') */ select
 
 ```
 
+## Join Hints
+
+{{< label Batch >}}
+
+Join hints allow users to manually choose the join strategy in optimizer, and 
manually optimize the execution plan to improve the performance of the query. 
Now only support in batch mode.
+
+### Join Hints Strategies
+The following Join strategies are currently supported in Flink SQL for batch 
job:
+
+
+**BroadCast Join**
+
+
+In this Join strategy, the data on the build side (usually a small table) will 
be broadcast to each downstream operator, 
+and the data on the probe side (usually a large table) will be sent directly 
to the downstream operator with Forward. 
+Then the data on the build side (small table) will be built into a Hash Table 
for the probe side to query.
+
+**Hash Shuffle Join**
+
+
+In this Join strategy, the data on the Build side and the Probe side will be 
shuffled with the join key, and the data 
+with the same key will be distributed to the same downstream operator. Then 
the data on the build side (smaller table) 
+will be built into a Hash Table for the probe side to query.
+
+
+Compared with the `Hash Shuffle Join` strategy, the `BroadCast Join` strategy 
does not need to shuffle the probe side, which saves a lot of 
+shuffle time. Therefore, when a table is tiny, the `BroadCast Join` strategy 
is usually selected to avoid the shuffle 
+cost and improve computing performance. However, when the scale of the small 
table is large, the `BroadCast Join` strategy 
+is not applicable, because the overhead of redundant data transmission will 
exceed the overhead of shuffle.
+
+**Sort Merge Join**
+
+This Join strategy is aimed at the scenario of joining between two large 
tables or the scenario that the data at both 
+sides of the join is already in order. This strategy first shuffles the data 
on both sides of the join to the downstream 
+operator according to the Join Key. Then the downstream operator sorts the 
data before joining, and finally joins the data at both ends. 
+This strategy eliminates the need to load all data on one side into memory, 
thus relieving the pressure on computational memory.
+
+**Nested Loop Join**
+
+In this Join strategy, the probe side is used as an Outer Loop, and the build 
side is used as an Inner Loop, and the data is joined through two-layer loops.
+
+### Syntax
+The join hints syntax in Flink:
+```sql
+# query:
+SELECT /*+ hint_content[, hint_content] */ ...
+
+hint_content:
+    hint_strategy_name(hint_item[, hint_item])
+
+hint_strategy_name:
+    supported_join_hint_name
+
+hint_item:
+    string_literal
+```
+
+### Examples
+
+**BroadCast Join**
+
+
+*NOTE:* `BroadCast Join` strategy only supports join with equivalence join 
condition. And it not supports `Full Outer Join` join type.
+```sql
+CREATE TABLE t1 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t2 (id BIGINT, name STRING, age INT) WITH (...);
+CREATE TABLE t3 (id BIGINT, name STRING, age INT) WITH (...);
+
+-- Select broadcast join strategy with t1, t1 will be the broadcast table.
+SELECT /*+ BROADCAST(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id;
+
+-- Select broadcast join strategy with t1 and t3.
+-- For this case, when join between t1 and t2, t1 will be the broadcast table,
+-- after that, when join between the result after t1 joins t2 and t3, t3 will 
be the broadcast table.

Review Comment:
   > ", after that," -> ". And"
   
   done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] swuferhong commented on a diff in pull request #20513: [FLINK-28858][docs] Add document to describe join hints for batch sql

Reply via email to