This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new a575e80  [SPARK-34446][SS][DOCS] Update doc for stream-stream join 
(full outer + left semi)
a575e80 is described below

commit a575e805a18a515f8707f74cf2b22777474f2f06
Author: Cheng Su <chen...@fb.com>
AuthorDate: Thu Feb 18 09:34:33 2021 +0900

    [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + 
left semi)
    
    ### What changes were proposed in this pull request?
    
    Per discussion in 
https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057,
 we should add documentation for added new features of full outer and left semi 
joins into SS programming guide.
    
    * Reworded the section for "Outer Joins with Watermarking", to make it work 
for full outer join. Updated the code snippet to show up full outer and left 
semi join.
    * Added one section for "Semi Joins with Watermarking", similar to "Outer 
Joins with Watermarking".
    * Updated "Support matrix for joins in streaming queries" to reflect latest 
fact for full outer and left semi join.
    
    ### Why are the changes needed?
    
    Good for users and developers to follow guide to try out these two new 
features.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. They will see the corresponding updated guide.
    
    ### How was this patch tested?
    
    No, just documentation change. Previewed the markdown file in browser.
    Also attached here for the change to the "Support matrix for joins in 
streaming queries" table.
    
    <img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" 
src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png";>
    
    Closes #31572 from c21/ss-doc.
    
    Authored-by: Cheng Su <chen...@fb.com>
    Signed-off-by: Jungtaek Lim <kabhwan.opensou...@gmail.com>
---
 docs/structured-streaming-programming-guide.md | 60 +++++++++++++++++++-------
 1 file changed, 45 insertions(+), 15 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index bea38ed..9c6d47b 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1098,7 +1098,7 @@ likely is the engine going to process it.
 Structured Streaming supports joining a streaming Dataset/DataFrame with a 
static Dataset/DataFrame
 as well as another streaming Dataset/DataFrame. The result of the streaming 
join is generated
 incrementally, similar to the results of streaming aggregations in the 
previous section. In this
-section we will explore what type of joins (i.e. inner, outer, etc.) are 
supported in the above
+section we will explore what type of joins (i.e. inner, outer, semi, etc.) are 
supported in the above
 cases. Note that in all the supported join types, the result of the join with 
a streaming
 Dataset/DataFrame will be the exactly the same as if it was with a static 
Dataset/DataFrame
 containing the same data in the stream.
@@ -1318,8 +1318,8 @@ A watermark delay of "2 hours" guarantees that the engine 
will never drop any da
  2 hours delayed. But data delayed by more than 2 hours may or may not get 
processed.
 
 ##### Outer Joins with Watermarking
-While the watermark + event-time constraints is optional for inner joins, for 
left and right outer
-joins they must be specified. This is because for generating the NULL results 
in outer join, the
+While the watermark + event-time constraints is optional for inner joins, for 
outer joins
+they must be specified. This is because for generating the NULL results in 
outer join, the
 engine must know when an input row is not going to match with anything in 
future. Hence, the
 watermark + event-time constraints must be specified for generating correct 
results. Therefore,
 a query with outer-join will look quite like the ad-monetization example 
earlier, except that
@@ -1337,7 +1337,7 @@ impressionsWithWatermark.join(
     clickTime >= impressionTime AND
     clickTime <= impressionTime + interval 1 hour
     """),
-  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter"
+  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter", 
"fullOuter", "leftSemi"
  )
 
 {% endhighlight %}
@@ -1352,7 +1352,7 @@ impressionsWithWatermark.join(
     "clickAdId = impressionAdId AND " +
     "clickTime >= impressionTime AND " +
     "clickTime <= impressionTime + interval 1 hour "),
-  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter"
+  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter", 
"fullOuter", "leftSemi"
 );
 
 {% endhighlight %}
@@ -1369,7 +1369,7 @@ impressionsWithWatermark.join(
     clickTime >= impressionTime AND
     clickTime <= impressionTime + interval 1 hour
     """),
-  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter"
+  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter", 
"fullOuter", "leftSemi"
 )
 
 {% endhighlight %}
@@ -1386,7 +1386,7 @@ joined <- join(
       "clickAdId = impressionAdId AND",
       "clickTime >= impressionTime AND",
       "clickTime <= impressionTime + interval 1 hour"),
-  "left_outer"                 # can be "inner", "left_outer", "right_outer"
+  "left_outer"                 # can be "inner", "left_outer", "right_outer", 
"full_outer", "left_semi"
 ))
 
 {% endhighlight %}
@@ -1415,6 +1415,18 @@ generation of the outer result may get delayed if there 
no new data being receiv
 *In short, if any of the two input streams being joined does not receive data 
for a while, the
 outer (both cases, left or right) output may get delayed.*
 
+##### Semi Joins with Watermarking
+A semi join returns values from the left side of the relation that has a match 
with the right.
+It is also referred to as a left semi join. Similar to outer joins, watermark 
+ event-time
+constraints must be specified for semi join. This is to evict unmatched input 
rows on left side,
+the engine must know when an input row on left side is not going to match with 
anything on right
+side in future.
+
+###### Semantic Guarantees of Stream-stream Semi Joins with Watermarking
+{:.no_toc}
+Semi joins have the same guarantees as [inner 
joins](#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking)
+regarding watermark delays and whether data will be dropped or not.
+
 ##### Support matrix for joins in streaming queries
 
 <table class ="table">
@@ -1434,8 +1446,8 @@ outer (both cases, left or right) output may get delayed.*
       </td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
-    <td rowspan="4" style="vertical-align: middle;">Static</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
+    <td rowspan="5" style="vertical-align: middle;">Static</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">Supported, not stateful</td>
   </tr>
@@ -1452,8 +1464,12 @@ outer (both cases, left or right) output may get 
delayed.*
     <td style="vertical-align: middle;">Not supported</td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Static</td>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">Supported, not stateful</td>
+  </tr>
+  <tr>
+    <td rowspan="5" style="vertical-align: middle;">Static</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">Supported, not stateful</td>
   </tr>
@@ -1470,8 +1486,12 @@ outer (both cases, left or right) output may get 
delayed.*
     <td style="vertical-align: middle;">Not supported</td>
   </tr>
   <tr>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
-    <td rowspan="4" style="vertical-align: middle;">Stream</td>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">Not supported</td>
+  </tr>
+  <tr>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
+    <td rowspan="5" style="vertical-align: middle;">Stream</td>
     <td style="vertical-align: middle;">Inner</td>
     <td style="vertical-align: middle;">
       Supported, optionally specify watermark on both sides +
@@ -1494,9 +1514,19 @@ outer (both cases, left or right) output may get 
delayed.*
   </tr>
   <tr>
     <td style="vertical-align: middle;">Full Outer</td>
-    <td style="vertical-align: middle;">Not supported</td>
+    <td style="vertical-align: middle;">
+      Conditionally supported, must specify watermark on one side + time 
constraints for correct
+      results, optionally specify watermark on the other side for all state 
cleanup
+    </td>
+  </tr>
+  <tr>
+    <td style="vertical-align: middle;">Left Semi</td>
+    <td style="vertical-align: middle;">
+      Conditionally supported, must specify watermark on right + time 
constraints for correct
+      results, optionally specify watermark on left for all state cleanup
+    </td>
   </tr>
- <tr>
+  <tr>
     <td></td>
     <td></td>
     <td></td>


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to