[
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386583#comment-16386583
]
Yuriy Bondaruk commented on SPARK-18791:
----------------------------------------
Shouldn't it be marked as resolved? Stream-stream joins are already supported
in Spark 2.3:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins
> Stream-Stream Joins
> -------------------
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
> Assignee: Tathagata Das
> Priority: Major
>
> Stream-stream join is a much-requested but missing feature in Structured
> Streaming. While the join API exists in Datasets and DataFrames, it throws
> UnsupportedOperationException when applied between two streaming
> Datasets/DataFrames. To support this, we have to maintain the same semantics
> as other Structured Streaming operations: the result of the operation after
> consuming the two data streams up to positions/offsets X and Y, respectively,
> must be the same as a single batch join operation on all the data up to
> positions X and Y, respectively. To achieve this, the execution has to buffer
> past data (i.e. streaming state) from each stream, so that future data can be
> matched against past data. Here are a few high-level requirements.
> - Buffer past rows as streaming state (using StateStore), and join new rows
> against the buffered past rows.
> - Support state cleanup using the event time watermark when possible.
> - Support different types of joins (inner, left outer, and right outer are in
> highest demand for ETL/enrichment type use cases [Kafka -> best-effort enrich
> -> write to S3])
> - Support cascading join operations (i.e. joining more than 2 streams)
> - Support multiple output modes (Append mode is in highest demand for
> enabling ETL/enrichment type use cases)
> All the work to incrementally build this is going to be represented by this
> JIRA, with specific subtasks for each step. At this point, the rough
> direction is as follows:
> - Implement stream-stream inner join in Append Mode, supporting multiple
> cascaded joins.
> - Extend it to stream-stream left/right outer joins in Append Mode.
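
To make the buffering requirement in the description concrete, here is a
minimal sketch in plain Python of a symmetric inner join that keeps past rows
from each side as state and drops them once they fall behind an event-time
watermark. It is not Spark's implementation and does not use StateStore; the
names StreamStreamInnerJoin, process, evict and batch_inner_join are
hypothetical and exist only for this illustration.

```python
from collections import defaultdict


class StreamStreamInnerJoin:
    """Toy symmetric inner join over two keyed streams (illustration only)."""

    def __init__(self):
        # Buffered state per side: key -> list of (event_time, value).
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def process(self, side, key, event_time, value):
        """Buffer one incoming row and return the joined rows it produces."""
        other = "right" if side == "left" else "left"
        self.state[side][key].append((event_time, value))
        joined = []
        # Match the new row against every buffered row from the other side.
        for _, other_value in self.state[other].get(key, ()):
            if side == "left":
                joined.append((key, value, other_value))
            else:
                joined.append((key, other_value, value))
        return joined

    def evict(self, watermark):
        """Drop buffered rows whose event time is older than the watermark."""
        for buffers in self.state.values():
            for key in list(buffers):
                kept = [(t, v) for t, v in buffers[key] if t >= watermark]
                if kept:
                    buffers[key] = kept
                else:
                    del buffers[key]


def batch_inner_join(left_rows, right_rows):
    """Reference batch join over rows shaped (key, event_time, value)."""
    return [
        (lk, lv, rv)
        for lk, _, lv in left_rows
        for rk, _, rv in right_rows
        if lk == rk
    ]
```

Feeding rows to process in any interleaving, with no eviction, yields the same
set of joined rows as batch_inner_join over all rows seen so far, which is the
semantic requirement stated above. Calling evict is only safe when the query
guarantees that rows older than the watermark can no longer match (for
example through a time-range join condition); this sketch does not check
that.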