Re: What is the current canonical way to join more than 2 watermarked streams (Spark 3.5.6)?

Jungtaek Lim Thu, 26 Jun 2025 05:07:55 -0700

Hi,

Starting from Spark 4.0.0, we support multiple stateful operators in append
mode, so you can chain stream-stream joins.

One thing to watch out for: the output of a stream-stream join has two
different event time columns, so it is ambiguous which one should be taken
as the event time column. Ideally the engine would track which column the
following operator uses as the event time column, but we haven't had time
to do this.

So as of now, you need to stop one of the two columns from being treated as
an event time column by calling
`withMetadata(<column to be excluded>, Metadata.empty)` on the join output
before joining it with another stream.
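
To make that concrete, here is a rough, untested PySpark sketch of the idea,
not a definitive implementation. The rate sources, the column names
(a_id, a_time, ...), the watermark delays, the join conditions and the
helper names clear_event_time / join_three_streams are all made up for
illustration. It assumes Spark 4.0.0+ and that passing an empty dict to
PySpark's `withMetadata` has the same effect as `Metadata.empty` in Scala.

```python
def clear_event_time(df, column):
    """Replace `column`'s metadata with empty metadata.

    Assumption: an empty dict is the PySpark counterpart of Metadata.empty,
    so the column is no longer treated as an event time column.
    """
    return df.withMetadata(column, {})


def join_three_streams(spark):
    """Chain two stream-stream joins over three watermarked streams."""
    # Imported lazily so the helper above can be used without a Spark session.
    from pyspark.sql import functions as F

    def rate_stream(prefix, delay):
        # Hypothetical input: the built-in rate source, renamed per stream.
        return (
            spark.readStream.format("rate").option("rowsPerSecond", 5).load()
            .select(
                F.col("value").alias(f"{prefix}_id"),
                F.col("timestamp").alias(f"{prefix}_time"),
            )
            .withWatermark(f"{prefix}_time", delay)
        )

    s1 = rate_stream("a", "10 seconds")
    s2 = rate_stream("b", "20 seconds")
    s3 = rate_stream("c", "30 seconds")

    # First join: the output carries two event time columns (a_time, b_time).
    joined12 = s1.join(
        s2,
        F.expr("a_id = b_id AND "
               "b_time BETWEEN a_time AND a_time + INTERVAL 1 MINUTE"),
        "inner",
    )

    # Keep a_time as the only event time column before the next join.
    joined12 = clear_event_time(joined12, "b_time")

    # Second join against the third watermarked stream.
    return joined12.join(
        s3,
        F.expr("a_id = c_id AND "
               "c_time BETWEEN a_time AND a_time + INTERVAL 1 MINUTE"),
        "inner",
    )
```

You would then start it in append mode, for example with
`join_three_streams(spark).writeStream.format("console").outputMode("append").start()`.
Which of the two columns to keep as event time depends on what the
downstream join condition refers to.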

Please give it a try and let me know if this does not work for you.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Jun 25, 2025 at 4:03 PM cheapsolutionarchit...@gmail.com <
cheapsolutionarchit...@gmail.com> wrote:

> Hi,
>
> Given two Spark Structured Streaming streams, joining them as described in
>
> https://spark.apache.org/docs/3.5.6/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
>
> just works.
>
> Now if I want to join three streams using the same technique, Spark
> complains about multiple possible watermarks. I have a rough
> understanding of what happened, and concluded this works as designed.
>
> But as I am certainly not the only one who tried that, what is the
> canonical way of doing this? My first idea was like: I'm going to join
> S1,S2 with their corresponding watermarks, then write that result to
> disk, possibly into a delta table, read the result with another stream
> and join this one with the remaining stream and third watermark.
>
> Is there some other way? Or is this the current canonical way of joining
> more than two streams that carry a watermark?
>
> Best Regards
>
> M.
