Streams have no end until they are watermarked or closed, and joins need
bounded datasets to work against. So consider the streaming nature of your
data: do your joins only need to match bounded increments/slices of the
infinite streams, or do you need to re-join the entire contents of the
streams accumulated at checkpoints? In the first case, watermarks plus an
event-time range condition in the join are what bound the state Spark has
to keep.
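
Concretely, for a stream-stream left outer join Spark needs a watermark on
both inputs and an event-time range condition in the join itself, so it
knows when buffered state can be dropped. A rough sketch of that shape (the
ID/ENTITY1_ID key columns and the time thresholds below are made-up
placeholders, not taken from your job):

from pyspark.sql.functions import col, expr

left = entity1DF.select(
    col("ID").alias("e1_id"),                 # hypothetical join key
    col("LAST_MODIFICATION").alias("e1_ts")   # event time from the message
).withWatermark("e1_ts", "10 minutes")

right = entity2DF.select(
    col("ENTITY1_ID").alias("e2_fk"),         # hypothetical foreign key
    col("LAST_MODIFICATION").alias("e2_ts")
).withWatermark("e2_ts", "10 minutes")

# The time range lets Spark age out join state; without it, an outer join
# over unbounded streams would have to buffer everything forever.
joined = left.join(
    right,
    expr("e1_id = e2_fk AND e2_ts BETWEEN e1_ts AND e1_ts + interval 1 hour"),
    "leftOuter"
)

With those bounds in place, each side only has to buffer roughly the
watermark delay plus the join window, instead of the whole stream.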


From: Joe Ammann <j...@pyx.ch>
Reply: Joe Ammann <j...@pyx.ch>
Date: May 6, 2019 at 6:45:13 AM
To: user@spark.apache.org
Subject:  Spark structured streaming watermarks on nested attributes

Hi all

I'm pretty new to Spark and implementing my first non-trivial structured
streaming job with outer joins. My environment is a Hortonworks HDP 3.1
cluster with Spark 2.3.2, working with Python.

I understood that I need to provide watermarks and join conditions for left
outer joins to work. All my incoming Kafka streams have an attribute
"LAST_MODIFICATION" which is well suited to indicate the event time, so I
chose that for watermarking. Since I'm joining from multiple topics where
the incoming messages have common attributes, I thought I'd prefix/nest all
incoming messages. Something like

entity1DF.select(struct("*").alias("entity1")) \
    .withWatermark("entity1.LAST_MODIFICATION", "10 minutes")  # delay threshold is required; value is an example

entity2DF.select(struct("*").alias("entity2")) \
    .withWatermark("entity2.LAST_MODIFICATION", "10 minutes")


Now when I try to join two such streams, the query fails and tells me that
I need to provide watermarks.

When I leave the watermarking attribute "at the top level", everything
works as expected, e.g.

entity1DF.select(
    struct("*").alias("entity1"),
    col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")
).withWatermark("entity1_LAST_MODIFICATION", "10 minutes")


Before I hunt this down any further: is this a known limitation, or am I
doing something fundamentally wrong?

-- 
CU, Joe
