Streams have no end until watermarked or closed, and joins need bounded state — hence the watermark requirement. I'd suggest thinking about the streaming nature of your data: do your joins only need to match bounded increments of the infinite streams (a windowed, time-constrained join), or do you need to re-join the entire contents of the streams as accumulated at checkpoints?
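To make the first option concrete, here is a minimal sketch of a watermarked stream-stream left outer join in PySpark Structured Streaming. The DataFrames (impressionsDF, clicksDF) and column names are hypothetical stand-ins, not from the original thread; the point is that both sides carry a watermark and the join condition bounds event time, so Spark can eventually drop old state:

    # Assumed: impressionsDF and clicksDF are streaming DataFrames
    # (e.g. read from Kafka) with event-time columns impressionTime
    # and clickTime. Names are illustrative only.
    from pyspark.sql import functions as F

    impressions = impressionsDF.withWatermark("impressionTime", "2 hours")
    clicks = clicksDF.withWatermark("clickTime", "3 hours")

    # The time-range predicate is what makes the outer join feasible:
    # it tells Spark how long unmatched rows must be retained before
    # being emitted with nulls.
    joined = impressions.join(
        clicks,
        F.expr("""
            clickAdId = impressionAdId AND
            clickTime >= impressionTime AND
            clickTime <= impressionTime + interval 1 hour
        """),
        "leftOuter",
    )

Without that event-time constraint (or an equality on a windowed column), Spark cannot bound the join state and will reject the outer join.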
From: Joe Ammann <j...@pyx.ch>
Reply: Joe Ammann <j...@pyx.ch>
Date: May 6, 2019 at 6:45:13 AM
To: user@spark.apache.org
Subject: Spark structured streaming watermarks on nested attributes

Hi all

I'm pretty new to Spark, implementing my first non-trivial structured streaming job with outer joins. My environment is a Hortonworks HDP 3.1 cluster with Spark 2.3.2, working in Python.

I understood that I need to provide watermarks and join conditions for left outer joins to work. All my incoming Kafka streams have an attribute "LAST_MODIFICATION" which is well suited to indicate the event time, so I chose that for watermarking.

Since I'm joining multiple topics where the incoming messages have common attributes, I thought I'd prefix/nest all incoming messages. Something like:

    entity1DF.select(struct("*").alias("entity1")).withWatermark("entity1.LAST_MODIFICATION")
    entity2DF.select(struct("*").alias("entity2")).withWatermark("entity2.LAST_MODIFICATION")

Now when I try to join two such streams, the join fails and tells me that I need to use watermarks.

When I leave the watermarking attribute "at the top level", everything works as expected, e.g.:

    entity1DF.select(struct("*").alias("entity1"), col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")).withWatermark("entity1_LAST_MODIFICATION")

Before I hunt this down any further: is this a known limitation? Or am I doing something fundamentally wrong?

-- 
CU, Joe

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
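For reference, the top-level-column workaround from the email can be sketched end to end as below. This is an illustrative reconstruction, not the poster's actual job: the join key (entity1.ID = entity2.ENTITY1_ID), the "10 minutes" watermark delay, and the one-hour join bound are all assumptions. Note also that withWatermark takes a second argument, the delay threshold, which the snippets in the email omit:

    # Assumed: entity1DF and entity2DF are streaming DataFrames read
    # from Kafka, each carrying a LAST_MODIFICATION event-time column.
    from pyspark.sql.functions import struct, col, expr

    # Keep the event-time column at the top level (not nested inside
    # the struct) so withWatermark can reference it directly.
    e1 = (entity1DF
          .select(struct("*").alias("entity1"),
                  col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION"))
          .withWatermark("entity1_LAST_MODIFICATION", "10 minutes"))

    e2 = (entity2DF
          .select(struct("*").alias("entity2"),
                  col("LAST_MODIFICATION").alias("entity2_LAST_MODIFICATION"))
          .withWatermark("entity2_LAST_MODIFICATION", "10 minutes"))

    # Hypothetical join key and time bound; the event-time range
    # condition is required for the left outer join to be accepted.
    joined = e1.join(
        e2,
        expr("""
            entity1.ID = entity2.ENTITY1_ID AND
            entity2_LAST_MODIFICATION >= entity1_LAST_MODIFICATION AND
            entity2_LAST_MODIFICATION <= entity1_LAST_MODIFICATION + interval 1 hour
        """),
        "leftOuter",
    )

The nested struct fields remain available under the entity1/entity2 prefixes after the join; only the watermark column needs to live at the top level.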