Hi all, I'm pretty new to Spark and am implementing my first non-trivial Structured Streaming job with outer joins. My environment is a Hortonworks HDP 3.1 cluster with Spark 2.3.2, and I'm working in Python.
I understand that I need to provide watermarks and join conditions for left outer joins to work. All my incoming Kafka streams have an attribute "LAST_MODIFICATION" that is well suited as the event time, so I chose it for watermarking. Since I'm joining multiple topics whose messages share common attribute names, I thought I'd prefix/nest all incoming messages, something like:

entity1DF.select(struct("*").alias("entity1")).withWatermark("entity1.LAST_MODIFICATION", "10 minutes")
entity2DF.select(struct("*").alias("entity2")).withWatermark("entity2.LAST_MODIFICATION", "10 minutes")

Now when I try to join two such streams, the query fails and tells me that I need to use watermarks. When I leave the watermarking attribute at the top level, everything works as expected, e.g.:

entity1DF.select(struct("*").alias("entity1"), col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")).withWatermark("entity1_LAST_MODIFICATION", "10 minutes")

Before I hunt this down any further: is this a known limitation, or am I doing something fundamentally wrong?

-- 
CU, Joe

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org