Hi all

I'm pretty new to Spark and implementing my first non-trivial structured 
streaming job with outer joins. My environment is a Hortonworks HDP 3.1 cluster 
with Spark 2.3.2, working with Python.

I understood that I need to provide watermarks and join conditions for left 
outer joins to work. All my incoming Kafka streams have an attribute 
"LAST_MODIFICATION" which is well suited to indicate the event time, so I chose 
that for watermarking. Since I'm joining multiple topics whose incoming 
messages share common attribute names, I thought I'd prefix/nest all incoming 
messages. Something like

    entity1DF.select(struct("*").alias("entity1")) \
        .withWatermark("entity1.LAST_MODIFICATION", "10 minutes")

    entity2DF.select(struct("*").alias("entity2")) \
        .withWatermark("entity2.LAST_MODIFICATION", "10 minutes")

Now when I try to join two such streams, the query fails with an error saying 
that I need to use watermarks on both sides.

When I leave the watermarking attribute at the top level, everything works as 
expected, e.g.

    entity1DF.select(struct("*").alias("entity1"),
                     col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")) \
        .withWatermark("entity1_LAST_MODIFICATION", "10 minutes")

Before I hunt this down any further: is this a known limitation, or am I doing 
something fundamentally wrong?

-- 
CU, Joe

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
