Re: Spark structured streaming leftOuter join not working as I expect

2019-06-10 Thread Joe Ammann
Hi all, it took me some time to extract the issues into a piece of standalone code. I created the following gist: https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 It has messages for 4 topics A/B/C/D and a simple Python program which shows 6 use cases, with my expectations
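
Below is a minimal PySpark sketch of the pattern the gist exercises: a stream-stream left outer join between two of the Kafka topics (here A and B). Apart from the topic names A and B, everything (broker address, column names, watermark and window durations) is an assumption for illustration, not taken from the gist.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("left-outer-join-sketch").getOrCreate()

    def read_topic(topic, prefix):
        # Read one Kafka topic; the key carries the join id and the Kafka
        # record timestamp stands in for the event time (an assumption).
        return (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", topic)
                .load()
                .selectExpr(f"CAST(key AS STRING) AS {prefix}_id",
                            f"timestamp AS {prefix}_time"))

    a = read_topic("A", "a").withWatermark("a_time", "1 minute")
    b = read_topic("B", "b").withWatermark("b_time", "1 minute")

    # A left outer stream-stream join needs watermarks on both sides plus a
    # time-range condition, so Spark can eventually emit left rows that
    # never found a match.
    joined = a.join(
        b,
        expr("a_id = b_id AND "
             "b_time BETWEEN a_time AND a_time + interval 20 seconds"),
        "leftOuter")

    query = joined.writeStream.format("console").start()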

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Joe Ammann
…a same case, but should be good to know once you're dealing with streaming-streaming joins. > Thanks, Jungtaek Lim (HeartSaVioR) > 1. https://issues.apache.org/jira/browse/SPARK-26154 > On Tue, Jun 4, 2019 at 9:31 PM Joe Ammann wrote: > Hi all

Spark structured streaming leftOuter join not working as I expect

2019-06-04 Thread Joe Ammann
Hi all, sorry, tl;dr: I'm on my first Python Spark structured streaming app, which in the end joins messages from ~10 different Kafka topics. I've recently upgraded to Spark 2.4.3, which has resolved all the issues with time handling (watermarks, join windows) I had before with Spark 2.3.2. My
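
For reference, a hedged sketch of how one of those ~10 topic sources might be wired up (schema, topic and column names are assumptions): parse the Kafka value as JSON and expose the event-time column that the watermark and join windows are declared on.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, TimestampType)

    spark = SparkSession.builder.appName("kafka-json-source-sketch").getOrCreate()

    # Assumed message schema: an id plus an event-time attribute.
    schema = StructType([
        StructField("id", StringType()),
        StructField("LAST_MODIFICATION", TimestampType()),
    ])

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "topic1")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              .withWatermark("LAST_MODIFICATION", "5 minutes"))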

Watermark handling on initial query start (Structured Streaming)

2019-05-20 Thread Joe Ammann
Hi all, I'm currently developing a Spark structured streaming application which joins/aggregates messages from ~7 Kafka topics and produces messages onto another Kafka topic. Quite often in my development cycle, I want to "reprocess from scratch": I stop the program, delete the target topic
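
A sketch of what "reprocess from scratch" can look like in code, with placeholder paths and topic names: wipe the checkpoint directory and restart the query from the earliest Kafka offsets.

    import shutil
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reprocess-from-scratch").getOrCreate()

    checkpoint_dir = "/tmp/app-checkpoint"
    shutil.rmtree(checkpoint_dir, ignore_errors=True)  # forget all previous state

    source = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "source_topic")
              .option("startingOffsets", "earliest")  # replay the whole topic
              .load())

    # Kafka sink: pass key/value through to the (re-created) target topic.
    query = (source.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "target_topic")
             .option("checkpointLocation", checkpoint_dir)
             .start())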

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
t;sharedId", window("20 seconds", "10 seconds") > > // ProcessingTime trigger with two-seconds micro-batch interval > > |df.writeStream .format("console") .trigger(Trigger.ProcessingTime("2 > seconds")) .start()| > >

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
…progress if no new data is coming in. This is how I came to my suspicion on how it works internally. I understand that it is quite uncommon to have such "slowly moving topics", but unfortunately in my use case I have them. > On Tue, May 14, 2019 at 3:49 PM Joe Ammann mailto:j...@

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, sorry, this was a typo in the pseudo-code I sent. Of course, what you suggested (using the same eventtime attribute for the watermark and the window) is what my code does in reality. Sorry to confuse people. On 5/14/19 4:14 PM, suket arora wrote: > Hi Joe, > As per the spark
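
In other words (a minimal sketch, assuming a streaming DataFrame events with columns sharedId and eventtime as in the earlier sketch), the watermark and the window must be declared on the same event-time column:

    from pyspark.sql.functions import window

    # Watermark and window on the SAME event-time column ('eventtime').
    grouped = (events
               .withWatermark("eventtime", "10 seconds")
               .groupBy("sharedId",
                        window("eventtime", "20 seconds", "10 seconds"))
               .count())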

Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi all, I'm fairly new to Spark structured streaming and I'm only starting to develop an understanding of the watermark handling. Our application reads data from a Kafka input topic and, as one of the first steps, has to group incoming messages. Those messages come in bulks, e.g. 5 messages
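
A hedged sketch of that first grouping step, with assumed topic and column names: collect the messages of one bulk (same id, arriving close together) into a single row via a window over event time.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, collect_list

    spark = SparkSession.builder.appName("bulk-grouping-sketch").getOrCreate()

    messages = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "input_topic")
                .load()
                .selectExpr("CAST(key AS STRING) AS bulk_id",
                            "CAST(value AS STRING) AS payload",
                            "timestamp AS eventtime"))

    # One output row per bulk: all payloads sharing a bulk_id within a
    # 10-second event-time window, emitted once the watermark passes it.
    bulks = (messages
             .withWatermark("eventtime", "30 seconds")
             .groupBy("bulk_id", window("eventtime", "10 seconds"))
             .agg(collect_list("payload").alias("messages")))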

Re: Spark structured streaming watermarks on nested attributes

2019-05-07 Thread Joe Ammann
Hi Yuanjian On 5/7/19 4:55 AM, Yuanjian Li wrote: > Hi Joe > I think you met this issue: https://issues.apache.org/jira/browse/SPARK-27340 > You can check the description in the Jira and the PR. We also met this in our production env and fixed it with the provided PR. > The PR is still in review. cc

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
…When I nest the attributes and use "entityX.LAST_MODIFICATION" (notice the dot for the nesting), the joins fail. I have a feeling that the Spark execution plan gets somewhat confused, because in the latter case there are multiple fields called "LAST_MODIFICATION" with differing nesting
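
One workaround consistent with that observation (a sketch only; entity_x stands for the streaming DataFrame carrying the nested entityX struct, and the new column name is an assumption) is to pull the nested timestamp up into a uniquely named top-level column before declaring the watermark:

    from pyspark.sql.functions import col

    # Flatten the nested attribute so each side of the join has a distinctly
    # named top-level event-time column instead of an ambiguous nested one.
    entity_x_flat = (entity_x
                     .withColumn("X_LAST_MODIFICATION",
                                 col("entityX.LAST_MODIFICATION"))
                     .withWatermark("X_LAST_MODIFICATION", "5 minutes"))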

Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Joe Ammann
Hi all, I'm pretty new to Spark and implementing my first non-trivial structured streaming job with outer joins. My environment is a Hortonworks HDP 3.1 cluster with Spark 2.3.2, working with Python. I understood that I need to provide watermarks and join conditions for left outer joins to