Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Anastasios, Many thanks for your time and your suggestions! I tried again with various settings for the watermarks and the trigger time: watermark 20 sec / trigger 2 sec, watermark 10 sec / trigger 1 sec, watermark 20 sec / trigger 0 sec. I also tried continuous processing mode, but since I

Re: Handling of watermark in structured streaming

2019-05-14 Thread Suket Arora
df = inputStream.withWatermark("eventtime", "20 seconds").groupBy("sharedId", window("eventtime", "20 seconds", "10 seconds")) // ProcessingTime trigger with a two-second micro-batch interval df.writeStream .format("console") .trigger(Trigger.ProcessingTime("2 seconds")) .start() On Tue, 14 May
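Reconstructed as a self-contained Scala sketch, the fragment above might look like the following. The Kafka broker address, topic name, and the `sharedId`/`eventtime` column derivations are assumptions for illustration; only the watermark, window, and trigger settings come from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

// Assumed Kafka source; the real schema of the input topic is not shown in the thread
val inputStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "input-topic")                  // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS sharedId", "timestamp AS eventtime")

// Watermark on the same event-time column that drives the window (see the
// documentation quote later in the thread: they must match in Append mode)
val df = inputStream
  .withWatermark("eventtime", "20 seconds")
  .groupBy(col("sharedId"), window(col("eventtime"), "20 seconds", "10 seconds"))
  .count()

// ProcessingTime trigger: a micro-batch every 2 seconds, results to the console
df.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()
```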

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Anastasios On 5/14/19 4:15 PM, Anastasios Zouzias wrote: > Hi Joe, > > How often do you trigger your mini-batch? Maybe you can specify the trigger > time explicitly with a low value or, even better, leave it off. > > See: >

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Sorry, that was a typo in the pseudo-code I sent. Of course, what you suggested (using the same eventtime attribute for the watermark and the window) is what my code does in reality. Sorry to confuse people. On 5/14/19 4:14 PM, suket arora wrote: > Hi Joe, > As per the spark

Re: Handling of watermark in structured streaming

2019-05-14 Thread Anastasios Zouzias
Hi Joe, How often do you trigger your mini-batch? Maybe you can specify the trigger time explicitly with a low value or, even better, leave it off. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers Best, Anastasios On Tue, May 14, 2019 at 3:49 PM Joe
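The trigger options referenced by the linked guide can be sketched as follows. The query shape (`df`, console sink) is assumed; only the trigger settings are from the thread.

```scala
import org.apache.spark.sql.streaming.Trigger

// Default (trigger left off): a new micro-batch starts as soon as
// the previous one finishes
df.writeStream
  .format("console")
  .start()

// Explicit low trigger interval: a micro-batch at most once per second
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()
```

With the default trigger, watermark advancement is bounded only by batch completion time, which is why a short or unset interval can make late-data handling appear more responsive.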

Re: Handling of watermark in structured streaming

2019-05-14 Thread suket arora
Hi Joe, As per the spark structured streaming documentation and I quote "withWatermark must be called on the same column as the timestamp column used in the aggregate. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, as watermark is defined
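The rule quoted from the documentation can be made concrete with a minimal sketch. The 5-minute window duration is an assumption for illustration; the column names `time` and `time2` are from the quoted documentation example.

```scala
import org.apache.spark.sql.functions.{col, window}

// Invalid in Append output mode: watermark is defined on "time",
// but the aggregation is keyed on a different column, "time2"
df.withWatermark("time", "1 min")
  .groupBy(window(col("time2"), "5 minutes"))
  .count()

// Valid: the watermark column is the same event-time column
// used inside the window expression
df.withWatermark("time", "1 min")
  .groupBy(window(col("time"), "5 minutes"))
  .count()
```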

Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi all, I'm fairly new to Spark structured streaming and only starting to develop an understanding of the watermark handling. Our application reads data from a Kafka input topic, and as one of the first steps it has to group incoming messages. Those messages come in bursts, e.g. 5 messages