Just checking if anyone has more details on how the watermark works in cases where
the event time is earlier than the processing timestamp.
On Friday, February 2, 2018 8:47 AM, M Singh <[email protected]> wrote:
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.
Vishnu - Spark documentation
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
does indicate that it can deduplicate using a watermark. So I believe there are
more use cases for the watermark, and that is what I am trying to find out.
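For what it's worth, the dedup-with-watermark behavior that the guide describes can be simulated in plain Python. This is only a sketch of the semantics, not the Spark API: the class name, the 10-minute delay, and the keys in the usage below are all assumptions for illustration.

```python
from datetime import datetime, timedelta

class DedupWithWatermark:
    """Simulates dropDuplicates + withWatermark semantics: a duplicate key
    is dropped while its state is alive; the state for a key is cleared
    once the watermark passes that key's event time."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None   # highest event time seen so far
        self.state = {}              # key -> event time of first occurrence

    @property
    def watermark(self):
        # Watermark trails the maximum event time by the allowed delay.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, key, event_time):
        """Return True if the record is emitted, False if dropped as a duplicate."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        wm = self.watermark
        # Clear state that has fallen behind the watermark.
        self.state = {k: t for k, t in self.state.items() if t >= wm}
        if key in self.state:
            return False  # duplicate within the live state
        self.state[key] = event_time
        return True
```

Note that once the watermark passes a key's event time, a re-appearing copy of that key is emitted again, because Spark no longer holds state to recognize it as a duplicate.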
I am hoping that TD can clarify or point me to the documentation.
Thanks
On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath
<[email protected]> wrote:
Hi Mans,
The watermark in Spark is used to decide when to clear the state, so if an event
is delayed beyond the point when its state has been cleared by Spark, it will be
ignored. I recently wrote a blog post on this:
http://vishnuviswanath.com/spark_structured_streaming.html#watermark
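That rule can be sketched in a few lines of plain Python (again a simulation of the semantics, not the Spark API; the delay value in the usage is an assumption): the watermark trails the maximum event time seen so far by the allowed delay, and any event older than the watermark is ignored because its state is already gone.

```python
from datetime import datetime, timedelta

def update_watermark(max_event_time, new_event_time, delay):
    """Advance the running max event time and recompute the watermark:
    watermark = (max event time seen so far) - (allowed delay)."""
    max_event_time = max(max_event_time, new_event_time)
    return max_event_time, max_event_time - delay

def is_dropped(event_time, watermark):
    """An event older than the watermark arrives after its state was cleared."""
    return event_time < watermark
```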
Yes, this state is applicable to aggregations only. If you have only a map
function and don't want to process late records, you could filter based on the
EventTime field, but I guess you will have to compare it with the processing
time, since there is no user-facing API to access the watermark.
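That filter suggestion can be sketched in plain Python as below. The `ALLOWED_LAG` threshold is an assumption, and in a real streaming job the processing time would come from the executor's clock rather than being passed in explicitly.

```python
from datetime import datetime, timedelta

ALLOWED_LAG = timedelta(minutes=10)  # assumed threshold for "too late"

def keep_record(event_time: datetime, processing_time: datetime) -> bool:
    """Keep a record only if its event time lags the processing time
    by at most ALLOWED_LAG; drop anything further behind."""
    return processing_time - event_time <= ALLOWED_LAG
```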
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh <[email protected]> wrote:
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark API applicable to this scenario (i.e., filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the documentation, which only refers to its usage with
aggregation.
Thanks
Mans