Hi Jacek:
Thanks for your response.
I am just trying to understand the fundamentals of watermarking and how it
behaves in aggregation vs non-aggregation scenarios.
On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski <[email protected]>
wrote:
Hi,
What would you expect? The data is simply dropped as that's the purpose of
watermarking it. That's my understanding at least.
Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Mon, Feb 5, 2018 at 8:11 PM, M Singh <[email protected]> wrote:
Just checking if anyone has more details on how the watermark works in cases
where the event time is earlier than the processing timestamp.
On Friday, February 2, 2018 8:47 AM, M Singh <[email protected]> wrote:
Hi Vishnu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.
Vishnu - Spark documentation
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
does indicate that it can dedup using watermark. So I believe there are more
use cases for watermark and that is what I am trying to find.
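
To make the dedup case concrete, here is a rough plain-Python model of
deduplication bounded by a watermark. It is only a sketch under assumptions,
not Spark code: the function name `dedup_with_watermark` is made up for
illustration, the watermark is advanced per record (Spark advances it between
micro-batches), and records are keyed by id alone.

```python
from datetime import datetime, timedelta

def dedup_with_watermark(events, delay):
    """Toy model: events is a list of (event_id, event_time) in arrival order."""
    seen = {}              # dedup state: event_id -> event_time
    max_event_time = None  # largest event time observed so far
    out = []
    for event_id, event_time in events:
        watermark = None if max_event_time is None else max_event_time - delay
        if watermark is not None and event_time < watermark:
            continue  # older than the watermark: dropped as too late
        if event_id in seen:
            continue  # duplicate of a record still held in state
        seen[event_id] = event_time
        out.append((event_id, event_time))
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Evict state that can no longer match a record the watermark admits.
        watermark = max_event_time - delay
        seen = {k: v for k, v in seen.items() if v >= watermark}
    return out

t0 = datetime(2018, 2, 2, 12, 0)
events = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=1)),   # duplicate id, still in state
    ("b", t0 + timedelta(minutes=20)),  # moves the watermark to 12:10
    ("c", t0),                          # behind the watermark
]
result = dedup_with_watermark(events, timedelta(minutes=10))
print(result)
```

The point of the model is that the watermark does two jobs here: it bounds how
long ids are remembered, and it rejects records that arrive after their state
would already have been evicted.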
I am hoping that TD can clarify or point me to the documentation.
Thanks
On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath
<[email protected]> wrote:
Hi Mans,
A watermark in Spark is used to decide when to clear the state, so if an event
is delayed beyond the point at which Spark clears the state, it will be
ignored. I recently wrote a blog post on this:
http://vishnuviswanath.com/spark_structured_streaming.html#watermark
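
The behaviour described above can be sketched in plain Python. This is a
simplified model, not Spark's implementation: the class name
`WatermarkTracker` is hypothetical, and it assumes the watermark is the
maximum event time seen in earlier batches minus the allowed delay, applied
to the next batch.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Toy model of an event-time watermark (illustrative only)."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None  # largest event time seen so far

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, event_times):
        """Split a batch into (kept, dropped), then advance the watermark."""
        wm = self.watermark  # the watermark from earlier batches applies here
        kept = [t for t in event_times if wm is None or t >= wm]
        dropped = [t for t in event_times if wm is not None and t < wm]
        if event_times:
            batch_max = max(event_times)
            if self.max_event_time is None or batch_max > self.max_event_time:
                self.max_event_time = batch_max
        return kept, dropped

tracker = WatermarkTracker(delay=timedelta(minutes=10))
t0 = datetime(2018, 1, 31, 12, 0)
tracker.process_batch([t0, t0 + timedelta(minutes=5)])  # watermark becomes 11:55
kept, dropped = tracker.process_batch(
    [t0 - timedelta(minutes=10), t0 + timedelta(minutes=6)]
)
print(dropped)  # the 11:50 event is behind the 11:55 watermark
```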
Yes, this state applies to aggregation only. If you only have a map function
and don't want to process late records, you could filter on the record's
EventTime field, but I guess you will have to compare it with the processing
time, since there is no API for the user to access the watermark.
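
A minimal sketch of that filter in plain Python, comparing each record's event
time with the processing time. The names `drop_lagging`, `event_time` and
`max_lag` are assumptions made for illustration; in a streaming DataFrame the
same comparison would presumably be a `filter` on the event-time column
against `current_timestamp()`.

```python
from datetime import datetime, timedelta, timezone

def drop_lagging(records, max_lag, now=None):
    """Keep only records whose event time is within max_lag of processing time."""
    now = now or datetime.now(timezone.utc)  # processing time
    cutoff = now - max_lag
    return [r for r in records if r["event_time"] >= cutoff]

now = datetime(2018, 1, 31, 6, 37, tzinfo=timezone.utc)
records = [
    {"id": 1, "event_time": now - timedelta(minutes=2)},   # recent enough
    {"id": 2, "event_time": now - timedelta(minutes=30)},  # lagging
]
fresh = drop_lagging(records, max_lag=timedelta(minutes=10), now=now)
print([r["id"] for r in fresh])
```

Unlike a watermark, this cutoff follows the wall clock rather than the data,
so a stalled or replayed source would see all of its records filtered out.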
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh <[email protected]> wrote:
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark API applicable to this scenario (i.e., filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the documentation, which only refers to its usage with
aggregation.
Thanks
Mans