few basic questions on structured streaming

kant kodali Thu, 08 Dec 2016 03:53:00 -0800

Hi All,

I read the documentation on Structured Streaming based on event time and I
have the following questions.


1. What happens if an event arrives a few days late? It looks like we have an
unbounded table with sorted time intervals as keys, but I assume Spark doesn't
keep several days' worth of data in memory. Rather, it would checkpoint
parts of the unbounded table to storage at a specified interval, so that
if an event arrives a few days late it would update the part of the table that
is in memory plus the parts of the table in storage that contain
the interval. (Again, this is just my assumption; I don't know what it really
does.) Is this correct so far?

2. Say I am running a Spark Structured Streaming job for 90 days with a
window interval of 10 minutes and a slide interval of 5 minutes. Does the
output of this job always return the entire history in a table? In other
words, does the output on the 90th day contain a table of 10-minute time
intervals from day 1 to day 90? If so, wouldn't that be too big to return as
an output?

3. For Structured Streaming, is it required to have distributed storage
such as HDFS? My guess would be yes (based on what I said in #1), but I
would like to confirm.

4. I briefly heard about watermarking. Are there any pointers where I can
learn about it in more detail? Specifically, how could watermarks help in
Structured Streaming?
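
To put rough numbers on question 2, here is a minimal sketch in plain Python
(not Spark code). It assumes windows are half-open [start, end) and aligned
to the Unix epoch; the names windows_for and window_count are made up for
illustration and are not part of any Spark API.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)   # window interval from question 2
SLIDE = timedelta(minutes=5)     # slide interval from question 2
EPOCH = datetime(1970, 1, 1)     # assumed alignment origin for window starts


def windows_for(event_time):
    """Return (start, end) for every sliding window that contains event_time."""
    # Latest window start at or before the event, aligned to the slide interval.
    start = event_time - ((event_time - EPOCH) % SLIDE)
    result = []
    # Walk backwards while the window [start, start + WINDOW) still covers the event.
    while start > event_time - WINDOW:
        result.append((start, start + WINDOW))
        start -= SLIDE
    return sorted(result)


def window_count(days):
    """Number of distinct window starts that fall inside a run of `days` days."""
    return timedelta(days=days) // SLIDE


if __name__ == "__main__":
    for start, end in windows_for(datetime(2016, 12, 8, 3, 53)):
        print(start, "to", end)
    print(window_count(90), "window starts in 90 days")
```

Under these assumptions each event lands in two overlapping windows, and a
90-day run covers about 25,920 window starts, which is the size of the
per-window table I am asking about (before multiplying by any grouping keys).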

Thanks,
kant
