>
> 1. What happens if an event arrives a few days late? It looks like we
> have an unbounded table with sorted time intervals as keys, but I assume
> Spark doesn't keep several days' worth of data in memory. Rather, it would
> checkpoint parts of the unbounded table to storage at a specified
> interval, such that if an event comes a few days late it would update the
> part of the table that is in memory plus the parts in storage that contain
> the interval (again, this is just my assumption; I don't know what it
> really does). Is this correct so far?
>

The state we need to keep is unbounded unless you specify a watermark.
The watermark tells us how long to wait for late data to arrive, and thus
lets us bound the amount of state that we keep in memory.  Since we purge
state for aggregations that fall below the watermark, we must also drop
data that arrives even later than your specified watermark (if any).  Note
that the watermark is calculated from observed event times, not from the
wall-clock time of processing, so we should be robust to cases where the
stream is down for extended periods of time.
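To make those semantics concrete, here is a toy simulation of watermark-based
state purging in plain Python.  This is not Spark's implementation, just a
sketch of the behavior described above: the watermark trails the maximum
observed event time by a fixed delay, windows entirely below the watermark are
finalized and purged from in-memory state, and events that arrive below the
watermark are dropped.

```python
from collections import defaultdict

def run_with_watermark(events, window_size, delay):
    """Toy simulation of event-time tumbling-window counts with a watermark.

    events: iterable of integer event-time timestamps (e.g. seconds).
    window_size: tumbling window length, same units as the timestamps.
    delay: watermark delay ("how long to wait for late data").
    """
    open_windows = defaultdict(int)   # window start -> count (state kept in memory)
    finalized = {}                    # windows purged once below the watermark
    dropped = []                      # events that arrived later than the watermark
    max_event_time = None

    for t in events:
        # The watermark is derived from observed event times, not wall-clock time.
        max_event_time = t if max_event_time is None else max(max_event_time, t)
        watermark = max_event_time - delay

        window_start = (t // window_size) * window_size
        if window_start + window_size <= watermark:
            dropped.append(t)         # window already purged: data is discarded
        else:
            open_windows[window_start] += 1

        # Purge state for windows that have fallen entirely below the watermark.
        for start in [s for s in open_windows if s + window_size <= watermark]:
            finalized[start] = open_windows.pop(start)

    return open_windows, finalized, dropped
```

For example, with a window of 10, a delay of 10, and events `[5, 12, 25, 3]`,
the arrival of 25 pushes the watermark to 15, so the [0, 10) window is
finalized and the late event at time 3 is dropped rather than updating it.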


> 2. Say I am running a Spark Structured Streaming job for 90 days with a
> window interval of 10 mins and a slide interval of 5 mins. Does the output
> of this job always return the entire history in a table? In other words,
> does the output on the 90th day contain a table of 10-minute time
> intervals from day 1 to day 90? If so, wouldn't that be too big to return
> as an output?
>

This depends on the output mode.  In complete mode, we output the entire
result every time (thus, complete mode probably doesn't make sense for this
use case).  In update mode
<https://issues.apache.org/jira/browse/SPARK-18234>, we output continually
updated estimates of the final answer as the stream progresses (useful if,
for example, you are updating a database).  In append mode (supported in
2.1), we only output finalized aggregations that have fallen below the
watermark.
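The difference between the three modes can be sketched with a small
simulation.  This is only an illustration of the emission semantics, not
Spark's API: each "batch" is a dict of per-window partial counts, windows are
(start, end) tuples, and append mode emits a window once it falls entirely
below the supplied watermark.

```python
def emit(batches, mode, watermark_per_batch=None):
    """Toy illustration of complete / update / append output semantics.

    batches: list of dicts mapping (start, end) window -> count for that batch.
    watermark_per_batch: for append mode, the watermark after each batch;
        a window is emitted once its end falls at or below the watermark.
    """
    result = {}           # running aggregation state across all batches
    already_emitted = set()
    outputs = []
    for i, batch in enumerate(batches):
        for window, count in batch.items():
            result[window] = result.get(window, 0) + count
        if mode == "complete":
            outputs.append(dict(result))                    # entire result table
        elif mode == "update":
            outputs.append({w: result[w] for w in batch})   # only changed rows
        elif mode == "append":
            wm = watermark_per_batch[i]
            done = {w: c for w, c in result.items()
                    if w[1] <= wm and w not in already_emitted}
            already_emitted.update(done)
            outputs.append(done)                            # only finalized rows
    return outputs
```

Complete mode re-emits the whole (potentially 90-day) table every trigger;
update mode emits only the rows that changed; append mode emits each window
exactly once, after the watermark has passed it.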

Relatedly, SPARK-16738
<https://issues.apache.org/jira/browse/SPARK-16738> talks
about making the distributed state store queryable.  With this feature, you
could run your query in complete mode (given enough machines).  Even though
the results are large, you can still interact with the complete results of
the aggregation as a distributed DataFrame.


> 3. For Structured Streaming, is it required to have a distributed storage
> system such as HDFS? My guess would be yes (based on what I said in #1),
> but I would like to confirm.
>

Currently this is the only place that we can write the offset log (which
records what data is in each batch) and the state checkpoints.  I think
it's likely that we'll add support for other storage systems here in the
future.
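For reference, the checkpoint directory is supplied via the
`checkpointLocation` option on the stream writer.  The sketch below assumes a
hypothetical streaming DataFrame named `counts` and a placeholder HDFS path;
only the option name and the `writeStream` chain are real API.

```python
# Hedged sketch: pointing a streaming query's offset log and state
# checkpoints at a distributed filesystem.  `counts` and the path are
# hypothetical placeholders.
query = (
    counts.writeStream
    .outputMode("update")
    .option("checkpointLocation", "hdfs:///checkpoints/my-query")
    .format("console")
    .start()
)
```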


> 4. I briefly heard about watermarking. Are there any pointers where I can
> learn about it in more detail? Specifically, how watermarks could help in
> Structured Streaming and so on.
>

Here are the best docs currently available: https://github.com/apache/spark/pull/15702

We are working on something for the programming guide / a blog post in the
next few weeks.
