Jay,

Sorry, I didn't explain it very well.  I'm talking about a stream-table
join where the table comes from a compacted topic that is used to populate
a local data store.  As the stream events are processed, they are joined
with dimension data from the local store.
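
A rough sketch of the kind of task I mean, using the low-level StreamTask
API (the class, topic, and store names here are made up, just to
illustrate):

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class EnrichmentTask implements StreamTask, InitableTask {
      // Local store populated from the compacted "dimensions" topic.
      private KeyValueStore<String, String> dimensions;

      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        dimensions = (KeyValueStore<String, String>) context.getStore("dimension-store");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        if ("dimensions".equals(stream)) {
          // Table side: remember the latest value for each dimension key.
          dimensions.put((String) envelope.getKey(), (String) envelope.getMessage());
        } else {
          // Stream side: join each event against the local dimension data.
          String dim = dimensions.get((String) envelope.getKey());
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", "enriched-events"),
              envelope.getKey(),
              envelope.getMessage() + "|" + dim));
        }
      }
    }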

If you want to kick off another version of this job that starts further back
in time, the new job cannot reliably recreate the same state of the local
store that the original had, because old values may have been compacted away.
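
To make it concrete, here is a toy illustration of the t1/t2/t3 timeline
from my earlier message (made-up keys and values, not real Kafka code):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CompactionToy {
      public static void main(String[] args) {
        // The changelog as a list of {offset, key, value} entries.
        List<String[]> log = new ArrayList<String[]>();
        log.add(new String[] {"0", "user-1", "bronze"});  // t1: old dimension value
        log.add(new String[] {"1", "user-2", "silver"});
        log.add(new String[] {"2", "user-1", "gold"});    // t2: key updated

        // t3: compaction keeps only the latest entry for each key.
        Map<String, String[]> latest = new LinkedHashMap<String, String[]>();
        for (String[] e : log) {
          latest.put(e[1], e);
        }

        // Only offsets 1 and 2 survive; user-1=bronze at offset 0 is gone,
        // so a job "rewound" to t1 after compaction cannot rebuild the state
        // the original job had at that point.
        for (String[] e : latest.values()) {
          System.out.println(e[0] + " " + e[1] + "=" + e[2]);
        }
      }
    }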

Does that make sense?

Roger

On Fri, Feb 20, 2015 at 2:52 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey Roger,
>
> I'm not sure if I understand the case you are describing.
>
> As Chris says, we don't yet give you fine-grained control over when history
> starts to disappear (though we designed with the intention of making that
> configurable later). However, I'm not sure you need that for the case you
> describe.
>
> Say you have a job J that takes inputs I1...IN and produces output O1...ON
> and in the process accumulates state in a topic S. I think the approach is
> to launch a J' (changed or improved in some way) that reprocesses I1...IN
> from the beginning of time (or some past point) into O1'...ON' and
> accumulates state in S'. So the state for J and the state for J' are
> totally independent. J' can't reuse J's state in general because the code
> that generates that state may have changed.
>
> -Jay
>
> On Thu, Feb 19, 2015 at 9:30 AM, Roger Hoover <roger.hoo...@gmail.com>
> wrote:
>
> > Chris + Samza Devs,
> >
> > I was wondering whether Samza could support re-processing as described by
> > the Kappa architecture or Liquid (
> > http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf).
> >
> > It seems that a changelog is not sufficient to be able to restore state
> > backward in time.  Kafka compaction will guarantee that local state can
> > be restored from where it left off but I don't see how it can restore
> > past state.
> >
> > Imagine the case where a stream job has a lot of state in its local
> > store but it has not updated any keys in a long time.
> >
> > Time t1: All of the data would be in the tail of the Kafka log (past the
> > cleaner point).
> > Time t2: The job updates some keys.  Now we're in a state where the next
> > compaction will blow away the old values for those keys.
> > Time t3: Compaction occurs and old values are discarded.
> >
> > Say we want to launch a re-processing job that would begin from t1.  If
> > we launch that job before t3, it will correctly restore its state.
> > However, if we launch the job after t3, it will be missing old values,
> > right?
> >
> > Unless I'm misunderstanding something, the only way around this is to
> > keep snapshots in addition to the changelog.  Has there been any
> > discussion of providing an option in Samza of taking RocksDB snapshots
> > and persisting them to an object store or HDFS?
> >
> > Thanks,
> >
> > Roger
> >
>
