I think the nuance in Roger's example is that the stream that's being
rewound is an event stream, not a primary data stream. As such, going back
to the earliest offset might only bring you back a week. If you want a
consistent view of that time, you'd want your table join to use the view
as of a week ago.
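
To make that concrete, here's a toy sketch (plain Java, not the Samza API;
all names are made up) of what I mean by joining against the view as of a
point in time: keep timestamped versions of each row and look up by event
time.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VersionedTable<K, V> {
    // key -> (effective-from timestamp -> value)
    private final Map<K, TreeMap<Long, V>> versions = new HashMap<>();

    public void put(K key, long timestamp, V value) {
        versions.computeIfAbsent(key, k -> new TreeMap<Long, V>()).put(timestamp, value);
    }

    // Latest value whose timestamp is <= asOf, or null if none existed yet.
    public V getAsOf(K key, long asOf) {
        TreeMap<Long, V> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, V> entry = history.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }
}

An event replayed from a week back would then call getAsOf(key, eventTime)
instead of reading the latest value.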

On Friday, February 20, 2015, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Jay,
>
> Sorry, I didn't explain it very well.  I'm talking about a stream-table
> join where the table comes from a compacted topic that is used to populate
> a local data store.  As the stream events are processed, they are joined
> with dimension data from the local store.
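>
> Roughly like this (just a sketch -- the store name, topic name, and value
> format are all made up):
>
> import org.apache.samza.config.Config;
> import org.apache.samza.storage.kv.KeyValueStore;
> import org.apache.samza.system.IncomingMessageEnvelope;
> import org.apache.samza.system.OutgoingMessageEnvelope;
> import org.apache.samza.system.SystemStream;
> import org.apache.samza.task.InitableTask;
> import org.apache.samza.task.MessageCollector;
> import org.apache.samza.task.StreamTask;
> import org.apache.samza.task.TaskContext;
> import org.apache.samza.task.TaskCoordinator;
>
> public class EnrichmentTask implements StreamTask, InitableTask {
>   private KeyValueStore<String, String> dimensions;
>
>   @Override
>   @SuppressWarnings("unchecked")
>   public void init(Config config, TaskContext context) {
>     // Local store restored from the compacted changelog topic.
>     dimensions = (KeyValueStore<String, String>) context.getStore("dimension-store");
>   }
>
>   @Override
>   public void process(IncomingMessageEnvelope envelope,
>                       MessageCollector collector,
>                       TaskCoordinator coordinator) {
>     String key = (String) envelope.getKey();
>     String event = (String) envelope.getMessage();
>     // Join the incoming event against dimension data in the local store.
>     String dim = dimensions.get(key);
>     collector.send(new OutgoingMessageEnvelope(
>         new SystemStream("kafka", "enriched-events"), key, event + "|" + dim));
>   }
> }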
>
> If you want to kick off another version of this job that starts back in
> time, the new job cannot reliably recreate the same state of the local
> store that the original had because old values may have been compacted
> away.
>
> Does that make sense?
>
> Roger
>
> On Fri, Feb 20, 2015 at 2:52 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > Hey Roger,
> >
> > I'm not sure if I understand the case you are describing.
> >
> > As Chris says, we don't yet give you fine-grained control over when
> > history starts to disappear (though we designed with the intention of
> > making that configurable later). However, I'm not sure you need that for
> > the case you describe.
> >
> > Say you have a job J that takes inputs I1...IN and produces output
> > O1...ON and in the process accumulates state in a topic S. I think the
> > approach is to launch a J' (changed or improved in some way) that
> > reprocesses I1...IN from the beginning of time (or some past point) into
> > O1'...ON' and accumulates state in S'. So the state for J and the state
> > for J' are totally independent. J' can't reuse J's state in general
> > because the code that generates that state may have changed.
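> >
> > Concretely, J' is just a second job with its own name and its own state
> > topic. A hypothetical config sketch (names made up; keys are from the
> > 0.8/0.9 era, so double-check against your version):
> >
> > # Job J' -- same task code, new job name, new state topic
> > job.name=my-job-v2
> > task.class=com.example.MyTask
> > task.inputs=kafka.input-topic
> > # Reprocess from the beginning of the retained log
> > systems.kafka.streams.input-topic.samza.offset.default=oldest
> > systems.kafka.streams.input-topic.samza.reset.offset=true
> > # State S' lives in its own changelog, independent of J's state S
> > stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
> > stores.my-store.changelog=kafka.my-store-changelog-v2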
> >
> > -Jay
> >
> > On Thu, Feb 19, 2015 at 9:30 AM, Roger Hoover <roger.hoo...@gmail.com>
> > wrote:
> >
> > > Chris + Samza Devs,
> > >
> > > I was wondering whether Samza could support re-processing as described
> > > by the Kappa architecture or Liquid (
> > > http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf).
> > >
> > > It seems that a changelog is not sufficient to be able to restore state
> > > backward in time.  Kafka compaction will guarantee that local state can
> > > be restored from where it left off, but I don't see how it can restore
> > > past state.
> > >
> > > Imagine the case where a stream job has a lot of state in its local
> > > store but has not updated any keys in a long time.
> > >
> > > Time t1: All of the data would be in the tail of the Kafka log (past
> > > the cleaner point).
> > > Time t2: The job updates some keys.  Now we're in a state where the
> > > next compaction will blow away the old values for those keys.
> > > Time t3: Compaction occurs and old values are discarded.
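> > >
> > > (As I understand it, when t3 actually happens is driven by the broker's
> > > log cleaner settings -- these are the 0.8.2-era names, so double-check:)
> > >
> > > log.cleaner.enable=true
> > > log.cleanup.policy=compact
> > > # compaction kicks in once this fraction of the log is dirty (uncompacted)
> > > log.cleaner.min.cleanable.ratio=0.5
> > > # how long delete tombstones survive after compaction
> > > log.cleaner.delete.retention.ms=86400000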
> > >
> > > Say we want to launch a re-processing job that would begin from t1.  If
> > > we launch that job before t3, it will correctly restore its state.
> > > However, if we launch the job after t3, it will be missing old values,
> > > right?
> > >
> > > Unless I'm misunderstanding something, the only way around this is to
> > > keep snapshots in addition to the changelog.  Has there been any
> > > discussion of providing an option in Samza of taking RocksDB snapshots
> > > and persisting them to an object store or HDFS?
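> > >
> > > Something like this is what I have in mind, assuming RocksDB's
> > > checkpoint feature is exposed in the Java binding (a sketch; the paths
> > > and the upload step are made up):
> > >
> > > import org.rocksdb.Checkpoint;
> > > import org.rocksdb.Options;
> > > import org.rocksdb.RocksDB;
> > > import org.rocksdb.RocksDBException;
> > >
> > > public class SnapshotSketch {
> > >   public static void main(String[] args) throws RocksDBException {
> > >     RocksDB.loadLibrary();
> > >     try (Options options = new Options().setCreateIfMissing(true);
> > >          RocksDB db = RocksDB.open(options, "/tmp/state")) {
> > >       // Consistent point-in-time snapshot (mostly hard links to SSTs).
> > >       Checkpoint checkpoint = Checkpoint.create(db);
> > >       checkpoint.createCheckpoint("/tmp/state-snapshot");
> > >       // ...then copy /tmp/state-snapshot to HDFS or an object store,
> > >       // tagged with the changelog offset at which it was taken.
> > >     }
> > >   }
> > > }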
> > >
> > > Thanks,
> > >
> > > Roger
> > >
> >
>
