Hi Gregory,

IMO, that would be a viable approach.
You have to ensure that all operators (except the sources) have the same
UIDs and state types, but I guess you don't want to change the application
logic anyway and just want to replace the sources.
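
For illustration, here is a rough sketch (untested; class names, paths,
topic, and properties are placeholders, and MyStatefulAggregate stands for
your own aggregation logic): the stateful operator keeps the same uid in
the bootstrap variant and the live variant, and only the source is swapped.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class AggregateJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    boolean bootstrap = args.length > 0 && "--bootstrap".equals(args[0]);

    DataStream<String> events;
    if (bootstrap) {
      // bounded historic data from S3
      events = env.readTextFile("s3://my-bucket/historic/");
    } else {
      // unbounded live data, here assumed to come from Kafka
      Properties props = new Properties();
      props.setProperty("bootstrap.servers", "localhost:9092");
      props.setProperty("group.id", "aggregates");
      events = env.addSource(
          new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props));
    }

    events
        .keyBy(line -> line.split(",")[0])
        .map(new MyStatefulAggregate()) // your aggregation logic, unchanged
        .uid("aggregate")               // must be identical in both variants
        .print();

    env.execute("aggregates");
  }
}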

What might be tricky is to take the savepoint at the right point in time,
i.e., after all historic data has been processed but before the job shuts
down. You might need to add an idle source that ensures the job keeps
running even after all files have been read.
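
Such an idle source could look roughly like this (just a sketch): it never
emits a record and only returns when it is cancelled, so unioning it with
the file stream keeps the job alive after the files are exhausted, leaving
time to trigger the savepoint.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class IdleSource implements SourceFunction<String> {
  private volatile boolean running = true;

  @Override
  public void run(SourceContext<String> ctx) throws Exception {
    // emit nothing, just keep the source (and hence the job) running
    while (running) {
      Thread.sleep(1000);
    }
  }

  @Override
  public void cancel() {
    running = false;
  }
}

// usage: env.readTextFile(...).union(env.addSource(new IdleSource()))
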
Another challenge could be a seamless handover between historic and live
data, which also depends on how you persist the historic data. For example,
do you know the offset up to which the files have been written?
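
If the live data comes from Kafka and you record, while writing the files,
the last offset per partition that the files cover, you can start the
consumer exactly at the first record that is not in the files. A sketch,
building on the consumer above (topic, partitions, and offsets are
placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

Map<KafkaTopicPartition, Long> startOffsets = new HashMap<>();
// offsets to start reading from, i.e. the first offsets NOT covered by the files
startOffsets.put(new KafkaTopicPartition("events", 0), 42_001L);
startOffsets.put(new KafkaTopicPartition("events", 1), 41_338L);

FlinkKafkaConsumer011<String> liveSource =
    new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props);
liveSource.setStartFromSpecificOffsets(startOffsets);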

Let me know if you have more questions,
Fabian

2018-01-25 20:52 GMT+01:00 Gregory Fee <g...@lyft.com>:

> Hi group, I want to bootstrap some aggregates based on historic data in S3
> and then keep them updated based on a stream. To do this I was thinking of
> processing all of the historic data, taking a savepoint, and then restoring
> my program from that savepoint but with a stream source instead. Does this
> seem like a reasonable approach, or is there a better way to achieve this
> functionality? There does not appear to be a straightforward way of doing it
> the way I was thinking, so any advice would be appreciated. Thanks!
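
For reference, the workflow described above maps to the regular savepoint
commands of the CLI (a sketch; job id, paths, and jar name are placeholders).
Since the savepoint of the bootstrap job contains state for the file source
that no longer exists in the streaming job, the restore probably needs
--allowNonRestoredState:

# take a savepoint and cancel the bootstrap job once all files are processed
bin/flink cancel -s s3://my-bucket/savepoints/ <jobId>

# start the streaming variant of the job from that savepoint
bin/flink run -s s3://my-bucket/savepoints/savepoint-xyz -n aggregates.jar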
