Hi Gregory,

IMO, that would be a viable approach. You have to ensure that all operators (except the sources) have the same UIDs and state types in both jobs, but I guess you don't want to change the application logic anyway and will just replace the sources. A rough sketch of what I mean follows below.
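This is only a minimal sketch, assuming a CSV-like line format, a made-up CountAggregator, and a placeholder S3 path, not your actual logic. The point is that the pipeline, the state types, and the .uid() calls stay identical in both jobs and only the source line changes:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BootstrapJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bootstrap job: read the historic files from S3. The live job keeps
        // the identical pipeline below and only swaps this source for the
        // stream source (e.g. a Kafka consumer).
        DataStream<String> events = env.readTextFile("s3://my-bucket/historic/");

        events
            .keyBy(new KeySelector<String, String>() {
                @Override
                public String getKey(String line) {
                    return line.split(",")[0];   // same keying logic in both jobs
                }
            })
            .map(new CountAggregator())          // same state types in both jobs
            .uid("count-aggregates")             // identical UID in both jobs
            .print();

        env.execute("bootstrap-historic-data");
    }

    /** Keyed aggregate used in both jobs: counts records per key. */
    public static class CountAggregator extends RichMapFunction<String, Long> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public Long map(String value) throws Exception {
            Long current = count.value();
            long updated = (current == null) ? 1L : current + 1L;
            count.update(updated);
            return updated;
        }
    }
}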
What might be tricky is to perform the savepoint at the right point in time, i.e., after all historic data has been processed but before the job is shut down. You might need to add an idle source that keeps the job running even after all files were read (see the sketch at the end of this mail).

Another challenge could be a seamless handover between the historic and the live data, which also depends on how you persist the historic data. For example, do you know the offset up to which the files are written?

Let me know if you have more questions,
Fabian

2018-01-25 20:52 GMT+01:00 Gregory Fee <g...@lyft.com>:
> Hi group, I want to bootstrap some aggregates based on historic data in S3
> and then keep them updated based on a stream. To do this I was thinking of
> doing something like processing all of the historic data, doing a save
> point, then restoring my program from that save point but with a stream
> source instead. Does this seem like a reasonable approach or is there a
> better way to approach this functionality? There does not appear to be a
> straightforward way of doing it the way I was thinking so any advice would
> be appreciated. Thanks!
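Here is a rough sketch of the idle source I mentioned (the class name and the one-second sleep are just placeholders): it emits no records and only returns when it is cancelled, so the job stays in RUNNING state after the file source has finished.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * Idle source: emits nothing and only returns when cancelled, so the job
 * keeps running after all historic files were read and a savepoint can
 * still be taken.
 */
public class IdleSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // No records are emitted; we just keep the source task alive.
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

You would union it with the historic source, e.g. env.readTextFile(...).union(env.addSource(new IdleSource())), and once all files are processed trigger the savepoint with "flink savepoint <jobId>" or cancel with a savepoint via "flink cancel -s <savepointDir> <jobId>".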