Take a look at the drain, update, and in-place snapshot proposal threads; they make it easier to address some of your problems:

https://lists.apache.org/thread.html/37c9aa4aa5011801a7f060bf3f53e687e539fa6154cc9f1c544d4f7a@%3Cdev.beam.apache.org%3E
https://lists.apache.org/thread.html/3cfbd650a46327afc752a220b20a6081570000725c96541c21265e7b@%3Cdev.beam.apache.org%3E
JB is correct in pointing out that good testing will help a lot, and TestPipeline/PAssert are the right places to look there (a sketch is at the end of this mail).

I worked on a team that was in a similar position: 100+ upstream systems producing data, 100+ downstream systems consuming it, and we were the one in the middle that had to consume parts of the data, pass most of it through, and add more data to it. The things that were important were to:

* use a technology that allows for arbitrary structured data (like JSON)
* make existing data immutable; you can only append new data
* have each piece/group of data owned by its producer, and have the producer publish a schema
* use the schemas to validate structured data as soon as it enters your domain, and dead letter queue anything that is bad, allowing you to fix the data or your pipeline and inform the upstream producer (a minimal sketch of this pattern follows after the quoted thread below)

On Tue, Jan 2, 2018 at 9:22 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Charles,
>
> Maybe you can set up data sets and use the TestPipeline to validate (with
> PAssert) that your pipeline works as expected.
>
> The data sets can be stored somewhere (database or filesystem) and loaded
> in tests (basically as we do in the Beam ITs).
>
> Thoughts?
>
> Regards
> JB
>
> On 01/03/2018 12:37 AM, Charles Allen wrote:
>
>> Hello Beam list!
>>
>> We are looking at adopting some more advanced use cases with Beam code at
>> its core, including automated testing and data dependency tracking.
>>
>> Specifically, I'm interested in things like making sure data changes don't
>> break pipelines, or things that depend on pipeline output, especially if
>> the Beam code isn't managed by the same team that produces the data or
>> the systems that consume the Beam output.
>>
>> This becomes more complex if you consider certain runners with non-zero
>> replacement time doing a rolling or staged restart/upgrade/replacement that
>> depend on data producers that ALSO have non-zero replacement time. Are
>> there any best practices for Beam code management / data dependency
>> management when the code in /master is not necessarily what is running live
>> in your production systems? Is it all just "pretend all data is bad and try
>> to be backwards compatible", or are there any Beam features that help with
>> this?
>>
>> Thanks,
>> Charles Allen
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
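
To make the dead letter idea concrete, here is a minimal sketch of the pattern in the Beam Java SDK, using a multi-output ParDo: records that pass validation go to the main output, and anything that fails is tagged for a dead letter sink. The class name and the isValid check are placeholders I made up; in a real pipeline the check would be actual schema validation (JSON Schema, Avro, etc.) against the producer-published schema.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class DeadLetterExample {
  // Main output: records that passed schema validation.
  public static final TupleTag<String> VALID = new TupleTag<String>() {};
  // Side output: records that failed validation, destined for the DLQ.
  public static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  public static PCollectionTuple validate(PCollection<String> raw) {
    return raw.apply("ValidateAgainstSchema",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            if (isValid(c.element())) {
              c.output(c.element());              // main output
            } else {
              c.output(DEAD_LETTER, c.element()); // dead letter output
            }
          }
        }).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));
  }

  // Placeholder check; a real pipeline would validate against the
  // producer-published schema here.
  static boolean isValid(String record) {
    return record.startsWith("{") && record.endsWith("}");
  }
}

Downstream you would write results.get(DEAD_LETTER) to a topic or table where the bad records can be inspected, fixed, and replayed, and where the upstream producer can be pointed at them.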
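And to tie that back to JB's suggestion, a sketch of how TestPipeline and PAssert can pin down that behavior with a canned data set (this reuses the hypothetical DeadLetterExample above, and the record contents are made up):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.junit.Rule;
import org.junit.Test;

public class ValidateTest {
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void routesGoodAndBadRecords() {
    // In practice the inputs would be loaded from a checked-in data set,
    // as JB describes, rather than hard-coded here.
    PCollection<String> input =
        p.apply(Create.of("{\"id\": 1}", "not json at all"));
    PCollectionTuple results = DeadLetterExample.validate(input);

    PAssert.that(results.get(DeadLetterExample.VALID))
        .containsInAnyOrder("{\"id\": 1}");
    PAssert.that(results.get(DeadLetterExample.DEAD_LETTER))
        .containsInAnyOrder("not json at all");

    p.run().waitUntilFinish();
  }
}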