Hi Charles,

Maybe you can set up data sets and use TestPipeline to validate (with PAssert) that your pipeline works as expected.
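
Something like this, just as a sketch (the class name and the transform are placeholders, you would plug in your own transform under test):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.junit.Rule;
import org.junit.Test;

public class MyTransformTest {

  // TestPipeline is a JUnit rule that builds and runs the pipeline for the test
  @Rule
  public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void testTransformOnKnownInput() {
    // known input data set (could also be loaded from storage, see below)
    PCollection<String> input = pipeline.apply(Create.of("a", "b", "c"));

    // stand-in for the transform under test
    PCollection<String> output = input.apply(
        MapElements.into(TypeDescriptors.strings()).via(s -> s.toUpperCase()));

    // PAssert validates the contents of the output PCollection
    PAssert.that(output).containsInAnyOrder("A", "B", "C");

    pipeline.run().waitUntilFinish();
  }
}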

The data sets can be stored somewhere (a database or filesystem) and loaded in the tests (basically as we do in the Beam ITs).
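
For instance (the path and the expected count are made up, you would point this at your own fixture data set and assert on properties you know about it):

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class StoredDataSetTest {

  @Rule
  public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void testPipelineAgainstStoredDataSet() {
    // load the fixture data set from the filesystem (could be GCS, HDFS, a database, ...)
    PCollection<String> input =
        pipeline.apply(TextIO.read().from("/data/fixtures/input.txt"));

    // apply your production transform(s) here, then assert on known properties
    // of the fixture data, e.g. the expected number of records
    PAssert.thatSingleton(input.apply(Count.globally())).isEqualTo(100L);

    pipeline.run().waitUntilFinish();
  }
}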

Thoughts?

Regards
JB

On 01/03/2018 12:37 AM, Charles Allen wrote:
Hello Beam list!

We are looking at adopting some more advanced use cases with Beam code at its core, including automated testing and data dependency tracking.

Specifically I'm interested in things like making sure data changes don't break pipelines, or things that depend on pipeline output, especially if the Beam code isn't managed by the same team that is producing the data or the systems that consume the Beam output.

This becomes more complex if you consider certain runners with non-zero replacement time doing a rolling or staged restart/upgrade/replacement that depend on data producers that ALSO have non-zero replacement time. Are there any best practices for Beam code management / data dependency management when the code in /master is not necessarily what is running live in your production systems? Is it all just "pretend all data is bad and try to be backwards compatible", or are there any Beam features that help with this?

Thanks,
Charles Allen

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
