Hi Charles,
Maybe you can set up data sets and use the TestPipeline to validate (with
PAssert) that your pipeline works as expected.
The data sets can be stored somewhere (a database or filesystem) and loaded
in the tests (basically as we do in the Beam ITs).
Thoughts?
Regards
JB
On 01/03/2018 12:37 AM, Charles Allen wrote:
Hello Beam list!
We are looking at adopting some more advanced use cases with Beam code at
their core, including automated testing and data dependency tracking.
Specifically, I'm interested in things like making sure data changes don't
break pipelines, or the things that depend on pipeline output, especially if
the Beam code isn't managed by the same team that produces the data or the
systems that consume the Beam output.
This becomes more complex if you consider certain runners with non-zero
replacement time doing a rolling or staged restart/upgrade/replacement, which
depend on data producers that ALSO have non-zero replacement time. Are there
any best practices for Beam code management / data dependency management when
the code in /master is not necessarily what is running live in your production
systems? Is it all just "pretend all data is bad and try to be backwards
compatible", or are there any Beam features that help with this?
Thanks,
Charles Allen
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com