Hi,

I've been thinking about how best to test Samza jobs. The ability to replay streams helps a lot here: by pushing some data into the consumed streams and then rewinding, it is always possible to feed the same data through the tasks. That makes it much easier to test against known input data.
But then I started thinking about message format evolution over time, which honestly wasn't something I had considered before. My primary use case for Samza is pulling apart lots of log files as they arrive, so the obvious thing is to push each record/line as a single message. The problem, of course, is that as those log formats evolve (almost always by having new columns added), I need to change both the ingest mechanism and the Samza tasks: first just so they aren't broken by the new format, and second to actually use the additional columns where appropriate.

At that point Avro seems to have a lot of value as a message format. We're moving to it elsewhere in the data backend for very similar reasons, namely its ability to manage schema evolution.

Has anyone gone down this path? I guess there are two approaches: either have Samza treat the Avro message as a string and have each task parse it and extract the fields of interest, or build an Avro serde that delivers an Avro record object in the envelope.

Thanks,
Garry

Garry Turkington | CTO | Improve Digital
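
P.S. To make the format-evolution problem concrete, here is a rough sketch of the behaviour I'd want from whichever approach we pick. It is plain Java, not the Avro or Samza API, and the names (LogLineResolver, resolve, the column lists) are all made up for illustration. The idea is that the task declares the fields it cares about, each with a default, and a line written with either the old or the new column set resolves cleanly. As I understand it, this is roughly what Avro's writer/reader schema resolution would give us.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, not the Avro or Samza API.
class LogLineResolver {

    /**
     * Maps one tab-separated log line onto the fields a task expects.
     *
     * writerColumns:  column names, in order, for the format version that wrote the line.
     * readerDefaults: fields the task wants, each with the default to use when
     *                 the writer's format did not include that column.
     */
    static Map<String, String> resolve(String line,
                                       List<String> writerColumns,
                                       Map<String, String> readerDefaults) {
        String[] values = line.split("\t", -1); // -1 keeps trailing empty columns
        if (values.length != writerColumns.size()) {
            throw new IllegalArgumentException(
                "expected " + writerColumns.size() + " columns, got " + values.length);
        }

        // Name each value using the writer's column list.
        Map<String, String> written = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            written.put(writerColumns.get(i), values[i]);
        }

        // Keep only the fields the task asked for; unknown columns are ignored,
        // missing ones fall back to the task's default.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : readerDefaults.entrySet()) {
            result.put(field.getKey(),
                       written.getOrDefault(field.getKey(), field.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> v1 = List.of("timestamp", "url");            // old log format
        List<String> v2 = List.of("timestamp", "url", "country"); // new column added

        Map<String, String> wanted = new LinkedHashMap<>();
        wanted.put("timestamp", "");
        wanted.put("country", "unknown");

        // prints {timestamp=1385000000, country=unknown}
        System.out.println(resolve("1385000000\t/index.html", v1, wanted));
        // prints {timestamp=1385000001, country=GB}
        System.out.println(resolve("1385000001\t/about.html\tGB", v2, wanted));
    }
}
```

With something like this the task doesn't break when a column is added, and it picks up the new column as soon as it asks for it. The open question is whether that logic lives in each task or once in a serde.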
