Hi,

I was thinking about how best to test Samza jobs. The ability to replay 
streams helps a lot here: by pushing some data into the consumed streams and 
then rewinding them, it is always possible to feed the same data through the 
tasks. So that covers testing against known input data.

But then I started thinking about message format evolution over time, which 
honestly wasn't something I had considered before. My primary use case for 
Samza is pulling apart lots of log files as they arrive, so the obvious thing 
is to push each record/line as a single message. The problem, of course, is 
that as those log formats evolve over time (almost always by having new columns 
added) I need to change both the ingest mechanism and the Samza tasks: 
firstly just so they are not broken by the new format, secondly to actually use 
the additional columns if appropriate.
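
To make the problem concrete, here is a minimal sketch of parsing log lines by 
column name rather than by position, so that an appended column does not break 
an existing task. The column names, the tab separator and the default value are 
all made up for illustration; nothing here is Samza or Avro API.

```python
# Hypothetical column layouts for two versions of the same log format;
# v2 appends a "bytes" column to v1.
COLUMNS_V1 = ["timestamp", "host", "status"]
COLUMNS_V2 = COLUMNS_V1 + ["bytes"]

# Assumed defaults for columns that older records do not carry.
DEFAULTS = {"bytes": "0"}


def parse_line(line, columns, wanted, sep="\t"):
    """Parse one log line and return only the fields the task asked for."""
    values = line.rstrip("\n").split(sep)
    # zip stops at the shorter sequence, so extra trailing columns are ignored.
    record = dict(zip(columns, values))
    # Columns missing from this record fall back to a default (or None).
    return {name: record.get(name, DEFAULTS.get(name)) for name in wanted}


old = parse_line("2014-01-01T00:00:00Z\tweb1\t200",
                 COLUMNS_V1, ["host", "bytes"])
new = parse_line("2014-01-01T00:00:00Z\tweb1\t200\t512",
                 COLUMNS_V2, ["host", "bytes"])
```

This only works if every consumer knows which column list applies to which 
line, which is exactly the bookkeeping a schema-carrying format takes over.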

At that point Avro seems to have a lot of value as a message format. We're 
moving to use it elsewhere in the data backend for the very similar reason that 
it lets us manage schema evolution.

Has anyone gone down this path at all? I guess there are two approaches: either 
have Samza treat the Avro message as a string and have each task parse it and 
extract the fields of interest, or build an Avro serde that delivers an Avro 
record object in the envelope.
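
As a rough sketch of the second approach, the class below has the shape of a 
serde (bytes in, record out) and resolves whatever was written against the 
fields the task expects. It is only an illustration: JSON stands in for Avro's 
binary encoding so the example runs with the standard library alone, and 
RecordSerde, reader_fields and the field names are all hypothetical. A real 
Avro serde would delegate this resolution to the Avro library.

```python
import json


class RecordSerde:
    """Hypothetical serde: bytes on the wire, dict records in the task."""

    def __init__(self, reader_fields):
        # Maps each field the task expects to the default used when absent.
        self.reader_fields = reader_fields

    def to_bytes(self, record):
        return json.dumps(record, sort_keys=True).encode("utf-8")

    def from_bytes(self, data):
        written = json.loads(data.decode("utf-8"))
        # Keep fields the reader knows, fill in defaults for missing ones,
        # and silently drop fields the reader has never heard of.
        return {name: written.get(name, default)
                for name, default in self.reader_fields.items()}


serde = RecordSerde({"host": "unknown", "status": 0})

# A newer producer that already writes an extra "bytes" field.
payload = serde.to_bytes({"host": "web1", "status": 200, "bytes": 512})
record = serde.from_bytes(payload)

# An older producer that does not write "status" yet.
legacy = serde.from_bytes(b'{"host": "web2"}')
```

With this shape the task only ever sees a record with the fields it asked for, 
whereas in the first approach every task repeats the parsing and defaulting 
itself.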

Thanks
Garry

...........................................................................................................................
Garry Turkington | CTO | +44-7871315944 | Skype: GarryTurkington
Improve Digital - Real time advertising technology
A company of PubliGroupe
