Hey Joe, I'm going to do answers in-line.
On 1/13/14 11:06 AM, "Joe Stein" <[email protected]> wrote:

>Hello, I was wondering what different system(s) folks were using as a
>final resting place for data streamed and processed through Samza, and
>how they were getting it there?
>
>So are folks having Samza save the final streamed state of a job to a
>Kafka topic, and then have a Kafka consumer connected to that topic which
>fetches those results and pushes them to another system? Or are they
>plugging those systems into the end of the Samza job directly?

We tend to have a mix of styles. For some Samza jobs, the final output is
a Kafka topic, which then gets consumed by some downstream system (search,
realtime OLAP, etc.). We also have some Samza jobs that write directly to
either a database or a web service.

For the database-output jobs, we currently use the database's client
directly from the StreamTask, but we're considering adopting a model where
we'd write SystemProducers that actually write to the database
under-the-hood. The primary advantage of this approach is that it would be
an easy way to use threads and get async writes to the database--we'd only
have to make sure everything is flushed when Samza is committing its
offsets.

>Also, what system(s) are folks using to store their aggregate counts for
>use (assuming counting calculation streams), or systems for non-counting
>calculations? In either case, both for querying by other systems
>afterwards.

The kinds of systems that we materialize to tend to look like these:

* A realtime OLAP system.
* A metrics/monitoring system (uses RRDs under the hood).
* A social graph system.
* A search system.
* A distributed database.

For realtime OLAP/counting, you might have a look at Druid, which already
has a Kafka ingestion point built in.
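To make the async-SystemProducer idea concrete, here's a rough sketch of the pattern: hand each write to a thread pool, and make flush() block until every pending write has landed, so offsets only get committed once the database actually has the data. Note that AsyncDbProducer and DbClient below are hypothetical stand-ins, not Samza or LinkedIn code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncDbProducer {
    /** Stand-in for whatever database client you actually use; not a Samza API. */
    public interface DbClient { void write(String key, String value); }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final List<Future<?>> pending = new ArrayList<>();
    private final DbClient db;

    public AsyncDbProducer(DbClient db) { this.db = db; }

    // Called once per outgoing message; hands the write to the pool and returns.
    public synchronized void send(String key, String value) {
        pending.add(pool.submit(() -> db.write(key, value)));
    }

    // Called when Samza commits offsets; blocks until every pending write
    // lands, so a committed offset always implies the database has the data.
    public synchronized void flush() {
        for (Future<?> f : pending) {
            try {
                f.get(); // surfaces any write failure before offsets commit
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException("async write failed", e);
            }
        }
        pending.clear();
    }

    public void stop() { pool.shutdown(); }
}
```

A real SystemProducer would wrap something like this behind Samza's producer interface, but the flush-before-commit contract is the important part.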
If the system you're looking at doesn't have a Kafka ingestion point built
in, you have to decide whether to write a SystemProducer and have your
StreamTask write to the system directly, or to write a shim outside of
Samza that reads from the Kafka topic and writes to the destination
system. I think the proper solution depends on your use case. One argument
for putting the writer outside of Samza would be if you wanted
write-locality with the destination system (co-locate the writer with the
DB it's writing to). You'll have to think through your use case. Both
styles work.

>Thanks in advance.
>
>/*******************************************
> Joe Stein
> Founder, Principal Consultant
> Big Data Open Source Security LLC
> http://www.stealth.ly
> Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
>********************************************/
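P.S. For comparison with the SystemProducer style, the "shim outside of Samza" style is basically a small standalone consumer loop. The sketch below uses hypothetical TopicReader and Sink interfaces standing in for a real Kafka consumer and destination client:

```java
import java.util.List;

// Sketch of the "shim outside Samza" style: a small standalone process
// that polls the job's output topic and writes each record to the
// destination system. TopicReader and Sink are hypothetical stand-ins
// for your real Kafka consumer and destination client.
public class TopicShim {
    public interface TopicReader { List<String> poll(); }  // one batch per call
    public interface Sink { void write(String message); }  // destination client

    private final TopicReader reader;
    private final Sink sink;

    public TopicShim(TopicReader reader, Sink sink) {
        this.reader = reader;
        this.sink = sink;
    }

    // Drain one batch; a real shim would loop forever and commit its Kafka
    // offsets only after the destination writes succeed.
    public int drainOnce() {
        List<String> batch = reader.poll();
        for (String message : batch) {
            sink.write(message);
        }
        return batch.size();
    }
}
```

Because this process is just an ordinary Kafka consumer, you can deploy it wherever you like, which is what gives you the write-locality option mentioned above.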
