Hey Joe, I'm going to do answers in-line.
On 1/13/14 11:06 AM, "Joe Stein" <[email protected]> wrote:

>Hello, I was wondering what different system(s) folks were using as a
>final resting place for data streamed and processed through Samza, and
>how they were getting it there?
>
>So are folks having Samza save the final streamed state of a job to a
>Kafka topic, and then have a Kafka consumer connected to that topic which
>fetches those results and pushes them to another system? Or are they
>plugging those systems into the end of the Samza job directly?

We tend to have a mix of styles. For some Samza jobs, the final output is
a Kafka topic, which then gets consumed by some downstream system (search,
realtime OLAP, etc.). We also have some Samza jobs that write directly to
either a database or a web service.

For the database-output jobs, we currently use the database's client
directly from the StreamTask, but we're considering adopting a model where
we'd write SystemProducers that actually write to the database
under-the-hood. The primary advantage of this approach is that it would be
an easy way to use threads and get async writes to the database--we'd only
have to make sure everything is flushed when Samza is committing its
offsets.

>Also, what system(s) are folks using to store their aggregate counts for
>use (assuming counting calculation streams), or systems for non-counting
>calculations? In either case, both for querying by other systems
>afterwards.

The kinds of systems that we materialize to tend to look like these:

* A realtime OLAP system.
* A metrics/monitoring system (uses RRDs under the hood).
* A social graph system.
* A search system.
* A distributed database.

For realtime OLAP/counting, you might have a look at Druid, which already
has a Kafka ingestion point built in.
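To make the async-SystemProducer idea concrete, here's a rough sketch of the pattern: hand each write to a thread pool, and make flush() block until every pending write has landed, so offsets only get committed once the database actually has the data. Note that AsyncDbProducer and DbClient below are hypothetical stand-ins, not Samza or LinkedIn code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncDbProducer {
    /** Stand-in for whatever database client you actually use; not a Samza API. */
    public interface DbClient { void write(String key, String value); }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final List<Future<?>> pending = new ArrayList<>();
    private final DbClient db;

    public AsyncDbProducer(DbClient db) { this.db = db; }

    // Called once per outgoing message; hands the write to the pool and returns.
    public synchronized void send(String key, String value) {
        pending.add(pool.submit(() -> db.write(key, value)));
    }

    // Called when Samza commits offsets; blocks until every pending write
    // lands, so a committed offset always implies the database has the data.
    public synchronized void flush() {
        for (Future<?> f : pending) {
            try {
                f.get(); // surfaces any write failure before offsets commit
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException("async write failed", e);
            }
        }
        pending.clear();
    }

    public void stop() { pool.shutdown(); }
}
```

A real SystemProducer would wrap something like this behind Samza's producer interface, but the flush-before-commit contract is the important part.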
If the system you're looking at doesn't have a Kafka ingestion point built
in, you have to decide whether to write a SystemProducer and have your
StreamTask write to the system directly, or to write a shim outside of
Samza that reads from the Kafka topic and writes to the destination
system. I think the proper solution depends on your use case. One argument
for putting the writer outside of Samza would be if you wanted
write-locality with the destination system (co-locate the writer with the
DB it's writing to). You'll have to think through your use case. Both
styles work.

>Thanks in advance.
>
>/*******************************************
> Joe Stein
> Founder, Principal Consultant
> Big Data Open Source Security LLC
> http://www.stealth.ly
> Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
>********************************************/
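P.S. For comparison with the SystemProducer style, the "shim outside of Samza" style is basically a small standalone consumer loop. The sketch below uses hypothetical TopicReader and Sink interfaces standing in for a real Kafka consumer and destination client:

```java
import java.util.List;

// Sketch of the "shim outside Samza" style: a small standalone process
// that polls the job's output topic and writes each record to the
// destination system. TopicReader and Sink are hypothetical stand-ins
// for your real Kafka consumer and destination client.
public class TopicShim {
    public interface TopicReader { List<String> poll(); }  // one batch per call
    public interface Sink { void write(String message); }  // destination client

    private final TopicReader reader;
    private final Sink sink;

    public TopicShim(TopicReader reader, Sink sink) {
        this.reader = reader;
        this.sink = sink;
    }

    // Drain one batch; a real shim would loop forever and commit its Kafka
    // offsets only after the destination writes succeed.
    public int drainOnce() {
        List<String> batch = reader.poll();
        for (String message : batch) {
            sink.write(message);
        }
        return batch.size();
    }
}
```

Because this process is just an ordinary Kafka consumer, you can deploy it wherever you like, which is what gives you the write-locality option mentioned above.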
