Hey Nirmal, Your method of comparison makes sense.
For Samza, you just need to setup a kafka system in the configuration to point to your brokers (you can take a look at the wiki-parser.properties file, for an example). As far as playing with state, the best place to look right now is the hello-stageful-world.samsa file in Samza's code base. This is a job config for a job that uses state management. This feature is generally most useful when you're doing joins (ad view stream + ad click stream), sorting (sort messages by time over a 5 minute window, and emit top 10), or aggregation (count page views by member id). For partitioning and parallelism, you just need to make sure that your input streams have a partition count > 1 (either by setting the Kafka broker default, or by manually creating topics with a partition count > 1), and then set yarn.container.count > 1 in your Samza job's config, as well. Cheers, Chris On 10/21/13 12:33 AM, "Nirmal Kumar" <[email protected]> wrote: >Hi Chris, > >Thanks a lot for the information. > >I am comparing Storm + Kafka0.8 vs Samza. > >As part of the initial use case I am using Kafka API to publish messages >to the Kafka Broker say 40k, 50k... >Then using the Storm spout I am consuming these messages. >On the Samza side I am using a similar Kafka Consumer that consumes >messages from Kafka. > >Is this use case fine for comparing Storm + Kafka0.8 vs Samza ? or do >you think any other things that I need to consider? > >Also going forward I am planning to find some use cases where I can >compare Samza's features like State Management, Partitioning and >Parallelism, etc. >Any pointers towards these specific use cases? > >Thanks, >-Nirmal > >-----Original Message----- >From: Chris Riccomini [mailto:[email protected]] >Sent: Thursday, October 17, 2013 10:42 PM >To: [email protected] >Subject: Re: Writing a simple KafkaProducer in Samza > >Hey Nirmal, > >Glad to hear that the hello-samza project is working for you. > >If I understand you correctly, I believe that you're saying you want to >send messages to a Kafka topic, right? As you've pointed out, you can >send messages to Kafka through Samza. You can also send messages to Kafka >directly using the Kafka producer API. Which one you use depends on what >you're code is doing. > >With Samza, typically a task is sending messages to Kafka as a reaction >to some other event (which triggers the process method). For example, in >the Wiki example, we send a message to Kafka whenever an update happens >on Wikipedia (via the IRC channel). In this example, we had to write a >Wikipedia consumer for Samza, which implements the SystemConsumer API in >Samza. This implementation reads messages from the Wikimedia IRC channel. >You can see this implementation here: > > >https://github.com/linkedin/hello-samza/blob/master/samza-wikipedia/src/ma >i >n/java/samza/examples/wikipedia/system/WikipediaConsumer.java > >If you use Samza, you MUST have at least one task.input defined, which >feeds messages to your StreamTask. Out of the box, Samza comes with a >KafkaSystemConsumer implementation. The hello-samza project comes with >the WikipediaSystemConsumer implementation. If you want to react to >messages from another system, or feed, you'd have to implement this >interface (and hopefully contribute it back :). > >The alternative approach would be to just send messages directly to Kafka >using the Kafka API. This approach is more appropriate in cases that >don't fit well with Samza's processing model (e.g. you can't easily >implement the SystemConsumer API, you need to guarantee deployment on a >specific host all the time, etc). For example, if you wanted to read >syslog messages on a specific host, and send them to Kafka, it probably >makes more sense to just write a simple Java main() method that creates a >Kafka producer, polls syslog periodically, and calls producer.send() >whenever a new message appears in the syslog. > >If you can be more specific about what you're doing, I can probably >provide better advice. > >Cheers, >Chris > >On 10/17/13 6:39 AM, "Nirmal Kumar" <[email protected]> wrote: > >>Hi All, >> >>I was referring the hello-samza project as was able to run it >>successfully. >>I was able to run all the jobs and also wrote a consumer task to listen >>to kafka.wikipedia-stats topic. >> >>I now want to write a Samza job that act as a KafkaProducer to >>continuously publishes simple string messages to a topic. >>Just like the WikipediaFeedStreamTask that reads Wikipedia events and >>publishes them to a topic. >>I am not sure of the any value of task.inputs in the config properties >>file? >>The way I think is like a java program publishing string messages to a >>kafka topic. >>How can I write such a Samza job? >> >>Any pointers would be of great help. >> >>Later on I want can read the same messages from a consumer like >>WikipediaParserStreamTask does. >>Referring the hello-samza project I was able to write a Consumer task >>that reads messages from the topic(kafka.wikipedia-stats) by simply >>task.class=samza.examples.wikipedia.task.TestConsumer >>task.inputs=kafka.wikipedia-stats >> >> >>Thanks, >>-Nirmal >> >>________________________________ >> >> >> >> >> >> >>NOTE: This message may contain information that is confidential, >>proprietary, privileged or otherwise protected by law. The message is >>intended solely for the named addressee. If received in error, please >>destroy and notify the sender. Any use of this email is prohibited when >>received in error. Impetus does not represent, warrant and/or >>guarantee, that the integrity of this communication has been maintained >>nor that the communication is free of errors, virus, interception or >>interference. > > >________________________________ > > > > > > >NOTE: This message may contain information that is confidential, >proprietary, privileged or otherwise protected by law. The message is >intended solely for the named addressee. If received in error, please >destroy and notify the sender. Any use of this email is prohibited when >received in error. Impetus does not represent, warrant and/or guarantee, >that the integrity of this communication has been maintained nor that the >communication is free of errors, virus, interception or interference.
