Hi Garry, First of all thank for the link.
After skimming through the archive and reading your reply, I think I misunderstood what are the streams in wikipedia system. So, in hello-samza, the WikipediaConsumer produces some wikipedia.* streams which are read by Wikipedia feed task. Wikipedia feed task then write messages from those wikipedia.* streams to kafka.wikipedia-raw. So my first impression of the flow of the system is: (Data file) => (wikipedia.* streams) => (kafka.wikipedia-raw) where wikipedia.* streams are something "physical", something like the kafka streams. But now after thinking more about it, I guess the system works as follow: WikipediaConsumer listens to the IRC feed, queue the messages it receives in a in-memory queue (or something in-memory). The Samza periodically polls WikipediaConsumer for messages for each of the wikipedia.* streams, sends the messages to Wikipedia feed task. Wikipedia feed task writes messages to kafka stream. In a way, those wikipedia streams are just names that the Samza uses to request WikipediaConsumer for messages. How to messages are buffered/queued/stored before they are polled by Samza is up to WikipediaConsumer. Then if my understanding is correct, and according to Chris' reply in that thread. I have 2 options: 1) Keep it as it is now and run the Wikipedia feed task as a Samza job 2) Implement a Kafka producer that reads from external source and write to a Kafka stream/topic. This will purely just an implementation on Kafka and has nothing to do with Samza. Correct me if I'm wrong. And thank you so much for the reply. Casey On Thu, Feb 27, 2014 at 1:12 PM, Garry Turkington < [email protected]> wrote: > Hi Casey, > > Here's a link to a response from Chris on that thread which I think gives > the high-level picture pretty well: > > > http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201310.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179B2159F%40ESV4-MBX02.linkedin.biz%3E > > Basically it's a choice between building a system client within Samza that > pulls the external data or having an external process that pushes the data > into a Kafka topic that is then the input to a Samza task. Which is most > appropriate will depend on the nature of the data source plus other > considerations such as if the raw data will have other consumers. > > Though I'm not quite sure I follow what you mean re your implemented > custom system? If you look at the source for the Wikipedia feed task it > reads from the external source (Wikipedia) then writes the output message > directly to Kafka (the call to collector.send) -- isn't this what you want? > > Regards, > Garry > > -----Original Message----- > From: Anh Thu Vu [mailto:[email protected]] > Sent: 27 February 2014 11:32 > To: [email protected] > Subject: Read from a file and write to a Kafka stream with Samza > > Hi everyone, > > As the subject said, I want to read from external source (simplest case is > to read from a file) and write to a Kafka stream. Then another StreamTask > will start reading from the Kafka stream. > > I've succeeded running HelloSamza, write a similar app (VERY similar) that > have a custom Consumer reading from a file then write to a custom system > (i.e.: testsystem.mystream). Then I have a StreamTask that read from this > custom stream, and write to a kafka stream. However, I want to bypass the > custom stream and write the message from the external source directly to > the Kafka stream. > > I guess I will have to implement the SystemFactory such that it will > return a Kafka producer for the getProducer() method but I'm not very sure > how to yet. Although I basically welcome & appreciate very much all > guides/advises/suggestions, my main objective of this mail is to ask for > the link to the thread "Writing a simple KafkaProducer in Samza" that was > mentioned in > > > https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201311.mbox/%3CEA1B8C30F3B4C34EBA71F2AAF48F5990D612E028%40Mail3.impetus.co.in%3E > > Thank you very much, > Casey > > ----- > No virus found in this message. > Checked by AVG - www.avg.com > Version: 2014.0.4259 / Virus Database: 3705/7127 - Release Date: 02/26/14 >
