I also considered that. The approach was as follows:

1. Can the existing storm-kafka setup be leveraged?
2. Is there any "proven" open source framework for the same?
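For reference, the micro-batching pattern discussed further down in this thread (buffer small messages until a size threshold, flush as one chunk, and only then ack everything in the batch) can be sketched framework-agnostically. This is a minimal, hypothetical sketch in plain Java, not the storm-kafka or Storm API; all names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the micro-batching pattern from this thread:
 * accumulate small messages until a byte threshold, then flush them as
 * one chunk. In a real topology, flush() is where the file would be
 * closed/uploaded and every contributing tuple acked.
 */
public class MicroBatcher {
    private final long flushBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    public MicroBatcher(long flushBytes) {
        this.flushBytes = flushBytes;
    }

    /** Adds one message; returns true if this call triggered a flush. */
    public boolean add(byte[] message) {
        buffer.add(message);
        bufferedBytes += message.length;
        if (bufferedBytes >= flushBytes) {
            flush();
            return true;
        }
        return false;
    }

    /** Stand-in for "close/upload the chunk, then ack all buffered tuples". */
    private void flush() {
        flushCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    public int getFlushCount() { return flushCount; }
    public long getBufferedBytes() { return bufferedBytes; }
}
```

With 1~2 KB messages and a 500 KB threshold, a batch would hold a few hundred messages before each flush; the key property is that nothing is acked until its chunk is durable.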
Spark looks like the next "best" option, since it keeps the paradigm the same. We also considered Secor (https://github.com/pinterest/secor/blob/master/DESIGN.md), and Streamx (https://github.com/qubole/streamx) looks promising too, with Secor looking more promising.

On Wed, May 11, 2016 at 2:40 PM, Steven Lewis <steven.le...@walmart.com> wrote:

> It sounds like you want to use Spark / Spark Streaming to do that kind of
> batching output.
>
> From: Milind Vaidya <kava...@gmail.com>
> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
> Date: Wednesday, May 11, 2016 at 4:24 PM
> To: "user@storm.apache.org" <user@storm.apache.org>
> Subject: Re: Getting Kafka Offset in Storm Bolt
>
> Yeah. We have some micro-batching in place for other topologies. This one
> is a little ambitious, in the sense that each message is 1~2 KB in size,
> so grouping them into a reasonably sized chunk is necessary, say 500 KB ~
> 1 GB (this is just my guess; I am not sure what S3 supports or what is
> optimal). Once that chunk is uploaded, all of the tuples in it can be
> acked. But isn't that overkill? I guess Storm is not even meant to
> support that kind of use case.
>
> On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <ncle...@gmail.com> wrote:
>
>> You can micro-batch the Kafka contents into a file that's replicated
>> (e.g. on HDFS) and then ack all of the input tuples after the file has
>> been closed.
>>
>> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <kava...@gmail.com> wrote:
>>
>>> In case of failure to upload a file, or disk corruption leading to loss
>>> of the file, we have only the current offset in the Kafka spout, but no
>>> record of which offsets were lost with the file and need to be
>>> replayed. These offsets can be stored externally in ZooKeeper and used
>>> to account for lost data. For them to be saved in ZK, they need to be
>>> available in a bolt.
>>>
>>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <ncle...@gmail.com>
>>> wrote:
>>>
>>>> Why not just ack the tuple once it's been written to a file.
>>>> If your topology fails, the data will be re-read from Kafka; the
>>>> Kafka spout already does this for you. Uploading files to S3 is then
>>>> the responsibility of another job, for example a Storm topology that
>>>> monitors the output folder.
>>>>
>>>> Monitoring the data from Kafka all the way out to S3 seems
>>>> unnecessary.
>>>>
>>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <kava...@gmail.com>
>>>> wrote:
>>>>
>>>>> It does not matter, in the sense that I am ready to upgrade if this
>>>>> is on the roadmap.
>>>>>
>>>>> Nonetheless:
>>>>>
>>>>> kafka_2.9.2-0.8.1.1
>>>>> apache-storm-0.9.4
>>>>>
>>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal
>>>>> <abhishc...@gmail.com> wrote:
>>>>>
>>>>>> Which version of storm-kafka are you using?
>>>>>>
>>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <kava...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Anybody? Anything about this?
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <kava...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there any way I can know which Kafka offset corresponds to the
>>>>>>>> current tuple I am processing in a bolt?
>>>>>>>>
>>>>>>>> Use case: I need to batch events from Kafka, persist them to a
>>>>>>>> local file, and eventually upload it to S3. To manage failure
>>>>>>>> cases, I need to know the Kafka offset for each message, so that
>>>>>>>> it can be persisted to ZooKeeper and used to write / upload the
>>>>>>>> file.
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
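The offset bookkeeping proposed earlier in the thread (remember which offset range went into each file, so a lost or corrupt file's exact range can be replayed) can be sketched as follows. This is a hypothetical, in-memory illustration only; in the real setup the map would be persisted to ZooKeeper, and all class and method names here are made up:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-file offset accounting: for each chunk file,
 * record the [first, last] Kafka offsets it contains. If the file is lost
 * before it becomes durable in S3, that range tells us exactly what to
 * replay. A real implementation would keep this ledger in ZooKeeper.
 */
public class FileOffsetLedger {
    /** fileName -> {firstOffset, lastOffset} */
    private final Map<String, long[]> ranges = new HashMap<>();

    /** Called when a chunk file is closed, before upload. */
    public void record(String fileName, long firstOffset, long lastOffset) {
        ranges.put(fileName, new long[] { firstOffset, lastOffset });
    }

    /** On a lost/corrupt file: which offsets must be replayed (null if none). */
    public long[] rangeToReplay(String fileName) {
        return ranges.get(fileName);
    }

    /** Once the S3 upload is confirmed, the range no longer needs replay. */
    public void markDurable(String fileName) {
        ranges.remove(fileName);
    }
}
```

This also shows why the offsets must be visible in the bolt, as the thread points out: the bolt is the component that knows which messages landed in which file, so it is the only place `record()` could be called with accurate ranges.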