That was exactly the thing I had in mind. I guess I should give it a try and see how it performs and how convenient it is; I can't just speculate about that.
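The approach being weighed here (keep only tuple IDs in memory, append payloads to a rolling file, and ack everything once the file is closed, as Nathan suggests below) might be sketched roughly like this. All names are hypothetical and plain JDK types stand in for Storm's; in a real bolt the callback would be OutputCollector::ack and the IDs would be the anchored Tuple objects:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch (not storm-kafka API) of the pattern discussed in this
// thread: append each payload to the current batch file, buffer only the
// tuple IDs, and ack every buffered ID once the file is rolled.
public class FileBatcher {
    private final long rollBytes;          // roll threshold, e.g. 500 KB
    private final Consumer<String> acker;  // stand-in for collector.ack(tuple)
    private final List<String> pendingIds = new ArrayList<>();
    private Path current;
    private long written;

    public FileBatcher(long rollBytes, Consumer<String> acker) {
        this.rollBytes = rollBytes;
        this.acker = acker;
        this.current = newFile();
    }

    // Called once per incoming tuple (cf. IRichBolt.execute).
    public void add(String tupleId, String payload) {
        byte[] bytes = (payload + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            Files.write(current, bytes, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // a real bolt would fail the tuple instead
        }
        pendingIds.add(tupleId);
        written += bytes.length;
        if (written >= rollBytes) {
            roll();
        }
    }

    // Close the batch (conceptually: hand it off for S3 upload) and only
    // then ack everything in it, since the data is now durable outside Storm.
    public Path roll() {
        Path closed = current;
        pendingIds.forEach(acker);
        pendingIds.clear();
        current = newFile();
        written = 0;
        return closed;
    }

    private Path newFile() {
        try {
            return Files.createTempFile("batch-", ".log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because nothing is acked until roll(), a crash before the roll leaves the tuples un-acked and the Kafka spout replays them; possible duplicates in a partially written file are the usual price of at-least-once delivery.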
On Wed, May 11, 2016 at 2:44 PM, Nathan Leung <ncle...@gmail.com> wrote:

> You don't have to batch the whole tuple in supervisor memory; the data is
> already in Kafka. Just keep the tuple ID and write to the file. When you
> close the file, ack all of the tuple IDs.
>
> On May 11, 2016 5:42 PM, "Steven Lewis" <steven.le...@walmart.com> wrote:
>
>> It sounds like you want to use Spark / Spark Streaming to do that kind of
>> batching output.
>>
>> From: Milind Vaidya <kava...@gmail.com>
>> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
>> Date: Wednesday, May 11, 2016 at 4:24 PM
>> To: "user@storm.apache.org" <user@storm.apache.org>
>> Subject: Re: Getting Kafka Offset in Storm Bolt
>>
>> Yeah, we have some micro-batching in place for other topologies. This one
>> is a little ambitious, in the sense that each message is 1-2 KB in size,
>> so grouping them into a reasonably sized chunk, say 500 KB to 1 GB, is
>> necessary (this is just my guess; I am not sure what S3 supports or what
>> is optimal). Once that chunk is uploaded, all of the messages in it can
>> be acked. But isn't that overkill? I guess Storm is not even meant to
>> support that kind of use case.
>>
>> On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <ncle...@gmail.com> wrote:
>>
>>> You can micro-batch the Kafka contents into a file that is replicated
>>> (e.g. on HDFS) and then ack all of the input tuples after the file has
>>> been closed.
>>>
>>> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <kava...@gmail.com> wrote:
>>>
>>>> In case of a failure to upload a file, or disk corruption leading to
>>>> loss of the file, we have only the current offset in the Kafka spout,
>>>> but no record of which offsets went into the lost file and need to be
>>>> replayed. Those offsets could be stored externally in ZooKeeper and
>>>> used to account for the lost data. For them to be saved in ZK, they
>>>> have to be available in a bolt.
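The bookkeeping Milind describes above reduces to remembering the first and last Kafka offset written into each batch file; those two numbers per file are what would be persisted to ZooKeeper and used to replay a lost file. A minimal, Storm-free sketch with hypothetical names (the ZK write itself, e.g. via Curator, is left out):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for the ZooKeeper idea above: record the offset
// range that went into each batch file, so a file lost before reaching S3
// can be mapped back to the exact offsets that need replaying. Persisting
// rangeFor(file) to ZK is a separate concern and is not shown here.
public class OffsetRanges {
    public static final class Range {
        public final long first;
        public final long last;
        Range(long first, long last) { this.first = first; this.last = last; }
    }

    private final Map<String, long[]> ranges = new HashMap<>();

    // Called for every message appended to `file` at Kafka offset `offset`.
    public void record(String file, long offset) {
        ranges.merge(file, new long[] { offset, offset },
            (old, cur) -> new long[] { Math.min(old[0], cur[0]),
                                       Math.max(old[1], cur[1]) });
    }

    // The offset range to replay if `file` is lost; null if file is unknown.
    public Range rangeFor(String file) {
        long[] r = ranges.get(file);
        return r == null ? null : new Range(r[0], r[1]);
    }
}
```

For this to work, the offset has to reach the bolt alongside each message, which is exactly the question the thread started with.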
>>>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <ncle...@gmail.com> wrote:
>>>>
>>>>> Why not just ack the tuple once it has been written to a file? If
>>>>> your topology fails, the data will be re-read from Kafka; the Kafka
>>>>> spout already does this for you. Uploading files to S3 is then the
>>>>> responsibility of another job, for example a Storm topology that
>>>>> monitors the output folder.
>>>>>
>>>>> Tracking the data from Kafka all the way out to S3 seems unnecessary.
>>>>>
>>>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <kava...@gmail.com> wrote:
>>>>>
>>>>>> It does not matter, in the sense that I am ready to upgrade if this
>>>>>> is on the roadmap.
>>>>>>
>>>>>> Nonetheless:
>>>>>>
>>>>>> kafka_2.9.2-0.8.1.1, apache-storm-0.9.4
>>>>>>
>>>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
>>>>>>
>>>>>>> Which version of storm-kafka are you using?
>>>>>>>
>>>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <kava...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Anybody? Any thoughts on this?
>>>>>>>>
>>>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <kava...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Is there any way I can know which Kafka offset corresponds to the
>>>>>>>>> current tuple I am processing in a bolt?
>>>>>>>>>
>>>>>>>>> Use case: I need to batch events from Kafka, persist them to a
>>>>>>>>> local file, and eventually upload that file to S3. To manage
>>>>>>>>> failure cases, I need to know the Kafka offset for each message so
>>>>>>>>> that it can be persisted to ZooKeeper and used when writing and
>>>>>>>>> uploading the file.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Abhishek Agarwal