You don't have to batch the whole tuple in supervisor memory; the data is
already in Kafka. Just keep the tuple ID and write the payload to the file.
When you close the file, ack all of the tuple IDs.
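A minimal sketch of that pattern, with the Storm API stubbed out: the `acker` callback stands in for `OutputCollector.ack`, and the class and threshold are illustrative, not from the thread. Only tuple IDs stay in memory; payloads go straight to disk, and everything is acked when the file closes.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers only tuple IDs in memory; payloads go straight to the file.
// When the file reaches the size threshold it is closed and every
// pending tuple ID is acked, so the spout can advance past those offsets.
class FileBatcher {
    private final long maxBytes;
    private final Consumer<String> acker;   // stand-in for OutputCollector.ack
    private final List<String> pendingIds = new ArrayList<>();
    private final Writer out;
    private long bytesWritten = 0;

    FileBatcher(Path file, long maxBytes, Consumer<String> acker) throws IOException {
        this.maxBytes = maxBytes;
        this.acker = acker;
        this.out = Files.newBufferedWriter(file);
    }

    // Called once per tuple: write the payload, remember only the ID.
    void append(String tupleId, String payload) throws IOException {
        out.write(payload);
        out.write('\n');
        bytesWritten += payload.length() + 1;
        pendingIds.add(tupleId);
        if (bytesWritten >= maxBytes) {
            close();                        // a real bolt would also roll to a new file
        }
    }

    // Close the file, then ack everything it contains.
    void close() throws IOException {
        out.close();
        for (String id : pendingIds) {
            acker.accept(id);
        }
        pendingIds.clear();
    }
}
```

If the worker dies before `close()`, nothing was acked, so the Kafka spout replays those tuples and the partial file can simply be discarded.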
On May 11, 2016 5:42 PM, "Steven Lewis" <steven.le...@walmart.com> wrote:

> It sounds like you want to use Spark / Spark Streaming to do that kind of
> batching output.
>
> From: Milind Vaidya <kava...@gmail.com>
> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
> Date: Wednesday, May 11, 2016 at 4:24 PM
> To: "user@storm.apache.org" <user@storm.apache.org>
> Subject: Re: Getting Kafka Offset in Storm Bolt
>
> Yeah. We have some micro-batching in place for other topologies. This one
> is a little ambitious, in the sense that each message is 1-2 KB in size, so
> grouping them into a reasonably sized chunk, say 500 KB to 1 GB, is
> necessary (this is just my guess; I am not sure what S3 supports or what is
> optimal). Once that chunk is uploaded, all of them can be acked. But isn't
> that overkill? I guess Storm is not even meant to support that kind of use
> case.
>
> On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <ncle...@gmail.com> wrote:
>
>> You can micro batch kafka contents into a file that's replicated (e.g.
>> HDFS) and then ack all of the input tuples after the file has been closed.
>>
>> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <kava...@gmail.com> wrote:
>>
>>> In case of a failure to upload a file, or disk corruption leading to loss
>>> of the file, we have only the current offset in the Kafka spout but no
>>> record of which offsets were lost with the file and need to be replayed.
>>> These offsets can be stored externally in ZooKeeper and used to account
>>> for the lost data. For them to be saved in ZooKeeper, they need to be
>>> available in a bolt.
>>>
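One way to make that recoverable is to track, per file, the range of offsets it covers for each partition, and persist that small record to ZooKeeper when the file is closed. The ZooKeeper write itself is omitted below; the class name and serialized format are illustrative assumptions, not anything from the thread.

```java
import java.util.Map;
import java.util.TreeMap;

// Tracks the first and last Kafka offset written to the current file,
// per partition. The serialized form is small enough to store in a
// ZooKeeper znode keyed by the file name; if the file is later lost,
// replay [first, last] for each partition.
class OffsetRange {
    // partition -> {first offset, last offset}
    private final Map<Integer, long[]> ranges = new TreeMap<>();

    void record(int partition, long offset) {
        long[] r = ranges.get(partition);
        if (r == null) {
            ranges.put(partition, new long[] { offset, offset });
        } else {
            r[0] = Math.min(r[0], offset);
            r[1] = Math.max(r[1], offset);
        }
    }

    // e.g. "0:100-250;1:90-180" -- the payload to write to ZooKeeper.
    String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, long[]> e : ranges.entrySet()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(e.getKey()).append(':')
              .append(e.getValue()[0]).append('-').append(e.getValue()[1]);
        }
        return sb.toString();
    }
}
```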
>>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <ncle...@gmail.com>
>>> wrote:
>>>
>>>> Why not just ack the tuple once it's been written to a file? If your
>>>> topology fails, the data will be re-read from Kafka; the Kafka spout
>>>> already does this for you. Uploading files to S3 is then the
>>>> responsibility of another job, for example a Storm topology that monitors
>>>> the output folder.
>>>>
>>>> Tracking the data from Kafka all the way out to S3 seems unnecessary.
>>>>
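A sketch of that separation of concerns: the topology only writes and closes files, while an independent sweeper uploads closed files and removes them on success. The `Uploader` callback stands in for a real S3 client (e.g. an AWS SDK put); the interface, class, and `*.closed` naming convention are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sweeps a directory for closed batch files, hands each to an uploader,
// and deletes the file only after the upload succeeds. If the process
// dies mid-upload, the file survives and is retried on the next sweep.
class OutputFolderSweeper {
    interface Uploader {
        void upload(Path file) throws IOException;   // e.g. an S3 put
    }

    static List<Path> sweep(Path dir, Uploader uploader) throws IOException {
        List<Path> uploaded = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.closed")) {
            for (Path f : files) {
                uploader.upload(f);   // throws on failure, so the file is kept
                Files.delete(f);      // safe to remove once uploaded
                uploaded.add(f);
            }
        }
        return uploaded;
    }
}
```

Only fully closed files (here marked by a `.closed` suffix) are swept, so a file still being written by the topology is never uploaded half-finished.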
>>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <kava...@gmail.com>
>>>> wrote:
>>>>
>>>>> It does not matter, in the sense that I am ready to upgrade if this
>>>>> feature is on the roadmap.
>>>>>
>>>>> Nonetheless:
>>>>>
>>>>> kafka_2.9.2-0.8.1.1, apache-storm-0.9.4
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <
>>>>> abhishc...@gmail.com> wrote:
>>>>>
>>>>>> Which version of storm-kafka are you using?
>>>>>>
>>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <kava...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Anybody? Any thoughts on this?
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <kava...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there any way I can find out which Kafka offset corresponds to the
>>>>>>>> current tuple I am processing in a bolt?
>>>>>>>>
>>>>>>>> Use case: I need to batch events from Kafka, persist them to a local
>>>>>>>> file, and eventually upload that file to S3. To handle failure cases,
>>>>>>>> I need to know the Kafka offset for each message, so that it can be
>>>>>>>> persisted to ZooKeeper and used when writing / uploading the file.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
