I also considered that. The approach was as follows:

1. Can the existing storm-kafka setup be leveraged? (See the sketch at the end of this message.)
2. Is there any "proven" open-source framework for the same?

Spark looks like the next-best option, since it keeps the paradigm the same.

We also considered:
Secor (https://github.com/pinterest/secor/blob/master/DESIGN.md)
Streamx (https://github.com/qubole/streamx)

Both look promising, with Secor looking the more promising of the two.
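
On (1): if we are open to upgrading, newer storm-kafka versions appear to
ship a metadata-aware scheme that puts the offset on the tuple, which would
answer the original question directly. A rough, untested sketch (class and
field names are from my reading of the storm-kafka source, so treat them as
assumptions):

    // storm.kafka.* from the storm-kafka module
    SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts("zk1:2181"), "events", "/kafka-spout", "event-spout");
    // The metadata scheme hands each message's partition and offset to the
    // deserializer, which can emit them as extra tuple fields.
    spoutConfig.scheme = new MessageMetadataSchemeAsMultiScheme(
        new StringMessageAndMetadataScheme());
    KafkaSpout spout = new KafkaSpout(spoutConfig);
    // A downstream bolt could then read the offset off the tuple, e.g.
    // long offset = tuple.getLongByField("offset");  // field name assumed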

On Wed, May 11, 2016 at 2:40 PM, Steven Lewis <steven.le...@walmart.com>
wrote:

> It sounds like you want to use Spark / Spark Streaming to do that kind of
> batching output.
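>
> Roughly like this with the direct Kafka stream (untested sketch; broker,
> topic, and S3 path are placeholders, and exact signatures vary by Spark
> version):
>
>     // spark-streaming + spark-streaming-kafka (Kafka 0.8 direct API)
>     Map<String, String> kafkaParams = new HashMap<String, String>();
>     kafkaParams.put("metadata.broker.list", "broker1:9092");
>     JavaStreamingContext jssc = new JavaStreamingContext(
>         new SparkConf().setAppName("kafka-to-s3"),
>         Durations.minutes(5));  // each micro-batch becomes one output chunk
>     JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
>         jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
>         kafkaParams, Collections.singleton("events"));
>     // one directory of output files per batch; the direct stream tracks
>     // offsets itself, so a failed batch is recomputed rather than lost
>     stream.foreachRDD((rdd, time) ->
>         rdd.values().saveAsTextFile("s3n://bucket/events/" + time.milliseconds()));
>     jssc.start();
>     jssc.awaitTermination();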
>
> From: Milind Vaidya <kava...@gmail.com>
> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
> Date: Wednesday, May 11, 2016 at 4:24 PM
> To: "user@storm.apache.org" <user@storm.apache.org>
> Subject: Re: Getting Kafka Offset in Storm Bolt
>
> Yeah. We have some micro-batching in place for other topologies. This one
> is a little ambitious, in the sense that each message is 1~2 KB in size, so
> grouping them into a reasonably sized chunk is necessary, say 500 KB ~ 1 GB
> (that is just my guess; I am not sure what S3 supports or what is optimal).
> Once that chunk is uploaded, all of the messages in it can be acked. But
> isn't that overkill? I guess Storm is not even meant to support that kind
> of use case.
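>
> Something like the sketch below is what I had in mind (untested;
> uploadChunkToS3() is a placeholder, not a real API):
>
>     import java.io.ByteArrayOutputStream;
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map;
>     import backtype.storm.task.OutputCollector;
>     import backtype.storm.task.TopologyContext;
>     import backtype.storm.topology.OutputFieldsDeclarer;
>     import backtype.storm.topology.base.BaseRichBolt;
>     import backtype.storm.tuple.Tuple;
>
>     public class S3ChunkBolt extends BaseRichBolt {
>         private static final int CHUNK_BYTES = 500 * 1024;  // threshold -- my guess
>         private OutputCollector collector;
>         private List<Tuple> pending;
>         private ByteArrayOutputStream buffer;
>
>         @Override
>         public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
>             this.collector = collector;
>             this.pending = new ArrayList<Tuple>();
>             this.buffer = new ByteArrayOutputStream();
>         }
>
>         @Override
>         public void execute(Tuple tuple) {
>             byte[] msg = tuple.getBinary(0);
>             buffer.write(msg, 0, msg.length);
>             pending.add(tuple);  // hold the ack until the chunk is durable
>             if (buffer.size() >= CHUNK_BYTES) {
>                 uploadChunkToS3(buffer.toByteArray());  // placeholder
>                 for (Tuple t : pending) {
>                     collector.ack(t);  // upload succeeded, safe to ack
>                 }
>                 pending.clear();
>                 buffer.reset();
>             }
>             // if the upload fails, we simply don't ack, and the messages
>             // are replayed from the spout after the tuple timeout
>         }
>
>         @Override
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             // terminal bolt, nothing emitted
>         }
>
>         private void uploadChunkToS3(byte[] chunk) {
>             // placeholder for the actual uploader
>         }
>     }
>
> One catch: everything held back has to be acked within
> topology.message.timeout.secs, so a 1 GB chunk at this message size would
> force a very large timeout -- part of why it feels like overkill.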
>
> On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <ncle...@gmail.com> wrote:
>
>> You can micro-batch the Kafka contents into a file that's replicated (e.g.
>> on HDFS) and then ack all of the input tuples after the file has been
>> closed.
>> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <kava...@gmail.com> wrote:
>>
>>> In case of a failure to upload a file, or disk corruption leading to the
>>> loss of a file, we have only the current offset in the Kafka spout but no
>>> record of which offsets were lost with the file and need to be replayed.
>>> Those offsets can be stored externally in ZooKeeper and used to account
>>> for lost data. But to save them in ZK, they first have to be available in
>>> a bolt.
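>>>
>>> For the ZK write itself, Curator makes it a few lines (sketch; the znode
>>> layout and lastUploadedOffset are just illustrative):
>>>
>>>     // org.apache.curator:curator-framework
>>>     CuratorFramework zk = CuratorFrameworkFactory.newClient(
>>>         "zk1:2181", new ExponentialBackoffRetry(1000, 3));
>>>     zk.start();
>>>     // one znode per topic/partition with the last offset safely uploaded
>>>     String path = "/uploaded-offsets/events/partition-7";
>>>     byte[] data = Long.toString(lastUploadedOffset)
>>>         .getBytes(java.nio.charset.StandardCharsets.UTF_8);
>>>     if (zk.checkExists().forPath(path) == null) {
>>>         zk.create().creatingParentsIfNeeded().forPath(path, data);
>>>     } else {
>>>         zk.setData().forPath(path, data);
>>>     }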
>>>
>>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <ncle...@gmail.com>
>>> wrote:
>>>
>>>> Why not just ack the tuple once it's been written to a file? If your
>>>> topology fails, the data will be re-read from Kafka; the Kafka spout
>>>> already does this for you. Uploading files to S3 is then the
>>>> responsibility of another job, for example a Storm topology that monitors
>>>> the output folder.
>>>>
>>>> Monitoring the data from Kafka all the way out to S3 seems unnecessary.
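>>>>
>>>> The uploader can be as simple as a directory watcher (sketch with the
>>>> AWS SDK; bucket and paths are placeholders):
>>>>
>>>>     // aws-java-sdk-s3 + java.nio.file
>>>>     static void uploadLoop() throws Exception {
>>>>         AmazonS3 s3 = new AmazonS3Client();  // default credential chain
>>>>         Path outDir = Paths.get("/data/closed-chunks");
>>>>         WatchService watcher = FileSystems.getDefault().newWatchService();
>>>>         outDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
>>>>         while (true) {
>>>>             WatchKey key = watcher.take();
>>>>             for (WatchEvent<?> event : key.pollEvents()) {
>>>>                 Path file = outDir.resolve((Path) event.context());
>>>>                 // the topology should move files here only once closed
>>>>                 s3.putObject("my-bucket", "chunks/" + file.getFileName(),
>>>>                     file.toFile());
>>>>             }
>>>>             key.reset();
>>>>         }
>>>>     }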
>>>>
>>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <kava...@gmail.com>
>>>> wrote:
>>>>
>>>>> It does not matter, in the sense that I am ready to upgrade if this is
>>>>> on the roadmap.
>>>>>
>>>>> Nonetheless:
>>>>>
>>>>> kafka_2.9.2-0.8.1.1, apache-storm-0.9.4
>>>>>
>>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <
>>>>> abhishc...@gmail.com> wrote:
>>>>>
>>>>>> Which version of storm-kafka are you using?
>>>>>>
>>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <kava...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Anybody? Anything about this?
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <kava...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is there any way I can know which Kafka offset corresponds to the
>>>>>>>> current tuple I am processing in a bolt?
>>>>>>>>
>>>>>>>> Use case: I need to batch events from Kafka, persist them to a local
>>>>>>>> file, and eventually upload that file to S3. To manage failure cases,
>>>>>>>> I need to know the Kafka offset for each message, so that it can be
>>>>>>>> persisted to ZooKeeper and used when writing / uploading the file.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Abhishek Agarwal
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
