1) Agreed that short duration blocks are less likely to cause an issue but
it will be much better to avoid them. Anyway, a conclusion is not easy
without some benchmark. I will let You know if I am able to do some volume
testing on both options and observe significant difference.

The auto spout block/wait when no messages are emitted makes absolute sense
but a different scenario, less likely in a mostly loaded topology.

You are absolutely correct about decoupling the consumption from processing
when you see it from the perspective of Storm. I think there is a very
subtle difference when we keep Kafka in mind. It is the spout thread which
is polling (consuming) and iterating (processing) over the polled messages.
If we separate the consumption in another thread and push messages in a
queue, iterating (processing) is now concurrent and decoupled. Anyway, I
too feel there should not be any significant difference in throughput but
it will be interesting to measure the difference.

I went overboard while mentioning seek before every poll when I had the
failure scenario in mind.

Thanks a lot. I will keep you guys updated on the switch from custom spout
to KafkaSpout.

Thanks for the amazing work.

Chandan

On Tue, Jul 11, 2017 at 12:56 AM, chandan singh <cks07...@gmail.com> wrote:

> Sorry about being cryptic there. What I meant is that it will be much
> better if we don't make assumptions about frequency of failure rates in
> topologies. I know it is more of a commonsense but out of curiosity, can
> you point me to any Storm documentation which makes a comment on preferable
> failure rates. I was suggesting if we can offer the user an optimization
> through clean API, the user will be free to decide on the rationale of
> using it.
>
> On Tue, Jul 11, 2017 at 12:06 AM, Bobby Evans <ev...@yahoo-inc.com.invalid
> > wrote:
>
>> I'm not sure what assumptions you want to make that this is preventing,
>> or why they would be helpful.
>>
>> - Bobby
>>
>>
>> On Monday, July 10, 2017, 12:14:53 PM CDT, chandan singh <
>> cks07...@gmail.com> wrote:
>>
>> Hi Stig & Bobby
>>
>> Thanks for confirming my understanding.
>>
>> 1) Ensuring that calls to nexTuple(), ack()  and fail() are non-blocking
>> has been a guideline on http://storm.apache.org/releas
>> es/1.1.0/Concepts.html
>> for long. Copying verbatim here : "The main method on spouts is nextTuple.
>> nextTuple either emits a new tuple into the topology or simply returns if
>> there are no new tuples to emit. It is imperative that nextTuple does not
>> block for any spout implementation, because Storm calls all the spout
>> methods on the same thread." I admit that there is some chance my
>> interpretation is partially incorrect but I have been following it in a
>> custom spout till now. Even though the objective is different, there is a
>> similar hint on Kafka official documentation. Please see under heading "2.
>> Decouple Consumption and Processing" on
>> https://kafka.apache.org/0110/javadoc/index.html?org/apache/
>> kafka/clients/consumer/KafkaConsumer.html.
>> Essentially, a thread polls Kafka and spout thread gets the messages
>> through a shared queue. If pre-fetching is present in Kafka (I will read
>> about it further), I assume we do not have to fetch in another thread but
>> I
>> am not sure how does the pre-fetching behave with re-seeking before every
>> poll.
>>
>> 2) @Bobby, you are correct in pointing what needs to be optimized but the
>> facts, sometimes, prevent us from making assumptions. We do optimize our
>> retry loop such that we don't poll the messages again. I especially see
>> problems when combined with exponential back off.  I am not sure how
>> difficult or clean will it be to expose some sort of configuration to
>> allow
>> such optimization?  Do you think it will be worth trying out something?
>>
>> Thanks
>> Chandan
>>
>
>

Reply via email to