Github user mjsax commented on the pull request:
https://github.com/apache/storm/pull/694#issuecomment-134656876
I just checked some older benchmark results for batching in user land, i.e., on top of Storm (=> Aeolus). In that case, a batch size of 100 increased the spout output rate by a factor of 6 (instead of the 1.5 the benchmark above shows). The benchmark should thus yield more than 70M tuples per 30 seconds (not about 19M).
Of course, batching is done a little differently now. In Aeolus, a fat tuple is used as the batch, so the system sees only a single batch-tuple. Now, the system sees all tuples, but emitting is delayed until the batch is full (this still saves the overhead of going through the disruptor for each tuple). However, we generate a tuple ID for each tuple in the batch, instead of a single ID per batch. I am not sure how expensive this is. Since acking was not enabled, it should not be too expensive, because the IDs do not have to be "registered" with the ackers (right?).
As a further optimization, it might be a good idea not to batch whole tuples, but only the `Values` and the tuple ID. The `worker-context`, `task-id`, and `outstream-id` are the same for all tuples within a batch. I will try this out and push a new version in the next few days if it works.
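To make the idea concrete, here is a minimal sketch of that optimization, assuming hypothetical names (`EmitBatcher`, `Batch`, `downstream`) that are not part of Storm's actual API: only the per-tuple payload (values and tuple ID) is buffered, while the fields shared by every tuple in a batch (task ID, stream ID) are stored once per batch, so a full batch is handed downstream in a single call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: buffer per-tuple values and tuple IDs only;
// batch-level fields (task id, stream id) are stored once per batch.
final class EmitBatcher {
    static final class Batch {
        final int taskId;       // shared by all tuples in the batch
        final String streamId;  // shared by all tuples in the batch
        final List<List<Object>> values = new ArrayList<>(); // per-tuple payload
        final List<Long> tupleIds = new ArrayList<>();       // per-tuple id

        Batch(int taskId, String streamId) {
            this.taskId = taskId;
            this.streamId = streamId;
        }
    }

    private final int taskId;
    private final String streamId;
    private final int batchSize;
    private final Consumer<Batch> downstream; // e.g. the hand-off into the disruptor
    private Batch current;

    EmitBatcher(int taskId, String streamId, int batchSize, Consumer<Batch> downstream) {
        this.taskId = taskId;
        this.streamId = streamId;
        this.batchSize = batchSize;
        this.downstream = downstream;
        this.current = new Batch(taskId, streamId);
    }

    // Buffer one tuple; flush automatically once the batch is full.
    void emit(List<Object> tupleValues, long tupleId) {
        current.values.add(tupleValues);
        current.tupleIds.add(tupleId);
        if (current.values.size() >= batchSize) {
            flush();
        }
    }

    // Hand the current batch downstream and start a fresh one.
    void flush() {
        if (!current.values.isEmpty()) {
            downstream.accept(current);
            current = new Batch(taskId, streamId);
        }
    }
}
```

The point of the sketch is that one `downstream.accept(...)` call replaces `batchSize` individual hand-offs, while the shared metadata is not duplicated per tuple.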