[
https://issues.apache.org/jira/browse/KAFKA-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879445#comment-15879445
]
Armin Braun edited comment on KAFKA-1895 at 2/22/17 11:16 PM:
--------------------------------------------------------------
{quote}
Maybe a simpler way to achieve that would be to have a new Deserializer type
which works with byte buffers instead of byte arrays?
{quote}
Having buffers here would be better than arrays and would already allow a lot of
optimizations. The downside I see is that it would not give you reuse of the
deserialized object quite so naturally. You would also have to start supporting two
kinds of deserializers, which would add a lot of complication to the code just to
give users the same thing they'd get from the RawRecordIterator interface: the
option to reuse the deserialized object.
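Just to illustrate the object-reuse point, a minimal sketch of what such a
buffer-based deserializer could look like (the interface and method names are
assumptions for illustration, not existing Kafka API):
{code}
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical buffer-based deserializer, sketched only to illustrate the point
// above; this interface does not exist in Kafka.
public interface ByteBufferDeserializer<T> {

    void configure(Map<String, ?> configs, boolean isKey);

    // Reads one value from the buffer's current position up to its limit.
    // Nothing in this signature lets the caller pass in an instance to fill,
    // so reuse of the deserialized object does not come naturally.
    T deserialize(String topic, ByteBuffer data);

    void close();
}
{code}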
I agree here to some degree though:
{quote}
would make the consumer more confusing (we've had a tough enough time
explaining how the current API works).
{quote}
Yes, this would make it more confusing, but on the other hand the existing API
would not change. Going through the deserializers would probably keep the external
API a little simpler (if slower), but at a pretty high price in added complexity
inside the codebase.
My argument for why adding another method is not so bad is that the interface is
already fairly complex: one more method with proper javadoc will not, imo, be the
thing that tips anyone who understood the interface before into no longer
understanding it. Admittedly not the strongest argument in the world, but it feels
like a reasonable tradeoff given the size of the change otherwise required (or the
added complexity of having different deserializer interfaces).
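For concreteness, the kind of additional entry point I have in mind would look
roughly like the sketch below; RawRecordIterator and its methods are illustrative
only and do not exist in the codebase.
{code}
import java.nio.ByteBuffer;

// Purely illustrative; none of these names exist in Kafka. The idea is a separate
// entry point that hands back the raw serialized record bytes and leaves
// deserialization (and object reuse) entirely to the caller.
public interface RawRecordIterator {

    // Advance to the next record; returns false once the fetched data is exhausted.
    boolean next();

    // Read-only views positioned and limited to the current record's key and value.
    ByteBuffer key();

    ByteBuffer value();
}
{code}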
{quote}
it might not be a great idea to give users direct access to the underlying
buffers.
{quote}
I would solve this by returning read-only buffers with the position and limit set
correctly for a record set. That means the user must do some bounds checking, but
Hadoop's RawKeyValueIterator requires the same and it is not an issue in my opinion.
The other option would be to wrap the buffers in, say, `DataInput` to make the
interface safer at the cost of a slight overhead (and the fact that some users
would rather work from buffers than from DataInput).
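To make the two options concrete, here is a minimal sketch, assuming a backing
ByteBuffer plus per-record offsets that are not part of any real Kafka API: handing
out a read-only slice limited to one record, and alternatively wrapping that slice
in a DataInput.
{code}
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative sketch only; the backing buffer and record offsets are assumptions,
// not anything that exists in the Kafka codebase.
final class RecordViews {

    // Option 1: a read-only view whose position and limit cover exactly one record.
    static ByteBuffer readOnlySlice(ByteBuffer backing, int offset, int length) {
        ByteBuffer view = backing.asReadOnlyBuffer();
        view.position(offset);
        view.limit(offset + length);
        return view.slice(); // independent position/limit, still read-only
    }

    // Option 2: wrap the slice in a DataInput for a safer, stream-like interface,
    // at the cost of a small per-record allocation.
    static DataInput asDataInput(final ByteBuffer slice) {
        return new DataInputStream(new InputStream() {
            @Override
            public int read() {
                return slice.hasRemaining() ? (slice.get() & 0xFF) : -1;
            }
        });
    }
}
{code}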
> Investigate moving deserialization and decompression out of KafkaConsumer
> -------------------------------------------------------------------------
>
> Key: KAFKA-1895
> URL: https://issues.apache.org/jira/browse/KAFKA-1895
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Jay Kreps
>
> The consumer implementation in KAFKA-1760 decompresses fetch responses and
> deserializes them into ConsumerRecords which are then handed back as the
> result of poll().
> There are several downsides to this:
> 1. It is impossible to scale serialization and decompression work beyond the
> single thread running the KafkaConsumer.
> 2. The results can come back during the processing of other calls such as
> commit() etc which can result in caching these records a little longer.
> An alternative would be to have ConsumerRecords wrap the actual compressed
> serialized MemoryRecords chunks and do the deserialization during iteration.
> This way you could scale this over a thread pool if needed.
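A rough, hypothetical sketch of the deferred-deserialization idea described above,
using a placeholder raw-chunk type and string values rather than real Kafka classes:
{code}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only: raw chunks are handed back still serialized and then
// deserialized on a worker pool instead of on the consumer thread. RawChunk is a
// placeholder, not a Kafka type.
final class DeferredDeserialization {

    static final class RawChunk {
        final byte[] valueBytes;
        RawChunk(byte[] valueBytes) { this.valueBytes = valueBytes; }
    }

    static List<String> deserializeOnPool(List<RawChunk> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final RawChunk chunk : chunks) {
                // Each chunk is deserialized on the pool, not on the polling thread.
                futures.add(pool.submit(() ->
                        new String(chunk.valueBytes, StandardCharsets.UTF_8)));
            }
            List<String> values = new ArrayList<>();
            for (Future<String> f : futures) {
                values.add(f.get());
            }
            return values;
        } finally {
            pool.shutdown();
        }
    }
}
{code}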