[
https://issues.apache.org/jira/browse/KAFKA-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879445#comment-15879445
]
Armin Braun edited comment on KAFKA-1895 at 2/22/17 11:16 PM:
--------------------------------------------------------------
{quote}
Maybe a simpler way to achieve that would be to have a new Deserializer type
which works with byte buffers instead of byte arrays?
{quote}
Having buffers here would be better than arrays and would already allow a lot of
optimizations. The downside I see is that it would not give you reuse of the
deserialized object quite so naturally. You would also have to start supporting two
kinds of deserializers, which would add a lot of complication to the code just to
give users the same thing they'd get from the RawRecordIterator interface: the
option to reuse the deserialized object.
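Just to illustrate the object-reuse point, a minimal sketch of what such a
buffer-based deserializer could look like (the interface and method names are
assumptions for illustration, not existing Kafka API):
{code}
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical buffer-based deserializer, sketched only to illustrate the point
// above; this interface does not exist in Kafka.
public interface ByteBufferDeserializer<T> {

    void configure(Map<String, ?> configs, boolean isKey);

    // Reads one value from the buffer's current position up to its limit.
    // Nothing in this signature lets the caller pass in an instance to fill,
    // so reuse of the deserialized object does not come naturally.
    T deserialize(String topic, ByteBuffer data);

    void close();
}
{code}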
I agree here to some degree though:
{quote}
would make the consumer more confusing (we've had a tough enough time
explaining how the current API works).
{quote}
Yes, this would make it more confusing, but on the other hand the existing API
would not change. Going through the deserializers would probably keep the external
API a little simpler (if slower), but at a pretty high price in added complexity
inside the codebase.
My argument for why adding another method is not so bad is that the interface is
already fairly complex: one more method with proper javadoc will not, imo, be the
thing that tips anyone who understood the interface before into no longer
understanding it. Admittedly not the strongest argument in the world, but it feels
like a reasonable tradeoff given the size of the change otherwise required (or the
added complexity of having different deserializer interfaces).
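For concreteness, the kind of additional entry point I have in mind would look
roughly like the sketch below; RawRecordIterator and its methods are illustrative
only and do not exist in the codebase.
{code}
import java.nio.ByteBuffer;

// Purely illustrative; none of these names exist in Kafka. The idea is a separate
// entry point that hands back the raw serialized record bytes and leaves
// deserialization (and object reuse) entirely to the caller.
public interface RawRecordIterator {

    // Advance to the next record; returns false once the fetched data is exhausted.
    boolean next();

    // Read-only views positioned and limited to the current record's key and value.
    ByteBuffer key();

    ByteBuffer value();
}
{code}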
{quote}
it might not be a great idea to give users direct access to the underlying
buffers.
{quote}
I would solve this by returning read-only buffers with the position and limit set
correctly for a record set. That means the user must do some bounds checking, but
Hadoop's RawKeyValueIterator requires the same and it is not an issue in my opinion.
The other option would be to wrap the buffers in, say, `DataInput` to make the
interface safer at the cost of a slight overhead (and the fact that some users
would rather work from buffers than from DataInput).
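To make the two options concrete, here is a minimal sketch, assuming a backing
ByteBuffer plus per-record offsets that are not part of any real Kafka API: handing
out a read-only slice limited to one record, and alternatively wrapping that slice
in a DataInput.
{code}
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative sketch only; the backing buffer and record offsets are assumptions,
// not anything that exists in the Kafka codebase.
final class RecordViews {

    // Option 1: a read-only view whose position and limit cover exactly one record.
    static ByteBuffer readOnlySlice(ByteBuffer backing, int offset, int length) {
        ByteBuffer view = backing.asReadOnlyBuffer();
        view.position(offset);
        view.limit(offset + length);
        return view.slice(); // independent position/limit, still read-only
    }

    // Option 2: wrap the slice in a DataInput for a safer, stream-like interface,
    // at the cost of a small per-record allocation.
    static DataInput asDataInput(final ByteBuffer slice) {
        return new DataInputStream(new InputStream() {
            @Override
            public int read() {
                return slice.hasRemaining() ? (slice.get() & 0xFF) : -1;
            }
        });
    }
}
{code}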
> Investigate moving deserialization and decompression out of KafkaConsumer
> -------------------------------------------------------------------------
>
> Key: KAFKA-1895
> URL: https://issues.apache.org/jira/browse/KAFKA-1895
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Jay Kreps
>
> The consumer implementation in KAFKA-1760 decompresses fetch responses and
> deserializes them into ConsumerRecords which are then handed back as the
> result of poll().
> There are several downsides to this:
> 1. It is impossible to scale serialization and decompression work beyond the
> single thread running the KafkaConsumer.
> 2. The results can come back during the processing of other calls such as
> commit() etc which can result in caching these records a little longer.
> An alternative would be to have ConsumerRecords wrap the actual compressed
> serialized MemoryRecords chunks and do the deserialization during iteration.
> This way you could scale this over a thread pool if needed.
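A rough, hypothetical sketch of the deferred-deserialization idea described above,
using a placeholder raw-chunk type and string values rather than real Kafka classes:
{code}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only: raw chunks are handed back still serialized and then
// deserialized on a worker pool instead of on the consumer thread. RawChunk is a
// placeholder, not a Kafka type.
final class DeferredDeserialization {

    static final class RawChunk {
        final byte[] valueBytes;
        RawChunk(byte[] valueBytes) { this.valueBytes = valueBytes; }
    }

    static List<String> deserializeOnPool(List<RawChunk> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final RawChunk chunk : chunks) {
                // Each chunk is deserialized on the pool, not on the polling thread.
                futures.add(pool.submit(() ->
                        new String(chunk.valueBytes, StandardCharsets.UTF_8)));
            }
            List<String> values = new ArrayList<>();
            for (Future<String> f : futures) {
                values.add(f.get());
            }
            return values;
        } finally {
            pool.shutdown();
        }
    }
}
{code}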