[
https://issues.apache.org/jira/browse/KAFKA-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876816#comment-17876816
]
Ajit Singh edited comment on KAFKA-17424 at 8/27/24 7:00 AM:
-------------------------------------------------------------
Hi [~gharris1727], thank you for your quick response.
Allow me some time to run both cases and capture screenshots for you.
I have traced an OutOfMemoryError back to
task.put(new ArrayList<>(messageBatch)).
To elaborate on my point: each record on the Kafka topic I am consuming from is
a MySQL or MongoDB row with its schema, and some tables have fairly large JSON
columns, which puts a single record close to 500 KB. A batch of 500 such
records is therefore about 250 MB, and while the copy exists both lists stay
reachable, so roughly 0.5 GB of memory is tied up by a single running task; if
we submit several such consumer tasks we end up consuming far more memory than
required.
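To make the pattern concrete, here is a minimal, self-contained sketch of the
batching flow as I understand it (simplified stand-in names and types, not the
exact WorkerSinkTask code):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the batching pattern the OutOfMemoryError traces back to
// (simplified; not the exact WorkerSinkTask code). The worker fills
// messageBatch, hands the sink a fresh copy, and only empties the original
// after the whole batch is delivered, so both lists stay reachable while
// put() runs.
public class BatchCopySketch {
    private final List<String> messageBatch = new ArrayList<>();

    // Stand-in for SinkTask.put(Collection<SinkRecord>)
    private void put(List<String> records) {
        System.out.println("delivering " + records.size() + " records");
    }

    void deliverMessages() {
        // Copy handed to the sink so a custom task can mutate it freely.
        put(new ArrayList<>(messageBatch));
        // Original is cleared only after delivery of the whole batch.
        messageBatch.clear();
    }

    public static void main(String[] args) {
        BatchCopySketch worker = new BatchCopySketch();
        for (int i = 0; i < 500; i++) {
            // In my workload each record is close to 500 KB (row + large JSON columns).
            worker.messageBatch.add("record-" + i);
        }
        worker.deliverMessages();
    }
}
{code}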
> Memory optimisation for Kafka-connect
> -------------------------------------
>
> Key: KAFKA-17424
> URL: https://issues.apache.org/jira/browse/KAFKA-17424
> Project: Kafka
> Issue Type: Improvement
> Components: connect
> Affects Versions: 3.8.0
> Reporter: Ajit Singh
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When Kafka Connect gives the sink task its own copy of the List<SinkRecord>,
> RAM utilisation shoots up: at that moment there are two lists alive, and the
> original list is only cleared after the sink worker finishes the current
> batch.
>
> Originally the list is declared final and a copy of it is handed to the sink
> task, because sink tasks can be custom code and this lets users process the
> records however they want without any risk to the worker's own list. But one
> of the most popular uses of Kafka Connect is OLTP-to-OLAP replication, and
> during initial copying/snapshots a lot of data is produced very rapidly,
> which fills the list up to its maximum batch size, so we are prone to
> "Out of Memory" exceptions. The only thing the list does is: get filled >
> get cloned for the sink > report its size > get cleared > repeat. So instead
> I take the size of the list before handing the original list to the sink
> task, and after the sink has performed its operations I set
> list = new ArrayList<>(). I did not use clear(), just in case the sink task
> has set our list to null (reassignment is safe either way); the change is
> roughly as sketched below.
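> The change I have in mind looks roughly like this (a sketch of the idea
> only, not the final patch; names are simplified stand-ins for the worker
> code):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Sketch of the proposed shape of the delivery step: remember the batch
> // size, hand the sink the original list (no copy), then re-point the field
> // to a new ArrayList instead of calling clear().
> public class NoCopyDeliverySketch {
>     private List<String> messageBatch = new ArrayList<>(); // no longer final
>
>     // Stand-in for SinkTask.put(Collection<SinkRecord>)
>     private void put(List<String> records) {
>         System.out.println("delivering " + records.size() + " records");
>     }
>
>     void deliverMessages() {
>         int batchSize = messageBatch.size(); // captured before the sink touches the list
>         put(messageBatch);                   // original list handed over, no copy made
>         // Fresh list instead of clear(): the worker stays safe even if the
>         // sink kept a reference to or emptied the list it was given, and the
>         // old backing array becomes collectable.
>         messageBatch = new ArrayList<>();
>         System.out.println("finished batch of " + batchSize + " records");
>     }
>
>     public static void main(String[] args) {
>         NoCopyDeliverySketch worker = new NoCopyDeliverySketch();
>         for (int i = 0; i < 500; i++) {
>             worker.messageBatch.add("record-" + i);
>         }
>         worker.deliverMessages();
>     }
> }
> {code}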
> There is a time vs memory trade-off: in the original approach the JVM does
> not have to spend time allocating a new list, whereas in the new approach it
> has to allocate a fresh list (and find free memory for it) on every batch,
> but the result is more memory freed sooner.