[ https://issues.apache.org/jira/browse/ARROW-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550237#comment-17550237 ]

Lubo Slivka commented on ARROW-16697:
-------------------------------------

Hello [~lidavidm], thanks for looking into this. I was oblivious to the 
allocator behavior and unaware of malloc trim, so I went down the leak rabbit 
hole. With this new info, I think I can move forward and follow the existing 
sources on this topic.

It seems to me that this is about tuning the malloc behavior 
([https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html])
and perhaps, if needed, also triggering a malloc trim.
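
For reference, a minimal sketch (glibc/Linux only, nothing that PyArrow 
exposes itself) of triggering a trim from Python via ctypes; how often to 
call it would be a matter of experimentation:

{code:python}
import ctypes
import ctypes.util

# glibc-specific: malloc_trim(3) releases free memory from the top of the
# heap (and, in newer glibc versions, from all arenas) back to the OS.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

def trim_malloc() -> bool:
    # Returns True if any memory was actually released.
    return bool(libc.malloc_trim(0))
{code}

One could call this e.g. after each DoPut finishes, or on a timer, and watch 
how RSS responds.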

---

I would like to expand on the behavior you mentioned, that the RSS usage 
stabilizes at some point: what I see is that the point where RSS stabilizes 
is a function of the number of concurrent clients. So let's say with 64 
concurrent clients, the high watermark keeps going up (4GB is no problem; 
running with 64 clients for longer, I was able to surpass 10GB).

Perhaps some gRPC behavior and overhead combine with malloc, and all of that 
contributes to how high the memory usage can climb?
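
If each concurrent client ends up being served from its own glibc malloc 
arena (glibc creates up to 8 * number-of-cores arenas by default), that would 
be consistent with the watermark scaling with concurrency. A minimal sketch, 
assuming glibc, of capping the arena count via mallopt (the cap of 2 is 
arbitrary; setting MALLOC_ARENA_MAX=2 in the environment before launching the 
server does the same):

{code:python}
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))

M_ARENA_MAX = -8  # value of M_ARENA_MAX in glibc's <malloc.h>

# Cap the number of malloc arenas; this should run before the server threads
# start allocating heavily.
libc.mallopt(M_ARENA_MAX, 2)
{code}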

> [FlightRPC][Python] Server seems to leak memory during DoPut
> ------------------------------------------------------------
>
>                 Key: ARROW-16697
>                 URL: https://issues.apache.org/jira/browse/ARROW-16697
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Lubo Slivka
>            Assignee: David Li
>            Priority: Major
>         Attachments: leak_repro_client.py, leak_repro_server.py, sample.csv.gz
>
>
> Hello,
> We are stress testing our Flight RPC server (PyArrow 8.0.0) with write-heavy 
> workloads and are running into what appear to be memory leaks.
> The server is put under pressure by a number of separate clients doing DoPut. 
> What we are seeing is that the server's memory usage only ever goes up, until 
> the server finally gets whacked by k8s for hitting its memory limit.
> I have spent many hours fishing through our code for memory leaks with no 
> success. Even short-circuiting all our custom DoPut handling logic does not 
> alleviate the situation. This led me to create a reproducer that uses nothing 
> but PyArrow, and with it I see the server process memory only increasing, 
> similar to what we see on our servers.
> The reproducer is in the attachments + I included the test CSV file (20MB) 
> that I use for my tests. A few notes:
>  * The client code has multiple threads, each emulating a separate Flight 
> Client
>  * There are two variants where I see slightly different memory usage 
> characteristics:
>  ** _do_put_with_client_reuse << one client opened at the start of the 
> thread, then hammering many puts, finally closing the client; leaks appear 
> to happen faster in this variant
>  ** _do_put_with_client_per_request << the client opens & connects, does a 
> put, then disconnects; it loops like this many times; leaks appear to happen 
> slower in this variant if there are fewer concurrent clients; increasing the 
> number of threads 'helps'
>  * The server code handling do_put reads batch-by-batch & does nothing with 
> the chunks (sketched below)
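> For reference, a minimal sketch of the shape of that no-op handler 
> (illustrative only, not the attached leak_repro_server.py):
> {code:python}
> import pyarrow.flight as flight
>
> class NoOpServer(flight.FlightServerBase):
>     def do_put(self, context, descriptor, reader, writer):
>         # Read the uploaded stream chunk by chunk and discard each chunk.
>         while True:
>             try:
>                 reader.read_chunk()
>             except StopIteration:
>                 break
>
> if __name__ == "__main__":
>     NoOpServer("grpc://0.0.0.0:8815").serve()
> {code}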
> Also, one interesting (but very likely unrelated) thing that I keep noticing 
> is that _sometimes_ the FlightClient takes a long time to close (like 5 
> seconds). It happens intermittently.


