[ https://issues.apache.org/jira/browse/ARROW-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550429#comment-17550429 ]
David Li commented on ARROW-16697:
----------------------------------

Interesting. Thanks for sharing your findings! If you're all set we can close this now. I should find a place to start collecting pitfalls like this…

> [FlightRPC][Python] Server seems to leak memory during DoPut
> ------------------------------------------------------------
>
>                 Key: ARROW-16697
>                 URL: https://issues.apache.org/jira/browse/ARROW-16697
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Lubo Slivka
>            Assignee: David Li
>            Priority: Major
>         Attachments: leak_repro_client.py, leak_repro_server.py, massif.txt,
> massif_client.txt, sample.csv.gz
>
>
> Hello,
> We are stress testing our Flight RPC server (PyArrow 8.0.0) with write-heavy
> workloads and are running into what appear to be memory leaks.
> The server is under pressure from a number of separate clients doing DoPut.
> What we are seeing is that the server's memory usage only ever goes up until
> the server finally gets whacked by k8s for hitting its memory limit.
> I have spent many hours fishing through our code for memory leaks with no
> success. Even short-circuiting all our custom DoPut handling logic does not
> alleviate the situation. This led me to create a reproducer that uses nothing
> but PyArrow, and I see the server process memory only increasing, similar to
> what we see on our servers.
> The reproducer is in the attachments, and I included the test CSV file (20MB)
> that I use for my tests.
> A few notes:
> * The client code has multiple threads, each emulating a separate Flight
> Client
> * There are two variants where I see slightly different memory usage
> characteristics:
> ** _do_put_with_client_reuse << one client opened at start of thread, then
> hammering many puts, finally closing the client; leaks appear to happen
> faster in this variant
> ** _do_put_with_client_per_request << client opens & connects, does put,
> then disconnects; loop like this many times; leaks appear to happen slower in
> this variant if there are fewer concurrent clients; increasing the number of
> threads 'helps'
> * The server code handling do_put reads batch-by-batch & does nothing with
> the chunks
> Also, one interesting (but highly likely unrelated) thing that I keep noticing
> is that _sometimes_ FlightClient takes a long time to close (like 5 seconds).
> It happens intermittently.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)