[ https://issues.apache.org/jira/browse/TS-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289206#comment-14289206 ]
Feifei Cai edited comment on TS-2497 at 1/23/15 12:54 PM:
----------------------------------------------------------

We noticed a memory leak on our production hosts. It appears to be related to handling a 5xx response from the origin server. The dump info below is from one host serving ~70% POST requests; I enabled the memory dump ({{proxy.config.dump_mem_info_frequency}}) and resource tracking ({{proxy.config.res_track_memory}}).

*traffic.out:*
{noformat}
     allocated      |       in-use       | type size  | free list name
--------------------|--------------------|------------|----------------------------------
                  0 |                  0 |    2097152 | memory/ioBufAllocator[14]
                  0 |                  0 |    1048576 | memory/ioBufAllocator[13]
                  0 |                  0 |     524288 | memory/ioBufAllocator[12]
                  0 |                  0 |     262144 | memory/ioBufAllocator[11]
                  0 |                  0 |     131072 | memory/ioBufAllocator[10]
                  0 |                  0 |      65536 | memory/ioBufAllocator[9]
         1266679808 |         1262354432 |      32768 | memory/ioBufAllocator[8]
          600309760 |          599703552 |      16384 | memory/ioBufAllocator[7]
          395051008 |          391086080 |       8192 | memory/ioBufAllocator[6]
          229113856 |          224432128 |       4096 | memory/ioBufAllocator[5]
          342622208 |          342503424 |       2048 | memory/ioBufAllocator[4]
          245104640 |          245042176 |       1024 | memory/ioBufAllocator[3]
            2228224 |            2176512 |        512 | memory/ioBufAllocator[2]
             622592 |             607232 |        256 | memory/ioBufAllocator[1]
            2375680 |            2370176 |        128 | memory/ioBufAllocator[0]

                     Location                       |     Size In-use
----------------------------------------------------+------------------------
 memory/IOBuffer/ProtocolProbeSessionAccept.cc:39   |         66768896
 memory/IOBuffer/HttpClientSession.cc:230           |                0
 memory/IOBuffer/HttpSM.cc:3314                     |                0
 memory/IOBuffer/HttpSM.cc:5349                     |       3003506816
 memory/IOBuffer/HttpSM.cc:5668                     |                0
 memory/IOBuffer/HttpSM.cc:5874                     |                0
 memory/IOBuffer/HttpSM.cc:5976                     |                0
 memory/IOBuffer/HttpSM.cc:6267                     |                0
 memory/IOBuffer/HttpServerSession.cc:87            |                0
 memory/IOBuffer/HttpTunnel.cc:95                   |                0
 memory/IOBuffer/HttpTunnel.cc:100                  |                0
 TOTAL                                              |       3070275712
{noformat}

I took [~shaunmcginnity]'s node.js scripts as a reference, made some changes, and reproduced the memory leak in my local environment (a minimal sketch of both scripts is included below):
# [origin-server.js|https://issues.apache.org/jira/secure/attachment/12694145/origin-server.js] The origin server responds with a 503 as soon as it has received more than a single byte, so in most cases the POST does not complete. I changed [~shaunmcginnity]'s code so that the origin server sends a response back to ATS, which makes ATS hit a different code path.
# [client.js|https://issues.apache.org/jira/secure/attachment/12694146/client.js] We create a new client every second, and each client tries to POST 32K bytes of data.
# ats
*remap.config*: remap everything to local port 5000
{quote}map / http://127.0.0.1:5000{quote}
*records.config*: listen on port 80
{quote}CONFIG proxy.config.http.server_ports STRING 80{quote}

We then get the dump info below, and the in-use count of the MIOBuffer with index=8 (size=32K) increases by 1 per second.
{noformat}
     allocated      |       in-use       | type size  | free list name
--------------------|--------------------|------------|----------------------------------
            1048576 |              32768 |      32768 | memory/ioBufAllocator[8]
{noformat}

We can also change the "Content-Length" in client.js to a smaller size, in which case the MIOBuffer at the corresponding index (0-7) increases instead.
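For reference, here is a minimal sketch of the two repro scripts, written from the behavior described above (the attached origin-server.js and client.js may differ in detail; the port, payload size, and 503 behavior simply follow the setup above).

origin-server.js (sketch):
{code}
// Reply 503 as soon as more than one byte of the POST body has arrived,
// without consuming the rest of the body.
var http = require('http');

http.createServer(function (req, res) {
  var received = 0;
  var responded = false;
  req.on('data', function (chunk) {
    received += chunk.length;
    if (received > 1 && !responded) {
      responded = true;
      res.writeHead(503, { 'Connection': 'close' });
      res.end('Service Unavailable');
    }
  });
}).listen(5000);
{code}

client.js (sketch):
{code}
// Start a new client every second; each one POSTs a 32K body through ATS
// listening on port 80, which lands in ioBufAllocator[8].
var http = require('http');

var body = new Buffer(32 * 1024);
body.fill('a');

setInterval(function () {
  var req = http.request({
    host: '127.0.0.1',
    port: 80,
    method: 'POST',
    path: '/',
    headers: { 'Content-Length': body.length }
  }, function (res) {
    res.resume();                    // drain the (usually 503) response
  });
  req.on('error', function () {});   // ignore resets from the aborted POST
  req.write(body);
  req.end();
}, 1000);
{code}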
I added this simple patch to prevent the memory leak in the case above, similar to the last commit, and it is verified on one test host.

free.diff
{code}
diff --git a/proxy/http/HttpSM.cc b/proxy/http/HttpSM.cc
index 932ef97..123b97a 100644
--- a/proxy/http/HttpSM.cc
+++ b/proxy/http/HttpSM.cc
@@ -5074,6 +5074,7 @@ HttpSM::handle_post_failure()
   t_state.current.server->keep_alive = HTTP_NO_KEEPALIVE;
 
   if (server_buffer_reader->read_avail() > 0) {
+    tunnel.deallocate_buffers();
     tunnel.reset();
     // There's data from the server so try to read the header
     setup_server_read_response_header();
{code}

*traffic.out*
{noformat}
     allocated      |       in-use       | type size  | free list name
--------------------|--------------------|------------|----------------------------------
                  0 |                  0 |    2097152 | memory/ioBufAllocator[14]
                  0 |                  0 |    1048576 | memory/ioBufAllocator[13]
                  0 |                  0 |     524288 | memory/ioBufAllocator[12]
                  0 |                  0 |     262144 | memory/ioBufAllocator[11]
                  0 |                  0 |     131072 | memory/ioBufAllocator[10]
                  0 |                  0 |      65536 | memory/ioBufAllocator[9]
            6291456 |            1474560 |      32768 | memory/ioBufAllocator[8]
             524288 |              32768 |      16384 | memory/ioBufAllocator[7]
            4456448 |            1949696 |       8192 | memory/ioBufAllocator[6]
            6291456 |            2674688 |       4096 | memory/ioBufAllocator[5]
             262144 |               4096 |       2048 | memory/ioBufAllocator[4]
             131072 |               8192 |       1024 | memory/ioBufAllocator[3]
              65536 |                  0 |        512 | memory/ioBufAllocator[2]
              32768 |                512 |        256 | memory/ioBufAllocator[1]
              16384 |               1280 |        128 | memory/ioBufAllocator[0]
{noformat}

However, in my understanding this patch effectively reverts TS-2497, so it is not the right fix; it brings us back to the starting point, the potential crash issue. :(

I'm not yet clear about the root cause of the "prematurely freed buffer" issue. I think we cannot rely on calling {{HttpTunnel::deallocate_buffers()}} in {{HttpSM::kill_this()}} to free the {{post_buffer}} we allocated in {{HttpSM::do_setup_post_tunnel}}. In the test case above, the server does not receive all of the POST data but responds to ATS (mostly with a 5xx) instead of disconnecting, so we hit the {{if}} condition in {{HttpSM::handle_post_failure()}}. ATS then transfers the 5xx response back to the client: it resets the tunnel and reuses it, but it does not reuse the buffer; it allocates a new one when needed. So when HttpSM finally calls kill_this to clean up the tunnel, it frees the buffer allocated in {{HttpSM::setup_server_transfer()}}, not the {{post_buffer}} from {{HttpSM::do_setup_post_tunnel}}. We can enable {{proxy.config.res_track_memory}} for tracking, or attach a debugger and break on each call of {{HttpTunnel::deallocate_buffers()}} to look at the MIOBuffer's location.

BTW, TS-1425 is also related to this ticket; both are about handling a failed POST correctly.
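As a sketch of the debugging approach mentioned above (assuming a debug build and that the process is named traffic_server; adjust to your environment), one can break on every call of {{HttpTunnel::deallocate_buffers()}} and log where it is invoked from:

{noformat}
$ gdb -p $(pidof traffic_server)
(gdb) break HttpTunnel::deallocate_buffers
(gdb) commands
> backtrace 8
> continue
> end
(gdb) continue
{noformat}

If the backtraces only ever show the path through {{HttpSM::kill_this()}} after the tunnel has been reset and refilled by {{HttpSM::setup_server_transfer()}}, that matches the leak of {{post_buffer}} described above.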
> Failed post results in tunnel buffers being returned to freelist prematurely
> -----------------------------------------------------------------------------
>
>                 Key: TS-2497
>                 URL: https://issues.apache.org/jira/browse/TS-2497
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brian Geffon
>            Assignee: Brian Geffon
>             Fix For: 4.2.0
>
>         Attachments: TS-2497.patch, client.js, origin-server.js, repro.js
>
>
> When a post fails to an origin server, either because the server died or because it returned a response without reading all of the post data, TS will destroy buffers too early. This normally does not result in a crash, because the MIOBuffers are returned to the freelist; only with sufficient load will the race happen and cause a crash. Additionally, even if a crash doesn't happen, you might see data corruption across post requests from the buffers being used after they have been returned to the freelist.
> Thanks to Thomas Jackson for help reproducing and resolving this bug.
> An example stack trace, though we've seen other crashes in write_avail too:
> #0  0x00000000004eff14 in IOBufferBlock::read_avail (this=0x0) at ../iocore/eventsystem/I_IOBuffer.h:362
> #1  0x000000000050d151 in MIOBuffer::append_block_internal (this=0x2aab38001130, b=0x2aab0c037200) at ../iocore/eventsystem/P_IOBuffer.h:946
> #2  0x000000000050d39b in MIOBuffer::append_block (this=0x2aab38001130, asize_index=15) at ../iocore/eventsystem/P_IOBuffer.h:986
> #3  0x000000000050d49b in MIOBuffer::add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:994
> #4  0x000000000055cee2 in MIOBuffer::check_add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1002
> #5  0x000000000055d115 in MIOBuffer::write_avail (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1048
> #6  0x00000000006c18f3 in read_from_net (nh=0x2aaafca0d208, vc=0x2aab1c009140, thread=0x2aaafca0a010) at UnixNetVConnection.cc:234
> #7  0x00000000006c37bf in UnixNetVConnection::net_read_io (this=0x2aab1c009140, nh=0x2aaafca0d208, lthread=0x2aaafca0a010) at UnixNetVConnection.cc:816
> #8  0x00000000006be392 in NetHandler::mainNetEvent (this=0x2aaafca0d208, event=5, e=0x271d8e0) at UnixNet.cc:380
> #9  0x00000000004f05c4 in Continuation::handleEvent (this=0x2aaafca0d208, event=5, data=0x271d8e0) at ../iocore/eventsystem/I_Continuation.h:146
> #10 0x00000000006e361e in EThread::process_event (this=0x2aaafca0a010, e=0x271d8e0, calling_code=5) at UnixEThread.cc:142
> #11 0x00000000006e3b13 in EThread::execute (this=0x2aaafca0a010) at UnixEThread.cc:264
> #12 0x00000000006e290b in spawn_thread_internal (a=0x2716400) at Thread.cc:88
> #13 0x0000003372c077e1 in start_thread () from /lib64/libpthread.so.0
> #14 0x00000033728e68ed in clone () from /lib64/libc.so.6

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)