chaoli created MAPREDUCE-7329:
---------------------------------
Summary: HadoopPipes task may fail when linux kernel version
change from 3.x to 4.x
Key: MAPREDUCE-7329
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7329
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: chaoli
Attachments: image-2021-03-15-14-29-49-475.png,
image-2021-03-15-14-37-32-184.png
Recently, we upgrade linux kernel version from 3.x to 4.x. And we find hadoop
pipe task exit with connect timeout which is implemented by PingThread in
HadoopPipes.cc.
!image-2021-03-15-14-37-32-184.png!
After a deep research, we finally find that current ping server won't accept
ping client created socket, which has hidden danger:
# it will cause tcp accept queue full(*default 50*)
# when client close socket, server socket won't call close method, which will
leave too many CLOSE_WAIT socket fd existed(*default 2h*), and accept queue
never cleared.
# Even worse, in 4.x linux kernel version, it will cause tcp drop packet
directly which makes ping client connect time out. While In 3.x linux kernel
version, when accept queue full, client can also make half connection till sync
queue full (*default 2048*), so from client side, ping will aslo work till sync
queue full. And after 3 hours, task will also exit with connect timeout
exception.
To fix this problem, we introduced a *PingSocketCleaner* thread, which will
continuously accept ping socket connect from ping client. When socket close
from client, cleaner thread will detected by inputStream read, then will
finally close socket from sever side.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]