[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaoli updated MAPREDUCE-7329:
------------------------------
    Description: 
Recently we upgraded the Linux kernel from 3.x to 4.x and found that Hadoop 
Pipes tasks exit with a connect timeout raised by the ping check, which is 
implemented by PingThread in HadoopPipes.cc.

!image-2021-03-15-14-37-32-184.png!
 After deeper investigation, we found that the current ping server never 
accepts the sockets created by the ping client, which causes several hidden 
problems: 
 # The TCP accept queue fills up (*default 50*).
 # When the client closes its socket, the server side never calls close, which 
leaves many socket fds in CLOSE_WAIT (*default 2h*), and the accept queue is 
never cleared.
 # Even worse, on 4.x Linux kernels a full accept queue causes TCP to drop 
packets directly, so the ping client's connect times out. On 3.x Linux 
kernels, when the accept queue is full, the client can still establish 
half-open connections until the SYN queue is full (*default 2048*), so from the 
client side the ping keeps working until the SYN queue is full. After about 3 
hours, the task also exits with a connect timeout exception.

To fix this problem, we introduced a *PingSocketCleaner* thread, which 
continuously accepts the socket connections made by the ping client. When the 
client closes a socket, the cleaner thread detects the close through a read on 
the socket's inputStream and then closes the socket on the server side.
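
The accept, read-until-EOF, close loop described above can be sketched roughly 
as follows. This is a minimal illustration and not the code from the attached 
patch: the class name PingSocketCleanerSketch, the closedCount counter and the 
runDemo helper are hypothetical names introduced only for this example, and the 
backlog of 50 simply mirrors the default mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a cleaner thread; not the class from the patch. */
class PingSocketCleanerSketch extends Thread {
    private final ServerSocket serverSocket;
    // Counts connections that were accepted and then closed on the server side.
    private final AtomicInteger closedCount = new AtomicInteger();

    PingSocketCleanerSketch(ServerSocket serverSocket) {
        super("ping-socket-cleaner-sketch");
        this.serverSocket = serverSocket;
        setDaemon(true);
    }

    int closedCount() {
        return closedCount.get();
    }

    @Override
    public void run() {
        while (!serverSocket.isClosed()) {
            final Socket client;
            try {
                // accept() takes the connection out of the kernel accept queue.
                client = serverSocket.accept();
            } catch (IOException e) {
                // Usually means the server socket was closed; the loop
                // condition then ends the thread.
                continue;
            }
            try (Socket s = client; InputStream in = s.getInputStream()) {
                // read() returns -1 once the client has closed its end.
                while (in.read() != -1) {
                    // Ping connections carry no payload we care about.
                }
            } catch (IOException e) {
                // Connection reset by the client; try-with-resources has
                // still closed the server-side socket.
            }
            closedCount.incrementAndGet();
        }
    }

    /** Opens and closes 'pings' client connections, returns how many were cleaned. */
    static int runDemo(int pings) throws IOException, InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            PingSocketCleanerSketch cleaner = new PingSocketCleanerSketch(server);
            cleaner.start();
            for (int i = 0; i < pings; i++) {
                // Mimic a ping: connect, then close immediately.
                new Socket(loopback, server.getLocalPort()).close();
            }
            // Wait (at most 10 seconds) for the cleaner to catch up.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
            while (cleaner.closedCount() < pings && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            return cleaner.closedCount();
        }
    }

    public static void main(String[] args) throws Exception {
        // More pings than the backlog, to show the queue keeps draining.
        System.out.println("cleaned connections: " + runDemo(60));
    }
}
```

The sketch handles one connection at a time, which is enough when each ping 
client connects and closes right away. A client that held its connection open 
would block the loop, so a real implementation may need a read timeout or a 
separate handler per connection.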

Referenced Linux kernel patch: 
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ea8ea2cb7]

 



> HadoopPipes task may fail when linux kernel version change from 3.x to 4.x
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: chaoli
>            Priority: Major
>              Labels: patch, pull-request-available
>         Attachments: 
> 0001-MAPREDUCE-7329-HadoopPipes-task-may-fail-when-linux-.patch, 
> image-2021-03-15-14-29-49-475.png, image-2021-03-15-14-37-32-184.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h


