[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203015#comment-17203015
 ] 

Xiaoqiao He commented on HDFS-14997:
------------------------------------

[~Captainhzy][~sodonnell] Thanks for your comments. 
BPOfferService#mReadWriteLock is indeed another contention point. In my 
experience, processing commands asynchronously mitigates most of these 
datanode issues, but not all of them. If you can share any datanode traces, 
that would help us dig into this case further.
{quote}I have an idea. It can put the `updateActorStatesFromHeartbeat` function 
to `CommandProcessingThread`. In this case, it will not block heartbeat due to 
`writeLock`.{quote}
Would you like to file another JIRA and try to submit a patch? Thanks.

> BPServiceActor processes commands from NameNode asynchronously
> --------------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main flow: reporting 
> (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If 
> #processCommand takes a long time, it blocks the reporting flow. 
> #processCommand can take a long time (over 1000s in the worst case I have 
> seen) when the IO load of the DataNode is very high: some IO operations run 
> under #datasetLock, so processing some commands (such as #DNA_INVALIDATE) 
> has to wait a long time to acquire #datasetLock. In that case, heartbeats 
> are not sent to the NameNode in time, which triggers other failures.
> I propose making #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be an effective solution, but some old branches do not 
> support this feature.
> 2. IO operations under #datasetLock are a separate issue; I think we should 
> solve it in another JIRA.
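
As a rough illustration of the proposal in the description (this is not the 
code in the attached patches), the heartbeat loop can hand each command to a 
queue that a separate thread drains, so a command stuck waiting on a lock 
does not delay the next heartbeat. The names AsyncCommandSketch, 
CommandProcessor and simulate are hypothetical, and the latch only stands in 
for a contended lock such as #datasetLock.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCommandSketch {

  /** Hypothetical stand-in for a dedicated command-processing thread. */
  static final class CommandProcessor extends Thread {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    CommandProcessor() {
      setName("command-processor");
      setDaemon(true);
    }

    /** Called from the heartbeat loop; only enqueues, never runs the command. */
    void enqueue(Runnable command) {
      queue.add(command);
    }

    @Override
    public void run() {
      try {
        while (!isInterrupted()) {
          Runnable command = queue.take(); // blocks until a command arrives
          try {
            command.run();
          } catch (RuntimeException e) {
            // One failing command must not kill the processing thread.
            System.err.println("command failed: " + e);
          }
        }
      } catch (InterruptedException e) {
        // Interrupted while waiting for work: treat as shutdown.
      }
    }
  }

  /**
   * Sends simulated heartbeats while every command is blocked (as if waiting
   * for a contended lock), then releases the commands.
   * Returns {heartbeats sent, commands done before release, commands done after}.
   */
  static int[] simulate(int heartbeats) {
    CountDownLatch lockReleased = new CountDownLatch(1);
    CountDownLatch allDone = new CountDownLatch(heartbeats);
    AtomicInteger processed = new AtomicInteger();
    CommandProcessor processor = new CommandProcessor();
    processor.start();

    int sent = 0;
    for (int i = 0; i < heartbeats; i++) {
      // Each simulated heartbeat response carries one slow command.
      processor.enqueue(() -> {
        try {
          lockReleased.await(); // simulated wait for the lock
          processed.incrementAndGet();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          allDone.countDown();
        }
      });
      sent++; // the loop continues without waiting for the command
    }

    int doneBeforeRelease = processed.get(); // still 0: commands are blocked
    lockReleased.countDown();
    try {
      if (!allDone.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("commands did not finish");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    processor.interrupt();
    return new int[] {sent, doneBeforeRelease, processed.get()};
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(simulate(3))); // prints [3, 0, 3]
  }
}
```

All three heartbeats go out while no command has completed, and the commands 
finish later on the other thread. A single consumer thread keeps commands in 
arrival order, which is one reason to prefer it over a thread pool here.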



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
