[
https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Suresh Srinivas updated HADOOP-4584:
------------------------------------
Attachment: 4584.patch
The current loop in {{Datanode.OfferService()}} performs the following steps:
1. If the next heartbeat interval has been reached, call {{sendHeartbeat}} and
process the {{DatanodeCommand}} returned by the namenode.
2. If a block has been received, send a {{blockReceived}} request to the namenode.
3. If the next blockreport interval has been reached, build and send a
{{blockReport}}, then process the {{DatanodeCommand}} returned by the namenode.
4. Wait until the next heartbeat interval or until another block is received.
5. Go back to step 1.
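To illustrate why a single loop is a problem, here is a minimal simulation of the steps above using a simulated clock. It is only a sketch: the class, method, and constant names are hypothetical and not taken from the Hadoop source, the blockreport interval is an assumed value, and step 2 is left out because it does not affect heartbeat timing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the single-threaded loop; not Hadoop code.
class SingleLoopSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;      // 3 seconds, as in the issue description
    static final long BLOCK_REPORT_INTERVAL_MS = 60_000;  // assumed value, for illustration only

    /** Returns the simulated times (ms) at which heartbeats are sent. */
    static List<Long> heartbeatTimes(long blockReportCostMs, long runForMs) {
        List<Long> sent = new ArrayList<>();
        long now = 0;
        long lastHeartbeat = -HEARTBEAT_INTERVAL_MS; // so the first heartbeat goes out at time 0
        long lastBlockReport = 0;
        while (now < runForMs) {
            // Step 1: send a heartbeat if the interval has elapsed.
            if (now - lastHeartbeat >= HEARTBEAT_INTERVAL_MS) {
                sent.add(now);
                lastHeartbeat = now;
            }
            // Step 3: build and send the block report if it is due.
            // The loop is blocked for the whole time the report takes.
            if (now - lastBlockReport >= BLOCK_REPORT_INTERVAL_MS) {
                now += blockReportCostMs;
                lastBlockReport = now;
            }
            // Step 4: wait until the next heartbeat is due.
            long wait = lastHeartbeat + HEARTBEAT_INTERVAL_MS - now;
            if (wait > 0) {
                now += wait;
            }
        }
        return sent;
    }

    /** Largest gap (ms) between two consecutive heartbeats. */
    static long maxGap(List<Long> times) {
        long max = 0;
        for (int i = 1; i < times.size(); i++) {
            max = Math.max(max, times.get(i) - times.get(i - 1));
        }
        return max;
    }
}
```

In this simulation, a block report that costs nothing keeps the largest gap at the 3-second interval, while {{heartbeatTimes(120_000, 200_000)}} leaves a 120-second gap between two heartbeats, because no heartbeat can be sent while the report is being built.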
With the changes, there are two threads.
Heartbeat thread:
1. A new thread sends the heartbeat and receives a {{DatanodeCommand}} in
response. It adds the command to a queue backed by an {{ArrayList}}.
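The heartbeat thread described above could look roughly like the following sketch. It is not the code in the patch: {{HeartbeatSketch}}, {{heartbeatOnce}}, and {{drainCommands}} are hypothetical names, the namenode RPC is replaced by a {{Supplier}} that returns a command string or {{null}}, and the list is guarded by {{synchronized}} because a plain {{ArrayList}} is not safe to share between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a heartbeat thread that queues namenode commands.
class HeartbeatSketch {
    private final List<String> commands = new ArrayList<>(); // guarded by "this"
    private final Supplier<String> sendHeartbeat;             // stand-in for the namenode RPC
    private final long intervalMs;
    private volatile Thread thread;

    HeartbeatSketch(Supplier<String> sendHeartbeat, long intervalMs) {
        this.sendHeartbeat = sendHeartbeat;
        this.intervalMs = intervalMs;
    }

    /** Sends one heartbeat and queues the command returned, if any. */
    void heartbeatOnce() {
        String cmd = sendHeartbeat.get();
        if (cmd != null) {
            synchronized (this) {
                commands.add(cmd);
                notifyAll(); // wake a main thread that may be waiting on this object
            }
        }
    }

    /** Starts a daemon thread that sends a heartbeat every intervalMs. */
    void start() {
        thread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                heartbeatOnce();
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return; // stop() was called
                }
            }
        }, "heartbeat");
        thread.setDaemon(true);
        thread.start();
    }

    void stop() {
        Thread t = thread;
        if (t != null) {
            t.interrupt();
        }
    }

    /** Removes and returns all queued commands, for the main thread to process. */
    synchronized List<String> drainCommands() {
        List<String> out = new ArrayList<>(commands);
        commands.clear();
        return out;
    }
}
```

Because the thread only sends the heartbeat and queues the reply, its timing does not depend on how long the main thread spends building a block report or executing commands.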
The main thread does the following, without the previous heartbeat functionality:
1. If there are commands in the queue, process all of them.
2. If a block has been received, send a {{blockReceived}} request to the namenode.
3. If the next blockreport interval has been reached, build and send a
{{blockReport}}, then process the {{DatanodeCommand}} returned by the namenode.
4. If there are no received blocks or commands to process, wait for 1 second or
until another block is received.
5. Go back to step 1.
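One pass of the main-thread loop above can be sketched as follows. All names here are hypothetical, and the namenode calls are replaced by entries appended to a log so that the order of the steps is visible. As an assumption of this sketch, and not necessarily the behaviour of the patch, the wait in step 4 is woken by either a new command or a newly received block.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the main thread's loop body; not Hadoop code.
class MainLoopSketch {
    private final Queue<String> commands = new ArrayDeque<>();       // filled by the heartbeat thread
    private final Queue<String> receivedBlocks = new ArrayDeque<>(); // filled by block receivers
    final List<String> log = new ArrayList<>();                      // records what the loop did

    synchronized void enqueueCommand(String command) {
        commands.add(command);
        notifyAll();
    }

    synchronized void blockReceived(String block) {
        receivedBlocks.add(block);
        notifyAll();
    }

    /** Runs steps 1 to 4 once; returns true if any work was done. */
    boolean runOnce(boolean blockReportDue, long maxWaitMs) {
        List<String> cmds;
        List<String> blocks;
        synchronized (this) {
            // Take everything queued so far, then release the lock before doing work.
            cmds = new ArrayList<>(commands);
            commands.clear();
            blocks = new ArrayList<>(receivedBlocks);
            receivedBlocks.clear();
        }
        for (String c : cmds) {
            log.add("process " + c);            // step 1: process all queued commands
        }
        for (String b : blocks) {
            log.add("blockReceived " + b);      // step 2: report received blocks
        }
        if (blockReportDue) {
            log.add("blockReport");             // step 3: the command in the reply would be processed here
        }
        boolean worked = !cmds.isEmpty() || !blocks.isEmpty() || blockReportDue;
        if (!worked) {
            synchronized (this) {
                // Step 4: nothing to do, so wait up to maxWaitMs (1000 in the description).
                if (commands.isEmpty() && receivedBlocks.isEmpty()) {
                    try {
                        wait(maxWaitMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }
        return worked;
    }
}
```

A caller would invoke {{runOnce}} in a loop (step 5), passing {{true}} for {{blockReportDue}} whenever the blockreport interval has elapsed.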
Questions:
1. In step 4, should we wait for a command to arrive or for another block to be
received?
2. In OfferService, we process all the commands in the queue at once.
Do you see any issues with that?
> Slow generation of blockReport at DataNode causes delay of sending heartbeat
> to NameNode
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-4584
> URL: https://issues.apache.org/jira/browse/HADOOP-4584
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Reporter: Hairong Kuang
> Assignee: Suresh Srinivas
> Fix For: 0.20.0
>
> Attachments: 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens
> of minutes to generate a block report. This prevents the datanode from
> sending a heartbeat to the NameNode every 3 seconds. In the worst case, the
> NameNode detects a lost heartbeat and wrongly decides that the datanode is
> dead.
> It would be nice to have two threads instead. One thread scans the data
> directories, generates the block report, and executes the requests sent by
> the NameNode; the other thread sends heartbeats and block reports and
> picks up the requests from the NameNode. With these two threads, the
> sending of heartbeats will not be delayed by a slow block report or slow
> execution of NameNode requests.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.