[jira] [Commented] (HDFS-14622) [Dynamometer] State transition err when CCM( HDFS Centralized Cache Management) feature is used

2019-07-09 Thread TanYuxin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881197#comment-16881197
 ] 

TanYuxin commented on HDFS-14622:
-

Thanks [~xkrogen] and [~hexiaoqiao] very much. I think the patch not only 
fixes the issue caused by the Centralized Cache Management feature, but also 
improves compatibility with any other fsimage XML section generated by other 
features. Thanks again.
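
For illustration, here is a minimal sketch of the kind of tolerant handling I 
mean (assumptions: this is not the actual patch, and the class name and 
filtering approach are made up for the example). The idea is simply to skip 
any fsimage XML section the block generator does not recognize, such as the 
CacheManagerSection emitted when CCM is enabled, instead of feeding its lines 
to the state machine:
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-filter in front of the real XMLParser state machine.
public class TolerantXmlLineFilter {
  // Sections the block generator actually needs; everything else is skipped.
  private static final Set<String> KNOWN_SECTIONS =
      new HashSet<>(Arrays.asList("INodeSection"));

  private String currentSection;

  /** Returns true if this fsimage XML line should reach the real parser. */
  public boolean accept(String line) {
    String trimmed = line.trim();
    if (trimmed.startsWith("</") && trimmed.endsWith("Section>")) {
      currentSection = null;  // leaving the current section
      return false;
    }
    if (trimmed.startsWith("<") && trimmed.endsWith("Section>")) {
      currentSection = trimmed.substring(1, trimmed.length() - 1);  // entering
    }
    // Lines from unknown sections (e.g. CacheManagerSection <directive>
    // entries) never reach the state machine, so no illegal state transition.
    return KNOWN_SECTIONS.contains(currentSection);
  }
}
{code}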

> [Dynamometer] State transition err when CCM( HDFS Centralized Cache 
> Management) feature is used
> ---
>
> Key: HDFS-14622
> URL: https://issues.apache.org/jira/browse/HDFS-14622
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
>Reporter: TanYuxin
>Assignee: Erik Krogen
>Priority: Major
> Attachments: HDFS-14622.000.patch
>
>
> When we used Dynamometer to test HDFS performance, the test encountered an 
> error while generating DataNode block info, and the generation process failed. 
> The error stack is:
> {code:java}
> Error: java.io.IOException: State transition not allowed; from DEFAULT to 
> FILE_WITH_REPLICATION at 
> com.linkedin.dynamometer.blockgenerator.XMLParser.transitionTo(XMLParser.java:107)
>  at 
> com.linkedin.dynamometer.blockgenerator.XMLParser.parseLine(XMLParser.java:77)
>  at 
> com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:53)
>  at 
> com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:26)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:151) at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:828) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
> org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:415) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {code}
> After checking the fsimage XML and the source code, I found that *XMLParser* 
> cannot correctly parse lines like the following:
>  
> {code:java}
> <directive><id>8963</id><path>/user/somepath/path1</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544454142310</millis><relative>false</relative></expiration></directive>
> <directive><id>8964</id><path>/user/somepath/path2</path><replication>3</replication><pool>cache_hadoop-data_pool</pool><expiration><millis>1544497817686</millis><relative>false</relative></expiration></directive>
> <directive><id>8965</id><path>/user/somepath/path3</path><replication>3</replication><pool>cache_hadoop-peisong_pool</pool><expiration><millis>1544451500312</millis><relative>false</relative></expiration></directive>
> <directive><id>8967</id><path>/user/somepath/path4</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544497602570</millis><relative>false</relative></expiration></directive>
> {code}
>  
> These fsimage xml lines are generated when [HDFS Centralized Cache Management 
> (CCM)|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html]
>  feature is used.
> I had a discussion with [~xkrogen] 
> [here|https://github.com/linkedin/dynamometer/pull/77], and the patches 
> provided there can fix the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14622) State transition err when CCM( HDFS Centralized Cache Management) feature is used

2019-06-30 Thread TanYuxin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-14622:

Description: 
When we used Dynamometer to test HDFS performance, the test encountered an error 
while generating DataNode block info, and the generation process failed.  The 
error stack is:
{code:java}
Error: java.io.IOException: State transition not allowed; from DEFAULT to 
FILE_WITH_REPLICATION at 
com.linkedin.dynamometer.blockgenerator.XMLParser.transitionTo(XMLParser.java:107)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParser.parseLine(XMLParser.java:77) 
at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:53)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:26)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:151) at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:828) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}
After checking the fsimage XML and the source code, I found that *XMLParser* 
cannot correctly parse lines like the following:

 
{code:java}
<directive><id>8963</id><path>/user/somepath/path1</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544454142310</millis><relative>false</relative></expiration></directive>
<directive><id>8964</id><path>/user/somepath/path2</path><replication>3</replication><pool>cache_hadoop-data_pool</pool><expiration><millis>1544497817686</millis><relative>false</relative></expiration></directive>
<directive><id>8965</id><path>/user/somepath/path3</path><replication>3</replication><pool>cache_hadoop-peisong_pool</pool><expiration><millis>1544451500312</millis><relative>false</relative></expiration></directive>
<directive><id>8967</id><path>/user/somepath/path4</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544497602570</millis><relative>false</relative></expiration></directive>
{code}
 

These fsimage xml lines are generated when [HDFS Centralized Cache Management 
(CCM)|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html]
 feature is used.

I had a discussion with [~xkrogen] 
[here|https://github.com/linkedin/dynamometer/pull/77], and the patches 
provided there can fix the issue.

  was:
When we used Dynamometer to test HDFS performance, the test encountered an error 
while generating DataNode block info, and the generation process failed.  The 
error stack is:
{code:java}
Error: java.io.IOException: State transition not allowed; from DEFAULT to 
FILE_WITH_REPLICATION at 
com.linkedin.dynamometer.blockgenerator.XMLParser.transitionTo(XMLParser.java:107)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParser.parseLine(XMLParser.java:77) 
at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:53)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:26)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:151) at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:828) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}
After checking the fsimage XML and the source code, I found that *XMLParser* 
cannot correctly parse lines like the following:

 
{code:java}
<directive><id>8963</id><path>/user/somepath/path1</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544454142310</millis><relative>false</relative></expiration></directive>
<directive><id>8964</id><path>/user/somepath/path2</path><replication>3</replication><pool>cache_hadoop-data_pool</pool><expiration><millis>1544497817686</millis><relative>false</relative></expiration></directive>
<directive><id>8965</id><path>/user/somepath/path3</path><replication>3</replication><pool>cache_hadoop-peisong_pool</pool><expiration><millis>1544451500312</millis><relative>false</relative></expiration></directive>
<directive><id>8967</id><path>/user/somepath/path4</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544497602570</millis><relative>false</relative></expiration></directive>
{code}
 

These fsimage xml lines are generated when [HDFS Centralized Cache Management 
(CCM)|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html]
 feature is used.

I had a discussion with [~xkrogen] 
[here|https://github.com/linkedin/dynamometer/pull/77], and the patches 
provided there can fix the issue.


> State transition err when CCM( HDFS Centralized Cache Management) feature is 
> used
> -
>
> Key: HDFS-14622
> URL: https://issues.apache.org/jira/browse/HDFS-14622
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
>Reporter: TanYuxin
>Priority: Major
>
> When we used Dynamometer to test HDFS performance, the test encountered an 
> error while generating DataNode block info, and the generation process failed. 
> The error stack is:
> {code:java}
> Error: java.io.IOException: State 

[jira] [Created] (HDFS-14622) State transition err when CCM( HDFS Centralized Cache Management) feature is used

2019-06-30 Thread TanYuxin (JIRA)
TanYuxin created HDFS-14622:
---

 Summary: State transition err when CCM( HDFS Centralized Cache 
Management) feature is used
 Key: HDFS-14622
 URL: https://issues.apache.org/jira/browse/HDFS-14622
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Reporter: TanYuxin


When we used Dynamometer to test HDFS performance, the test encountered an error 
while generating DataNode block info, and the generation process failed.  The 
error stack is:
{code:java}
Error: java.io.IOException: State transition not allowed; from DEFAULT to 
FILE_WITH_REPLICATION at 
com.linkedin.dynamometer.blockgenerator.XMLParser.transitionTo(XMLParser.java:107)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParser.parseLine(XMLParser.java:77) 
at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:53)
 at 
com.linkedin.dynamometer.blockgenerator.XMLParserMapper.map(XMLParserMapper.java:26)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:151) at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:828) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1690)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}
After checking the fsimage XML and the source code, I found that *XMLParser* 
cannot correctly parse lines like the following:

 
{code:java}
<directive><id>8963</id><path>/user/somepath/path1</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544454142310</millis><relative>false</relative></expiration></directive>
<directive><id>8964</id><path>/user/somepath/path2</path><replication>3</replication><pool>cache_hadoop-data_pool</pool><expiration><millis>1544497817686</millis><relative>false</relative></expiration></directive>
<directive><id>8965</id><path>/user/somepath/path3</path><replication>3</replication><pool>cache_hadoop-peisong_pool</pool><expiration><millis>1544451500312</millis><relative>false</relative></expiration></directive>
<directive><id>8967</id><path>/user/somepath/path4</path><replication>3</replication><pool>cache_other_pool</pool><expiration><millis>1544497602570</millis><relative>false</relative></expiration></directive>
{code}
 

These fsimage xml lines are generated when [HDFS Centralized Cache Management 
(CCM)|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html]
 feature is used.

I had a discussion with [~xkrogen] 
[here|https://github.com/linkedin/dynamometer/pull/77], and the patches 
provided there can fix the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-05-30 Thread TanYuxin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852590#comment-16852590
 ] 

TanYuxin commented on HDFS-14090:
-

[~crh] Thanks. It's a great feature; looking forward to seeing it resolved.

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, RBF_ Isolation 
> design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact on clients connecting to healthy clusters vs. 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking RPC queue times, due to back pressure the 
> same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same RPC 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single RPC queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop/reject/error out requests 
> after a certain threshold. 
> This won't be a simple change, as the router's 'Server' layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14090) RBF: Improved isolation for downstream name nodes.

2019-05-28 Thread TanYuxin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849440#comment-16849440
 ] 

TanYuxin commented on HDFS-14090:
-

[~crh], hi, excuse me, I have a small question about the patch. I found that 
the method RouterRpcClient#acquirePermit() is only called in the 
invokeConcurrent() method, and I am confused about why acquirePermit() is not 
called in invokeSequential().

Did I miss something or make a mistake? Could you please take a look and 
correct me if I am wrong? Thank you very much. :D
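
For context, the concept behind acquirePermit(), as I understand it, can be 
sketched with one Semaphore per downstream nameservice (illustrative only, not 
the actual RouterRpcClient code; the class and method names here are invented):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical per-nameservice limiter modeling acquirePermit()/releasePermit().
public class PerNameserviceLimiter {
  private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
  private final int permitsPerNameservice;

  public PerNameserviceLimiter(int permitsPerNameservice) {
    this.permitsPerNameservice = permitsPerNameservice;
  }

  /** Blocks until a handler slot for this nameservice is free. */
  public void acquirePermit(String nsId) throws InterruptedException {
    permits.computeIfAbsent(nsId, k -> new Semaphore(permitsPerNameservice))
        .acquire();
  }

  /** Frees the slot once the downstream call has completed. */
  public void releasePermit(String nsId) {
    Semaphore s = permits.get(nsId);
    if (s != null) {
      s.release();
    }
  }
}
{code}
With such a guard, I would expect invokeSequential() to need the same 
acquire/release around its downstream call as invokeConcurrent(), which is why 
I am asking.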

> RBF: Improved isolation for downstream name nodes.
> --
>
> Key: HDFS-14090
> URL: https://issues.apache.org/jira/browse/HDFS-14090
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Attachments: HDFS-14090-HDFS-13891.001.patch, RBF_ Isolation 
> design.pdf
>
>
> Router is a gateway to underlying name nodes. Gateway architectures should 
> help minimize the impact on clients connecting to healthy clusters vs. 
> unhealthy clusters.
> For example, if there are 2 name nodes downstream, and one of them is 
> heavily loaded with calls spiking RPC queue times, due to back pressure the 
> same will start reflecting on the router. As a result, clients 
> connecting to healthy/faster name nodes will also slow down, as the same RPC 
> queue is maintained for all calls at the router layer. Essentially the same 
> IPC thread pool is used by the router to connect to all name nodes.
> Currently the router uses one single RPC queue for all calls. Let's discuss 
> how we can change the architecture and add some throttling logic for 
> unhealthy/slow/overloaded name nodes.
> One way could be to read from the current call queue, immediately identify 
> the downstream name node, and maintain a separate queue for each underlying 
> name node. Another, simpler way is to maintain some sort of rate limiter 
> configured for each name node and let routers drop/reject/error out requests 
> after a certain threshold. 
> This won't be a simple change, as the router's 'Server' layer would need 
> redesign and implementation. Currently this layer is the same as the name 
> node's.
> Opening this ticket to discuss, design and implement this feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13787) RBF: Add Snapshot related ClientProtocol APIs

2019-05-06 Thread TanYuxin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833702#comment-16833702
 ] 

TanYuxin commented on HDFS-13787:
-

Hi [~RANith], in the patch there is a redundant import line in 
_TestRouterRpc.java_:
{code:java}
import org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus;{code}

> RBF: Add Snapshot related ClientProtocol APIs
> -
>
> Key: HDFS-13787
> URL: https://issues.apache.org/jira/browse/HDFS-13787
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
>  Labels: RBF
> Attachments: HDFS-13787-HDFS-13891.003.patch, HDFS-13787.001.patch, 
> HDFS-13787.002.patch
>
>
> Currently, allowSnapshot, disallowSnapshot, renameSnapshot, createSnapshot, 
> deleteSnapshot , SnapshottableDirectoryStatus, getSnapshotDiffReport and 
> getSnapshotDiffReportListing are not implemented in RouterRpcServer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2018-03-01 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383110#comment-16383110
 ] 

TanYuxin commented on HDFS-12749:
-

Thanks [~hexiaoqiao] and [~xkrogen] very much for reviewing and resolving the 
issue. I think the v003 patch committed by [~hexiaoqiao] is more effective and 
simpler. 
Would anyone mind having a review? In our production cluster, the problem has 
been fixed for months by the proposed patch.

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1, 2.8.3, 2.7.5, 3.0.0, 2.9.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749-branch-2.7.002.patch, 
> HDFS-12749-trunk.003.patch, HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Affects Version/s: 2.7.1
   Status: Patch Available  (was: Open)

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has processed registerDatanode successfully, so it won't ask 
> the DN to re-register again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Attachment: (was: HDFS-12749.001.patch)

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has processed registerDatanode successfully, so it won't ask 
> the DN to re-register again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Attachment: HDFS-12749.001.patch

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has processed registerDatanode successfully, so it won't ask 
> the DN to re-register again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Attachment: HDFS-12749.001.patch

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
> Attachments: HDFS-12749.001.patch
>
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps scatter the BR from all DNs
> scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
>   }
> {code}
> But the NameNode has processed registerDatanode successfully, so it won't ask 
> the DN to re-register again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-11-02 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235320#comment-16235320
 ] 

TanYuxin commented on HDFS-12749:
-

Thanks @kihwal very much for the comments. 
The DN will retry RPC calls on timeout, but the DN won't send a block report in 
BPServiceActor#register if it got an IOException.
Here is the scenario that occurred in our cluster; a sketch of the kind of fix 
is shown after the list.
1. The DN block report interval is set to 10 hours.
2. The NN is restarted 1 hour after the DN sent its last BR.
3. The DN re-registers with the NN, but gets an IOException and won't send a BR 
immediately. 
4. The NN will only receive the DN's BR about 9 hours later, which is too long.
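
To make the fix direction concrete, here is a minimal sketch of the kind of 
change discussed (not the committed patch; the extra catch clause is an 
assumption for illustration). It widens the retry loop of 
BPServiceActor#register, quoted in the issue description below, so that a 
plain IOException is also retried and scheduler.scheduleBlockReport(...) after 
the loop always runs once registration succeeds:
{code:java}
while (shouldRun()) {
  try {
    // Use returned registration from namenode with updated fields
    newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
    newBpRegistration.setNamespaceInfo(nsInfo);
    bpRegistration = newBpRegistration;
    break;
  } catch (EOFException e) {            // namenode might have just restarted
    sleepAndLogInterrupts(1000, "connecting to server");
  } catch (SocketTimeoutException e) {  // namenode is busy
    sleepAndLogInterrupts(1000, "connecting to server");
  } catch (IOException e) {             // hypothetical extra catch
    // Retry instead of letting the exception escape the loop while the NN
    // has already registered this DN; otherwise no BR gets scheduled.
    sleepAndLogInterrupts(1000, "connecting to server");
  }
}
{code}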

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>Priority: Major
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The uncaught IOException breaks BPServiceActor#register, and the block 
> report cannot be sent immediately. 
> {code}
>   /**
>* Register one bp with the corresponding NameNode
>* 
>* The bpDatanode needs to register with the namenode on startup in order
>* 1) to report which storage it is serving now and 
>* 2) to receive a registrationID
>*  
>* issued by the namenode to recognize registered datanodes.
>* 
>* @param nsInfo current NamespaceInfo
>* @see FSNamesystem#registerDatanode(DatanodeRegistration)
>* @throws IOException
>*/
>   void register(NamespaceInfo nsInfo) throws IOException {
> // The handshake() phase loaded the block pool storage
> // off disk - so update the bpRegistration object from that info
> DatanodeRegistration newBpRegistration = bpos.createRegistration();
> LOG.info(this + " beginning handshake with NN");
> while (shouldRun()) {
>   try {
> // Use returned registration from namenode with updated fields
> newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
> newBpRegistration.setNamespaceInfo(nsInfo);
> bpRegistration = newBpRegistration;
> break;
>   } catch(EOFException e) {  // namenode might have just restarted
> LOG.info("Problem connecting to server: " + nnAddr + " :"
> + e.getLocalizedMessage());
> sleepAndLogInterrupts(1000, "connecting to server");
>   } catch(SocketTimeoutException e) {  // namenode is busy
> LOG.info("Problem connecting to server: " + nnAddr);
> sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> 
> LOG.info("Block pool " + this + " successfully registered with NN");
> bpos.registrationSucceeded(this, bpRegistration);
> // random short delay - helps 

[jira] [Commented] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226405#comment-16226405
 ] 

TanYuxin commented on HDFS-12749:
-


{code}
// DN and NN logs and the timeline of the DN re-registration
// DN starts re-registering:
2017-10-26 12:59:35,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from Namenode_Host/IP:Port with standby 
state

// DN hits SocketTimeoutException and retries registration
2017-10-26 13:02:08,497 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Problem connecting to server: Namenode_Host/IP:Port
2017-10-26 13:06:34,204 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Problem connecting to server: Namenode_Host/IP:Port

// NN successfully registers the DN
2017-10-26 13:07:24,325 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(DataNode_IP:Port, 
datanodeUuid=DataNodeUuid, infoPort=DataNode_port, infoSecurePort=0, 
ipcPort=IPCPort, storageInfo=lv=-57;cid=CID-;nsid=NSID;c=0) storage Storage_ID

// DN gets IOException:
2017-10-26 13:07:35,265 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Error processing datanode Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/DataNode_IP:Port remote=Namenode_Host/IP:Port]; Host Details : local 
host is: "DataNode_Host/IP"; destination host is: "NameNode_Host":Port;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(
...
{code}


> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Our cluster now has thousands of DNs and millions of files and blocks. When 
> the NN restarts, the NN's load is very high.
> After the NN restart, the DN calls the BPServiceActor#reRegister method to 
> register. But the register RPC gets an IOException since the NN is busy 
> dealing with block reports. The exception is caught at 
> BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
> host is: "DataNode_Host/Datanode_IP"; destination host is: 
> "NameNode_Host":Port;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Our cluster now has thousands of DNs and millions of files and blocks. When the 
NN restarts, the NN's load is very high.
After the NN restart, the DN calls the BPServiceActor#reRegister method to 
register. But the register RPC gets an IOException since the NN is busy dealing 
with block reports. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/DataNode_IP:Port remote=NameNode_Host/IP:Port]; Host Details : local 
host is: "DataNode_Host/Datanode_IP"; destination host is: "NameNode_Host":Port;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

The uncaught IOException breaks BPServiceActor#register, and the block report 
cannot be sent immediately. 
{code}
  /**
   * Register one bp with the corresponding NameNode
   * 
   * The bpDatanode needs to register with the namenode on startup in order
   * 1) to report which storage it is serving now and 
   * 2) to receive a registrationID
   *  
   * issued by the namenode to recognize registered datanodes.
   * 
   * @param nsInfo current NamespaceInfo
   * @see FSNamesystem#registerDatanode(DatanodeRegistration)
   * @throws IOException
   */
  void register(NamespaceInfo nsInfo) throws IOException {
// The handshake() phase loaded the block pool storage
// off disk - so update the bpRegistration object from that info
DatanodeRegistration newBpRegistration = bpos.createRegistration();

LOG.info(this + " beginning handshake with NN");

while (shouldRun()) {
  try {
// Use returned registration from namenode with updated fields
newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
newBpRegistration.setNamespaceInfo(nsInfo);
bpRegistration = newBpRegistration;
break;
  } catch(EOFException e) {  // namenode might have just restarted
LOG.info("Problem connecting to server: " + nnAddr + " :"
+ e.getLocalizedMessage());
sleepAndLogInterrupts(1000, "connecting to server");
  } catch(SocketTimeoutException e) {  // namenode is busy
LOG.info("Problem connecting to server: " + nnAddr);
sleepAndLogInterrupts(1000, "connecting to server");
  }
}

LOG.info("Block pool " + this + " successfully registered with NN");
bpos.registrationSucceeded(this, bpRegistration);

// random short delay - helps scatter the BR from all DNs
scheduler.scheduleBlockReport(dnConf.initialBlockReportDelay);
  }
{code}
But the NameNode has processed registerDatanode successfully, so it won't ask 
the DN to re-register again.

  was:
Our cluster now has thousands of DNs and millions of files and blocks. When the 
NN restarts, the NN's load is very high.
After the SNN restart, the DN calls the BPServiceActor#reRegister method to 
register. But the register RPC gets an IOException since the NN is busy dealing 
with block reports. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/IP:Port remote=NameNode/IP:Port]; Host Details : local host is: 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Our cluster now has thousands of DNs and millions of files and blocks. When the 
NN restarts, the NN's load is very high.
After the SNN restart, the DN calls the BPServiceActor#reRegister method to 
register. But the register RPC gets an IOException since the NN is busy dealing 
with block reports. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/IP:Port remote=NameNode/IP:Port]; Host Details : local host is: 
"DataNode/Datanode_ip"; destination host is: "NameNode_Host":Port;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the 
scheduler.scheduleBlockReport method can't run, and the block report will 
not be sent immediately. 
But the NN has received the register RPC and successfully registered the DN, so 
the NN will not make the DN register again at the next heartbeat, which means 
the DN's block report is not sent correctly after registration. 

  was:
Our cluster now has thousands of DNs, 100+ million files, and 100+ million 
blocks. When the NN restarts, the NN's load is very high.
After the SNN restart, the DN calls the BPServiceActor#reRegister method to 
register. But the register RPC gets an IOException since the NN is busy dealing 
with block reports. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=NameNode/IP:Port]; Host Details : local host 
is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the DN's Block Report is not sent correctly after registration.

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has thousands of DNs, 100+ million files, and 100+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=NameNode/IP:Port]; Host Details : local host 
is: "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the DN's Block Report is not sent correctly after registration.
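One possible mitigation, sketched under the same simplified stand-ins (an illustration of the idea only, not the fix committed for this issue): retry the register RPC with backoff and schedule an immediate block report once an attempt finally succeeds:

{code:java}
import java.io.IOException;

public class ReRegisterRetrySketch {

  interface Registrar {
    void register() throws IOException; // stand-in for the register RPC
  }

  static void scheduleBlockReport(long delayMs) {
    System.out.println("Block report scheduled in " + delayMs + " ms");
  }

  // Retry registration with linear backoff; only schedule the block report
  // after the register round trip has completed on the DN side as well.
  static boolean reRegisterWithRetry(Registrar registrar, int maxAttempts,
                                     long backoffMs) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        registrar.register();
        scheduleBlockReport(0); // reached only on success
        return true;
      } catch (IOException e) {
        System.out.println("WARN register attempt " + attempt + " failed: " + e);
        Thread.sleep(backoffMs * attempt); // back off while the NN is busy
      }
    }
    return false; // caller can escalate, e.g. retry again on the next heartbeat
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulated NN that is busy for the first two attempts, then recovers.
    final int[] calls = {0};
    Registrar flaky = () -> {
      if (++calls[0] < 3) {
        throw new IOException("NN busy with block reports");
      }
    };
    System.out.println("registered = " + reRegisterWithRetry(flaky, 5, 10L));
  }
}
{code}

With this shape, a transient timeout while the NN is overloaded no longer permanently skips the block report: the DN keeps retrying until its own register call succeeds, and only then schedules the report.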

  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the DN's Block Report is not sent correctly after registration.

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the DN's Block Report is not sent correctly after registration.

  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the Block Report is not sent correctly after registration.

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again at the next heartbeat, which means the Block Report is not sent correctly after registration.

  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again, which means the Block Report is not sent

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks. When the NN restarts, its load is very high.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}

If an IOException is encountered in BPServiceActor#register, the scheduler.scheduleBlockReport method can't be run, and the Block Report will not be sent immediately.
But the NN has received the register RPC and successfully registered the DN, so the NN will not ask the DN to register again, which means the Block Report is not sent

  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}




> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}




> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register.
> But the register RPC will get an IOException since the NN is busy dealing with the Block Report.
> The exception is caught at BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.14.110.33:24562 
remote=namenode.host.03/10.14.27.17:8040]; Host Details : local host is: 
"datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}



  was: After the SNN restarts, the DN will call the BPServiceActor#reRegister method to 
register. But when the DN registers to the NN, the SNN will return 


> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register.
> But the register RPC will get an IOException since the NN is busy dealing with the Block Report.
> The exception is caught at BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing 
> datanode Command
> java.io.IOException: Failed on local exception: java.io.IOException: 
> java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
> to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.14.110.33:24562 
> remote=namenode.host.03/10.14.27.17:8040]; Host Details : local host is: 
> "datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
> at org.apache.hadoop.ipc.Client.call(Client.java:1474)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-31 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: 
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.14.110.33:24562 remote=namenode.host.03/10.14.27.17:8040]; Host 
Details : local host is: "datanode-2220/10.14.110.33"; destination host is: 
"namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}



  was:
Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. But the register RPC will get an IOException since the NN is busy dealing with the Block Report. The exception is caught at BPServiceActor#processCommand.
The caught IOException is:
{code:java}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode 
Command
java.io.IOException: Failed on local exception: java.io.IOException: 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.14.110.33:24562 
remote=namenode.host.03/10.14.27.17:8040]; Host Details : local host is: 
"datanode-2220/10.14.110.33"; destination host is: "namenode.host.03":8040;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.registerDatanode(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:126)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:793)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:926)
at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:604)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:898)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:711)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:864)
at java.lang.Thread.run(Thread.java:745)
{code}




> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> Now our cluster has 7000+ DNs, 180+ million files, and 180+ million blocks.
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register.
> But the register RPC will get an IOException since the NN is busy dealing with the Block Report.
> The exception is caught at BPServiceActor#processCommand.
> The caught IOException is:
> {code:java}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 

[jira] [Updated] (HDFS-12749) DN may not send block report to NN after NN restart

2017-10-30 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Summary: DN may not send block report to NN after NN restart  (was: DN may 
not send block report to SNN after SNN restart and DN re-register to SNN,)

> DN may not send block report to NN after NN restart
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. 
> But when the DN registers to the NN, the SNN will return 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12749) DN may not send block report to SNN after SNN restart and DN re-register to SNN,

2017-10-30 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Summary: DN may not send block report to SNN after SNN restart and DN 
re-register to SNN,  (was: After SNN restart and DN re-register to SNN, some DN 
may not send block report to SNN if DN num is too large)

> DN may not send block report to SNN after SNN restart and DN re-register to 
> SNN,
> 
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. 
> But when the DN registers to the NN, the SNN will return 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12749) After SNN restart and DN re-register to SNN, some DN may not send block report to SNN if DN num is too large

2017-10-30 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Description: After the SNN restarts, the DN will call the BPServiceActor#reRegister 
method to register. But when the DN registers to the NN, the SNN will return 

> After SNN restart and DN re-register to SNN, some DN may not send block report 
> to SNN if DN num is too large
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>
> After the SNN restarts, the DN will call the BPServiceActor#reRegister method to register. 
> But when the DN registers to the NN, the SNN will return 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12749) After SNN restart and DN re-register to SNN, some DN may not send block report to SNN if DN num is too large

2017-10-30 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Affects Version/s: (was: 2.7.4)

> After SNN restart and DN re-register to SNN, some DN may not send block report 
> to SNN if DN num is too large
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12749) After SNN restart and DN re-register to SNN, some DN may not send block report to SNN if DN num is too large

2017-10-30 Thread TanYuxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TanYuxin updated HDFS-12749:

Target Version/s:   (was: 2.7.4)

> After SNN restart and DN re-register to SNN, some DN may not send block report 
> to SNN if DN num is too large
> ---
>
> Key: HDFS-12749
> URL: https://issues.apache.org/jira/browse/HDFS-12749
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: TanYuxin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-12749) After SNN restart and DN re-register to SNN, some DN may not send block report to SNN if DN num is too large

2017-10-30 Thread TanYuxin (JIRA)
TanYuxin created HDFS-12749:
---

 Summary: After SNN restart and DN re-register to SNN, some DN may 
not send block report to SNN if DN num is too large
 Key: HDFS-12749
 URL: https://issues.apache.org/jira/browse/HDFS-12749
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.4
Reporter: TanYuxin






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org