[jira] [Commented] (HDFS-7480) Namenodes loops on 'block does not belong to any file' after deleting many files

2015-01-06 Thread Frode Halvorsen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266043#comment-14266043
 ] 

Frode Halvorsen commented on HDFS-7480:
---

2.6.1 is not out yet, but one thought: this fix might resolve the issue where 
namenodes are started with a lot of incoming information about 'loose' 
data-blocks, but it probably won't resolve the issue that causes the namenodes 
to be killed by ZooKeeper when I delete a lot of files.
At the moment of the delete, I don't think the logging is that problematic.
The logging issue, I believe, is secondary. I believe that the active namenode 
gets busy calculating/distributing delete orders to datanodes when I drop 
500,000 files at once, and that this is the cause of the ZooKeeper shutdown. 
When the namenode gets overloaded calculating/distributing those delete 
orders, it doesn't keep up with its responses to ZooKeeper, which then kills 
the namenode in order to fail over to NN2.
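For what it's worth, the timeouts that govern this behaviour are tunable; 
raising them is a possible stopgap while the load problem itself is unfixed. 
A minimal sketch, assuming the kills really do come from ZooKeeper session 
expiry or health-monitor timeouts in the ZKFC (the keys are the standard 
Hadoop HA ones in core-site.xml; the values are illustrative only, not 
recommendations):

    <property>
      <!-- ZKFC's ZooKeeper session timeout: a longer session tolerates
           longer NameNode stalls before a failover is triggered -->
      <name>ha.zookeeper.session-timeout.ms</name>
      <value>30000</value>
    </property>
    <property>
      <!-- how long the ZKFC health monitor waits on the NameNode's RPC
           before declaring it unhealthy -->
      <name>ha.health-monitor.rpc-timeout.ms</name>
      <value>90000</value>
    </property>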

 Namenodes loops on 'block does not belong to any file' after deleting many 
 files
 

 Key: HDFS-7480
 URL: https://issues.apache.org/jira/browse/HDFS-7480
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
 Environment: CentOS - HDFS-HA (journal), zookeeper
Reporter: Frode Halvorsen

 A small cluster has 8 servers with 32 GB RAM.
 Two are namenodes (HA-configured) and six are datanodes (8x3 TB disks 
 configured with RAID as one 21 TB drive).
 The cluster receives on average 400,000 small files each day. I started 
 archiving each day as a separate HAR archive (a sketch of this cycle follows 
 the log excerpt below). After deleting the original files for one month, the 
 namenodes started acting up really badly.
 When restarting them, both active and passive nodes seem to work OK for some 
 time, but then start to report a lot of blocks belonging to no files, and 
 the namenode just spins on those messages in a massive loop. If the passive 
 node gets there first, it also influences the active node in such a way that 
 it's no longer possible to archive new files. If the active node also enters 
 this loop, it suddenly dies without any error message.
 The only way I'm able to get rid of the problem is to start decommissioning 
 nodes, watching the cluster closely to avoid downtime, and make sure every 
 datanode gets a 'clean' start. After all datanodes have been decommissioned 
 (in turns) and restarted with clean disks, the problem is gone. But if I then 
 delete a lot of files in a short time, the problem starts again...  
 The main problem (I think) is that the receiving and reporting of those 
 blocks takes so many resources that the namenodes are too busy to tell the 
 datanodes to delete those blocks...
 If the active namenode enters the loop, it does the 'right' thing by 
 telling the datanode to invalidate the block, but the amount of blocks is so 
 massive that the namenode doesn't do anything else. Just now, I have about 
 1200-1400 log entries per second on the passive node.
 Update:
 Just got the active namenode into the loop - it logs 1000 lines per second: 
 500 'BlockStateChange: BLOCK* processReport: blk_1080796332_7056241 on 
 x.x.x.x:50010 size 1742 does not belong to any file'
 and 
 500 'BlockStateChange: BLOCK* InvalidateBlocks: add blk_1080796332_7056241 
 to x.x.x.x:50010'
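
 For context, a minimal sketch of the daily archive-then-delete cycle 
 described above (the paths and archive name are made-up placeholders; 
 hadoop archive and hdfs dfs are the stock tools):

     # pack one day's small files into a HAR archive
     hadoop archive -archiveName 2014-11-01.har -p /data/incoming 2014-11-01 /data/archive

     # after verifying the archive, bulk-delete the originals;
     # a mass delete like this is what precedes the log loop
     hdfs dfs -rm -r -skipTrash /data/incoming/2014-11-01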



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7480) Namenodes loops on 'block does not belong to any file' after deleting many files

2014-12-15 Thread Frode Halvorsen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246605#comment-14246605
 ] 

Frode Halvorsen commented on HDFS-7480:
---

I will test when 2.6.1 is released.



[jira] [Commented] (HDFS-7480) Namenodes loops on 'block does not belong to any file' after deleting many files

2014-12-13 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245463#comment-14245463
 ] 

Konstantin Shvachko commented on HDFS-7480:
---

This looks similar to HDFS-7503. Is it fixed by that patch?



[jira] [Commented] (HDFS-7480) Namenodes loops on 'block does not belong to any file' after deleting many files

2014-12-06 Thread Frode Halvorsen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236768#comment-14236768
 ] 

Frode Halvorsen commented on HDFS-7480:
---

One thought: could it be that the failover controller is responsible for 
killing the namenode before it has finished the task, because the namenode 
doesn't respond properly?


I had one situation now where one datanode had only 120,000 blocks without a 
file reference, and when I restarted this datanode, it connected to the active 
namenode. The namenode then received those block IDs and invalidated them. When 
finished, I had 120,000 blocks 'pending deletion', but since the namenode kept 
logging that blocks 'do not belong to any file', it took a very long time 
before it managed to tell the datanode to start deleting. As the number of 
blocks decreased, the time between delete commands got shorter and shorter. In 
the end that datanode was cleaned up, but when I tried the same approach on a 
datanode with more unattached blocks, the namenode died before all blocks were 
marked as invalid. I suspect that it might have been the failover controller 
that actually killed the namenode. And of course, when the namenode died, it 
lost all information about blocks 'pending deletion' and had to start over when 
restarted...

For the moment, I have killed the failover controller, but it seems that the 
stream of invalid blocks constantly bombarding the namenode prevents it from 
ever getting around to telling the datanode to delete the blocks. (It's taking 
forever between deletes in the beginning.)

The first bug in this case must be that the namenode/datanode communication 
repeats the loop of non-attached blocks; the second bug must be that the 
namenode gets so busy receiving those messages that it's unresponsive to 
anything else...
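
A rough way to watch the 'pending deletion' queue mentioned above is the stock 
metasave dump (the file name here is arbitrary; the dump lands in the 
namenode's log directory, and its exact wording varies between versions):

    # dump block-level namenode state, including blocks queued for deletion
    hdfs dfsadmin -metasave pending.txt

    # look for the pending-invalidation summary in the dump
    grep -i "waiting deletion" $HADOOP_LOG_DIR/pending.txt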
