[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-30 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273631#comment-16273631
 ] 

genericqa commented on HDFS-12638:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 47s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 32s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 64 unchanged - 0 fixed = 66 total (was 64) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}142m 50s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
18s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}189m 36s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
|   | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.server.namenode.TestCheckpoint |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.server.datanode.TestDataNodeUUID |
|   | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | 
hadoop.hdfs.server.blockmanagement.TestReconstructStripedBlocksWithRackAwareness
 |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639 |
| JIRA Issue | HDFS-12638 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12900072/HDFS-12638.003.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ab241a5c43d4 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 
11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| 

[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-30 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273621#comment-16273621
 ] 

Zhe Zhang commented on HDFS-12638:
--

Thanks [~shv]. +1 on the v3 patch (pretty clear fix to remove orphan 
copy-on-truncate block). Two minor comments:

# Shouldn't {{BlockUnderConstructionFeature#truncateBlock}} be a {{BlockInfo}}? 
All its value assignments are from a BI. If you agree, I'm OK with addressing 
this in a separate JIRA.
# The title doesn't really reflect the fix (I imagine ReplicationMonitor 
crashing is only one of the symptoms). Could you update it?

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> HDFS-12638.003.patch, OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-29 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272083#comment-16272083
 ] 

Erik Krogen commented on HDFS-12638:


I investigated this further and think that Konstantin's v002 patch should 
actually solve the problem. Actually, HDFS-9754 does not change the invariant 
that [~shv] mentioned. A few notes from the investigation:
* After incremental block deletion was added, it was already true that a block 
which was not associated with a valid INode could be present in the blocksMap. 
In {{FSNamesystem#delete()}}, we first call {{FSDirDeleteOp#delete()}} within 
the write lock. Then release the write lock, then call 
{{BlockManager#removeBlock()}} (will remove the block from blocksMap) on each 
block later on. Within {{FSDirDeleteOp#delete()}}, all INodes being deleted are 
removed from the inodesMap (see {{FSDirDeleteOp#deleteInternal()}} which calls 
{{FSNamesystem#removeLeasesAndINodes()}}). 
* This scenario meant that places e.g. 
{{BlockManager#scheduleReconstruction()}} had to check if there was a 
BlockCollection associated with the block, which it previously did by checking 
{{FSNamesystem#getBlockCollection(blkInfo.getBlockCollectionId()) != null}}. 
Now, HDFS-9754 replaced this call with {{BlockInfo#isDeleted()}}, so this means 
whenever we remove an INode from the inodesMap, we need to call 
{{BlockInfo#delete()}} to indicate that it does not have a valid 
BlockCollection associated with it (this is currently done within 
{{INodeFile#clearFile()}}, called by {{INode#destroyAndCollectBlocks()}}, 
called by {{FSDirDeleteOp#unprotectedDelete()}}).
* When HDFS-9754 was added, it did not properly mark copy-on-truncate blocks 
with {{BlockInfo#delete()}}, so the {{BlockInfo#isDeleted()}} check would fail, 
thus causing {{BlockManager#secheduleReconstruction()}} to throw NPE when it 
tries to use {{FSNamesystem#getBlockCollection(blkInfo)}} (since it assumes 
there is a valid block collection associated).
* Konstantin's patch correctly invalidates copy-on-truncate blocks, so should 
fix this NPE, at least for the case of copy-on-truncate blocks.

So +1 from me (non-binding) on the logic of the v002 patch. We should also try 
to get some unit test in for this.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269808#comment-16269808
 ] 

Junping Du commented on HDFS-12638:
---

Ping... [~szetszwo] and [~jingzhao], do you agree with [~shv]'s proposal above 
that to revert HDFS-9754? Or you have some other proposal here? 2.8.3 release 
is pending on this.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263649#comment-16263649
 ] 

Junping Du commented on HDFS-12638:
---

Should be a blocker for 2.9.1 and 3.0.0 as well. Add them into target version 
for tracking.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263561#comment-16263561
 ] 

Konstantin Shvachko commented on HDFS-12638:


I looked more closely through the changes introduced in HDFS-9754. It looks 
like a major refactoring with a goal which seems like a minor optimization, and 
with no test coverage. The main problem is that it violated the key invariant, 
which is mentioned in the [first 
comment|https://issues.apache.org/jira/browse/HDFS-9754?focusedCommentId=15131492=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15131492]
 of that jira, stating
* *A block is always associated with a file when added to {{BlocksMap}}*

I propose to revert HDFS-9754. I think the invariant is important and should be 
preserved.
Reverting is not trivial, since more refactoring accumulated on top of it, but 
possible.

The patch I posted here is still needed since we should remove the extra block 
in truncate case. It is not as critical though, since this is not the cause of 
NPE and the truncate block is eventually deleted, when DN finishes truncate or 
on the next block report. So to resolve this blocker for upcoming releases 
(2.8, 2.9, 3.0) we need to complete the revert.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263553#comment-16263553
 ] 

Tsz Wo Nicholas Sze commented on HDFS-12638:


Since this is about NameNode failed by NullPointerException, I agree that it is 
a blocker.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263522#comment-16263522
 ] 

Junping Du commented on HDFS-12638:
---

[~jingzhao] and [~szetszwo], any comments on Konstantin's proposal above?

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-16 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256293#comment-16256293
 ] 

Konstantin Shvachko commented on HDFS-12638:


I think it's a blocker for all branches 2.8 and up. Even just removing that 
line {{toDelete.delete();}} would prevent crashing NameNode.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Critical
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-08 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245161#comment-16245161
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~shv]

Thanks, I see your point. I have increased the priority of this bug to 
critical, as it has a big risk to crash NN in a production cluster, with DN 
rolling update or creating some snapshots, this issue can be easily triggered. 
[~jingzhao], please share your thoughts on this. Thanks.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Critical
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-08 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245134#comment-16245134
 ] 

Konstantin Shvachko commented on HDFS-12638:


[~wweic] {{addDeleteBlock()}} is supposed to be called, when the block is 
really intended to be deleted, that is it is not contained in any snapshots. I 
checked the code, there is a lot of logic around detecting which blocks belong 
or not to a snapshot, see e.g. {{INodeFile.collectBlocksBeyondSnapshot()}}. 
This makes it safe to delete the {{truncateBlock}}. Unless you have a test case 
as a counterexample. 

Did some digging and now understand why we don't see this in 2.7.4. The 
following line was introduced into {{addDeleteBlock()}} by HDFS-9754:
{code}
   assert toDelete != null : "toDelete is null";
+  toDelete.delete();
   toDeleteList.add(toDelete);
{code}
which sets {{Block.bcId = INVALID_INODE_ID}}. I think this was the wrong place 
to invalidate bcId, as [I mensioned 
earlier|https://issues.apache.org/jira/browse/HDFS-12638?focusedCommentId=16214120=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16214120].
[~jingzhao] could you please take a look.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-30 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226111#comment-16226111
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~shv]

Thanks for the new patch. One question though, in snapshot case,  the 
copy-on-truncate blocks are referred by the snapshot taken before the delete 
happens. If you remove these blocks during deleting the file, will read the 
snapshot fail?

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-24 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217402#comment-16217402
 ] 

Daryn Sharp commented on HDFS-12638:


Might want to investigate the lifecycle handling of 
{{BlockUnderConstructionFeature#truncateBlock}}.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-23 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215008#comment-16215008
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~shv]

I don't think what you suggested could fix this. Please see attachment 
[^OphanBlocksAfterTruncateDelete.jpg]. We are hinting the NPE because inode is 
removed but the block {{blk_1073741825}} was left over (created during 
copy-on-truncate), then if we look for INode by the block bcID {{16387}}, we 
get a {{null}}. What you suggested seems still be operating on deleted block 
{{blk_1073741826}}, I tried and it did not resolve this problem. Please take a 
look and let me know if I have any place wrong. Thanks.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-21 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214120#comment-16214120
 ] 

Konstantin Shvachko commented on HDFS-12638:


It looks to me that the main problem here is that {{unprotectedDelete()}} while 
collecting blocks in {{INodeFile.destroyAndCollectBlocks()}} invalidates block 
collection {{bcid}}, but leaves the block in the {{BlocksMap}}. Then 
{{FSNamesystem.delete()}} releases the lock and reacquires it again for actual 
block removal in {{FSNamesystem.removeBlocks()}}. So if {{ReplicationMonitor}} 
or {{NamenodeFsck}} kick in after the lock is released, but before the blocks 
are deleted from {{BlocksMap}} they can hit NPE accessing invalid (id = -1) 
INode.
Incremental block deletion was introduced in HDFS-6618, so all major versions 
should be affected.

For fixing this we should not invalidate {{bcid}} in 
{{NodeFile.destroyAndCollectBlocks()}}, but rather in 
{{BlockManager.removeBlockFromMap()}}, when the block is actually removed from 
{{BlocksMap}}.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-19 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211413#comment-16211413
 ] 

Daryn Sharp commented on HDFS-12638:


bq. Yes, I think our code should bear with such orphan blocks, instead of 
failing the NN with NPE like this. At least.
See below, they aren't really orphaned.  I think it's correct for the NN to 
crash if the namesystem data structures are corrupted.

bq. I assume when the snapshot gets deleted, these blocks will be also removed 
from the blocks map. But before that, we need to live with such orphaned blocks
To the block manager, replication monitor, etc these copy-on-truncate blocks 
are not (supposed to be) special.  My prior point stated another way is the 
block is not orphaned if it's in a snapshot diff.  INodes are not orphaned when 
only referenced via a snapshot diff.  A block in the blocks map should not be 
referencing an inode not in the inodes map.  Direct namespace accessibility is 
irrelevant to the block/inode/map linkages being correct.

We need to fix the bug, not mask it.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-18 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210498#comment-16210498
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~daryn]

bq. it sounds like you want to avoid the symptom (NPE) 

Yes, I think our code should bear with such orphan blocks, instead of failing 
the NN with NPE like this. At least.

bq. rather than address the orphaned block (root cause)?

The test case explains how these orphaned blocks come from

||Step#||Step Operation||Explain||
|1|Create a file| this creates a few blocks, e.g B1|
|2|Run rolling upgrade (or create a snapshot)|this is the condition to trigger 
the copy-on-truncate behavior|
|3|Truncate the file to a smaller size|"copy-on-truncate" schema is used, it 
creates a new block B2 and copy required bytes from B1 to B2, and inode 
reference updated to B2|
|4|Delete this file|this will delete inode ref and remove B2 from blocks map, 
leaving B1 behind. So when we read snapshot again, it is able to find its 
original block B1.|

Please see also {{FSDirTtruncateOp#shouldCopyOnTruncate}}. I have read the 
truncate design doc, this seems to be the designed behavior. We cannot delete 
the old block otherwise snapshot won't be able to read it anymore. I assume 
when the snapshot gets deleted, these blocks will be also removed from the 
blocks map. But before that, we need to live with such orphaned blocks. Any 
thoughts on this?

Thanks


> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-18 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209604#comment-16209604
 ] 

Daryn Sharp commented on HDFS-12638:


[~cheersyang], maybe I'm misinterpreting the proposal but it sounds like you 
want to avoid the symptom (NPE) rather than address the orphaned block (root 
cause)?

I haven't really studied the truncate code.  For "copy-on-truncate" it sounds 
reasonable for both blocks to be in the map.  But both should be referenced 
somewhere.  Presumably this is just for snapshots, which means one of the 
blocks is in a diff.  If delete misses one of the blocks, that's the root cause 
we must fix.  Be warned the block reclamation code is not a fun place.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-17 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208104#comment-16208104
 ] 

Konstantin Shvachko commented on HDFS-12638:


Hey [~cheersyang], blocks deletion from {{BlocksMap}} is not immediate.
If I step through the test case in debugger giving enough time for deletion to 
complete, I do not hit the assert.
So I understand the problem that there are blocks in the {{BlocksMap}} that do 
not belong to any file, which we should not allow, but I don't see the unit 
test capturing this bug.
Which branch are you testing this with? I am looking on trunk.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-16 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206921#comment-16206921
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~shv], the actual deletion of those blocks are running in async mode at 
background, but the blocks map {{BlocksMap}} is immediately updated. I think 
[~yangjiandan]'s UT explained how this orphan block was created, which 
eventually caused the failure in the JIRA's description. 

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206786#comment-16206786
 ] 

Hadoop QA commented on HDFS-12638:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 11m 
47s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2.8.2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
10s{color} | {color:green} branch-2.8.2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} branch-2.8.2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} branch-2.8.2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} branch-2.8.2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} branch-2.8.2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} branch-2.8.2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}665m 50s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 15m 
10s{color} | {color:red} The patch generated 154 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}714m 10s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeInitStorage |
|   | hadoop.hdfs.server.blockmanagement.TestNodeCount |
|   | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
|   | hadoop.hdfs.server.datanode.TestDataNodeUUID |
|   | 
hadoop.hdfs.server.datanode.fsdataset.TestAvailableSpaceVolumeChoosingPolicy |
| Timed out junit tests | org.apache.hadoop.hdfs.server.datanode.TestHSync |
|   | 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistPolicy |
|   | org.apache.hadoop.hdfs.TestModTime |
|   | org.apache.hadoop.hdfs.TestEncryptionZones |
|   | org.apache.hadoop.hdfs.TestLeaseRecovery2 |
|   | org.apache.hadoop.hdfs.TestSmallBlock |
|   | org.apache.hadoop.hdfs.server.namenode.TestAllowFormat |
|   | org.apache.hadoop.hdfs.tools.TestDFSAdminWithHA |
|   | org.apache.hadoop.hdfs.TestDFSStartupVersions |
|   | org.apache.hadoop.hdfs.TestHdfsAdmin |
|   | org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting 
|
|   | org.apache.hadoop.hdfs.server.datanode.TestDeleteBlockPool |
|   | org.apache.hadoop.hdfs.server.datanode.TestNNHandlesBlockReportPerStorage 
|
|   | org.apache.hadoop.hdfs.server.blockmanagement.TestHeartbeatHandling |
|   | org.apache.hadoop.hdfs.TestFileCreationEmpty |
|   | org.apache.hadoop.hdfs.TestFileCreationClient |
|   | org.apache.hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer |
|   | org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer |
|   | org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager |
|   | org.apache.hadoop.hdfs.TestDatanodeRegistration |
|   | 

[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-16 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205861#comment-16205861
 ] 

Weiwei Yang commented on HDFS-12638:


Hi [~yangjiandan]

Thanks for narrowing down the root cause and providing a test case. I believe 
as long as the truncate runs as *copy-on-truncate* schema. e.g under rolling 
upgrade, upgrade not finalized or in snapshot, it will have this problem. This 
code path creates a new block for truncation and at same time the old block is 
left over in blocks map. When the file gets deleted, the old block becomes to 
be an orphan block.

Further, I read quite a few JIRAs similar to this problem. Such as HDFS-7611, 
HDFS-8113, HDFS-4867. It looks like what we deal with such blocks (if  it is 
reasonably be an orphan block) is to simply add a check to avoid NPE. For 
example in {{BlockManager#dumpBlockMeta}}

{code}
if (block instanceof BlockInfo) {
  BlockCollection bc = getBlockCollection((BlockInfo)block);
  String fileName = (bc == null) ? "[orphaned]" : bc.getName();
  out.print(fileName + ": ");
}
{code}

most places already handled the case like this. So I would suggest to use a 
similar fix to resolve this issue. A few suggestions

# Add a check in {{BlockManager#scheduleReplication}} to avoid NPE
# Review the call in {{BlockManager#chooseExcessReplicates}}, most likely it 
needs a check too
# Add a check in {{NamenodeFsck}} to fix the NPE when run {{fsck -blockId}} 
agaist an orphan block
# Add a javadoc to remind {{BlockManager#getBlockCollection}} might return a 
null

Please let me know if this makes sense, [~yangjiandan], [~kihwal], [~daryn].





> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-16 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205739#comment-16205739
 ] 

Jiandan Yang  commented on HDFS-12638:
--

Doing truncate & delete files when cluster is RollingUpgrade leads to orphans 
blocks. I add a ut in patch to reproduce the problem.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203585#comment-16203585
 ] 

Daryn Sharp commented on HDFS-12638:


[~shv], you worked on truncate, any further insights?

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-13 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203577#comment-16203577
 ] 

Daryn Sharp commented on HDFS-12638:


There's 2 likely scenarios:
* The block is added to the blocks map with an unlinked inode not in the inode 
map.  The only way to add a block to the map is via 
{{BlockManager#addBlockCollection}}.  Truncate, like add block, does this but 
it should not have been able to resolve the inode if it's unlinked and I don't 
immediately see locking issues.
* The lease manager encountered an error and removed the inode but left the 
blocks intact.  Look for log warn of "Removing lease with an invalid path" 
because the lease manager catches IOEs and just removes the inode anyway which 
can leave the blocks map in an inconsistent state.  Very bad.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-13 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203501#comment-16203501
 ] 

Jiandan Yang  commented on HDFS-12638:
--

I find  another block with the same problem,  and the auditlogs about the file 
to which the block blongs are as following.
{code:java}
2017-10-13 03:26:59,198 INFO FSNamesystem.audit: allowed=true   ugi=admin 
(auth:SIMPLE) ip=/xx.xxx.xx.xxx   cmd=create  
src=/user/admin/ba13ed3a-898d-4b72-a873-1999da5f0a70dst=null
perm=admin:hadoop:rw-r--r-- proto=rpc
2017-10-13 03:26:59,290 INFO FSNamesystem.audit: allowed=true   ugi=admin 
(auth:SIMPLE) ip=/xx.xxx.xx.xxx   cmd=truncate
src=/user/admin/ba13ed3a-898d-4b72-a873-1999da5f0a70dst=null
perm=admin:hadoop:rw-r--r-- proto=rpc
2017-10-13 03:26:59,293 INFO FSNamesystem.audit: allowed=true   ugi=admin 
(auth:SIMPLE) ip=/xx.xxx.xx.xxx   cmd=delete  
src=/user/admin/ba13ed3a-898d-4b72-a873-1999da5f0a70dst=null
perm=null   proto=rpc
{code}

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-12 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203022#comment-16203022
 ] 

Jiandan Yang  commented on HDFS-12638:
--

[~daryn] NN audit log lost some, and we did not find truncate operation about 
this file, but I think this file was truncated by viewing DN logs. In DN log 
the block was first finallized then do recover.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202357#comment-16202357
 ] 

Daryn Sharp commented on HDFS-12638:


The issues actually appear unrelated.  In our case, there was only 1 replica, 
that node died, the block was deleted.  The block got stuck in the replication 
monitor's missing block queue.  Kihwal is filing a jira with additional details.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202092#comment-16202092
 ] 

Daryn Sharp commented on HDFS-12638:


It does differ from our case but likely has the same root cause.  The block was 
in your blocks map, but not in ours.  The block existed on your DN, but not 
ours.  The commonality is the block referenced a non-existent inode/collection.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-12 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202079#comment-16202079
 ] 

Daryn Sharp commented on HDFS-12638:


This is bad.  The fsck NPE means the block _is_ in the blocks map but the 
inode/collection it references _is not_ in the blocks map.  Last I knew, a size 
of Long.MAX_VALUE means the block is scheduled for a "fire and forget" 
invalidation because the file was deleted.

bq. and there are logs of truncate cmd in auditlog

To double check, did you mean "are" or "are not"?



> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-12 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201540#comment-16201540
 ] 

Jiandan Yang  commented on HDFS-12638:
--

datanode revover failed because new blocksize is Long.MAX
{code:java}
2017-10-09 19:19:17,054 INFO 
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] 
org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at 
et2btsm1.et2.tbsite.net/11.251.159.136:8020 calls 
recoverBlock(BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203820_11907141, 
targets=[DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null], 
DatanodeInfoWithStorage[xx.xxx.xx.bbb:50010,null,null], 
DatanodeInfoWithStorage[xx.xxx.xx.ccc:50010,null,null]], 
newGenerationStamp=11907145, newBlock=blk_1084203824_11907145)
2017-10-09 19:19:17,055 INFO 
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
initReplicaRecovery: blk_1084203820_11907141, recoveryId=11907145, 
replica=FinalizedReplica, blk_1084203820_11907141, FINALIZED
  getNumBytes() = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()   = /dump/10/dfs/data/current
  getBlockFile()= 
/dump/10/dfs/data/current/BP-1721125339-xx.xxx.xx.xxx-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
2017-10-09 19:19:17,055 INFO 
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
initReplicaRecovery: changing replica state for blk_1084203820_11907141 from 
FINALIZED to RUR
2017-10-09 19:19:17,058 WARN 
[org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] 
org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to 
updateBlock 
(newblock=BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203824_11907145, 
datanode=DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null])
org.apache.hadoop.ipc.RemoteException(java.io.IOException): rur.getNumBytes() < 
newlength = 9223372036854775807, rur=ReplicaUnderRecovery, 
blk_1084203820_11907141, RUR
  getNumBytes() = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()   = /dump/9/dfs/data/current
  getBlockFile()= 
/dump/9/dfs/data/current/BP-1721125339-11.251.159.136-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
  recoveryId=11907145
  original=FinalizedReplica, blk_1084203820_11907141, FINALIZED
  getNumBytes() = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()   = /dump/9/dfs/data/current
  getBlockFile()= 
/dump/9/dfs/data/current/BP-1721125339-11.251.159.136-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2736)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2678)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.updateReplicaUnderRecovery(DataNode.java:2776)
at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:78)
at 
org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3107)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)


at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
at org.apache.hadoop.ipc.Client.call(Client.java:1429)
at org.apache.hadoop.ipc.Client.call(Client.java:1339)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy22.updateReplicaUnderRecovery(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:77)
at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$600(BlockRecoveryWorker.java:60)
at 

[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-11 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201413#comment-16201413
 ] 

Jiandan Yang  commented on HDFS-12638:
--

We found missing blockId by metasave, and did fsck -blockId, NN also throw NPE, 
and the inode to which the block blongs was truncated after created.

create log:
{code:java}
hadoop-hadoop-namenode-**.log.9:2017-10-09 19:19:16,370 INFO [IPC Server 
handler 902 on 8020] org.apache.hadoop.hdfs.StateChange: BLOCK* allocate 
blk_1084203820_11907141, replicas=11.251.153.26:50010, 11.251.153.29:50010, 
11.227.70.75:50010 for /user/admin/xxx
{code}
because auditlog was overrided,  we can not found operation about this file
fsck -blockId log:
{code:java}
2017-10-12 11:22:03,929 WARN [502920422@qtp-1473771722-3789] 
org.apache.hadoop.hdfs.server.namenode.NameNode: Error in looking up block
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.blockIdCK(NamenodeFsck.java:259)
at 
org.apache.hadoop.hdfs.server.namenode.NamenodeFsck.fsck(NamenodeFsck.java:323)
at 
org.apache.hadoop.hdfs.server.namenode.FsckServlet$1.run(FsckServlet.java:69)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
at 
org.apache.hadoop.hdfs.server.namenode.FsckServlet.doGet(FsckServlet.java:58)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1351)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}


> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> 

[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-11 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201343#comment-16201343
 ] 

Jiandan Yang  commented on HDFS-12638:
--

[~daryn] There is no snapshot directory in cluster, and we could not find block 
info in the log, and there are logs of truncate cmd in auditlog. In active NN  
WebUI there are 2000+ missing blocks, but fsck result do not include missing 
replicas.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-11 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200469#comment-16200469
 ] 

Daryn Sharp commented on HDFS-12638:


[~yangjiandan] do you have any further info you can share?  Did this involve 
snapshots?  Do you know if the problematic block belonged to a deleted file, or 
a truncated file, abandoned block, etc?

This case case appears to mirror ours (except we didn't crash).  The block has 
a "valid" (not the sentinel no collection id) block collection id for an inode 
not in the inode map.  The block also isn't in the blocksmap.  These are big 
problems.

Either the replication monitor is working with copies of stored blocks, or 
values are being cleared in the wrong order.  Delete tries to unlink the blocks 
from the collection before removing from the blocks map, then clears the 
collection's block list before removing from the inode map.  Thus there should 
never be a block referencing a missing collection...

Moving the block's bc from a reference to an id is highly likely to have 
exposed latent bugs if copies of stored blocks are not re-resolved via the 
blocks map.  Having a reference hid logic issues.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-10-11 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200361#comment-16200361
 ] 

Kihwal Lee commented on HDFS-12638:
---

We have also seen blocks with "null" bc staying in the replication queue. They 
were missing blocks so the replication monitor didn't even try to schedule them 
and didn't crash. But metaSave was listing them as orphaned (bc == null, 
deleted).  Other than failing over (force queue reinitialization),there was no 
way to clear them.

In your particular case, we can add a null check in {{scheduleReplication()}} 
in addition to the existing deletion check.  The missing block case is a bit 
trickier, since the replication monitor will not touch them and nothing will 
move it to a different priority level since the block is already deleted and 
invalidated on datanodes.  We should prevent it from getting added to the queue.

In any case, it is apparent that the new {{isDeleted()}} check cannot replace 
the bc null check 100%.  [~jingzhao] any thoughts?


> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>
> Active NamNode exit due to NPE, I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null, By view history I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] remove judging  
> whether  BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org