[jira] [Commented] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Akira Ajisaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378898#comment-15378898
 ] 

Akira Ajisaka commented on HDFS-10628:
--

+1, thanks Jiayi.

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.8.0
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch, HDFS-10628.002.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also logged this to the Balancer log, so that people can figure out why 
> the Balancer aborted.
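
For illustration, a minimal sketch of the idea, assuming a typical catch block
in the Balancer's CLI; the method shape, logger, and exit codes here are
assumptions, not the committed patch:

{code}
// Sketch only: mirror the stderr exit message into the Balancer's own log
// so aborts can be diagnosed from the log file alone.
class BalancerExitLogging {
  static final org.slf4j.Logger LOG =
      org.slf4j.LoggerFactory.getLogger(BalancerExitLogging.class);

  static int runBalancerCli(Runnable job) {
    try {
      job.run();
      return 0;  // ExitStatus.SUCCESS in the real Balancer
    } catch (RuntimeException e) {
      System.err.println(e + ".  Exiting ...");              // existing: stderr only
      LOG.error("Exiting balancer due to an exception", e);  // added: Balancer log
      return -1; // ExitStatus.IO_EXCEPTION in the real Balancer
    }
  }
}
{code}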






[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378879#comment-15378879
 ] 

Vinayakumar B commented on HDFS-10587:
--

bq. Here it says checksum error at 81920, which is at the very beginning 
itself. Maybe the 229 disk has some problem, or some corruption happened due to 
the network card during the transfer to 77. It is not exactly the same as the 
current case.
I was wrong. [~xupeng]'s case is exactly the same as this Jira.
Here is how:
# 77 throws the exception while verifying the received packet during the 
transfer from 229 (which got the block transferred earlier from 228).
# When verifying a single packet, the position mentioned in the checksum 
exception is relative to the packet buffer offset, not the block offset. So 
81920 is the packet-relative offset in the exception.
# Bytes already written to disk on 77 during the transfer, before the checksum 
exception: 9830400.
# Total: 9830400 + 81920 == 9912320, which is the same as the bytes received by 
229 from 228 when it was added to the pipeline.
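
To make the arithmetic concrete, the same computation as a tiny self-contained
example (numbers taken from the log excerpts above; illustrative only, not code
from any patch):

{code}
public class OffsetCheck {
  public static void main(String[] args) {
    long bytesOnDiskBefore = 9_830_400L; // written on 77 before the exception
    long offsetInPacket = 81_920L;       // packet-relative offset in the exception
    // The exception offset is relative to the packet buffer, so the true
    // block-relative position is the sum of the two.
    long blockOffset = bytesOnDiskBefore + offsetInPacket;
    System.out.println(blockOffset); // 9912320, the bytes 229 received from 228
  }
}
{code}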


> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that an incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data was garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it treated whatever was transferred as 
> acknowledged. However, the visible length of the replica differed (rounded 
> up to the next multiple of 512) from that of the source of the transfer. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended on the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was never 
> replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, 
> it wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, though not corrupt. 
> Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()   = /hadoop-g/data/current
>   getBlockFile()= 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816
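
The numbers in footnote [1] above are consistent with step (6): the receiver's
visible length is exactly the sender's visible length rounded up to the next
512-byte chunk. A quick illustrative check:

{code}
public class ChunkRoundUp {
  public static void main(String[] args) {
    final long CHUNK = 512L;          // checksum chunk size
    long senderVisible = 41_186_444L; // sender's getVisibleLength()
    // Round up to the next multiple of 512, since the receiver kept whole chunks:
    long receiverVisible = ((senderVisible + CHUNK - 1) / CHUNK) * CHUNK;
    System.out.println(receiverVisible); // 41186816, the receiver's getVisibleLength()
  }
}
{code}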

[jira] [Commented] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378808#comment-15378808
 ] 

Hadoop QA commented on HDFS-10626:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 58m 
52s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
18s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 46s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818086/HDFS-10626.002.patch |
| JIRA Issue | HDFS-10626 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 78a59e0c1b08 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / e549a9a |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16066/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16066/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-10626.001.patch, HDFS-10626.002.patch

[jira] [Commented] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378805#comment-15378805
 ] 

Hadoop QA commented on HDFS-10632:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 29s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 94m 16s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818080/HDFS-10632.001.patch |
| JIRA Issue | HDFS-10632 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux fc321d41bab2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / e549a9a |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16065/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16065/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16065/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Yiqun Lin
>Priority: Trivial
>  Labels: supportability
> Attachments: HDFS-10632.001.patch

[jira] [Commented] (HDFS-10600) PlanCommand#getThrsholdPercentage should not use throughput value.

2016-07-14 Thread Yiqun Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378758#comment-15378758
 ] 

Yiqun Lin commented on HDFS-10600:
--

Thanks a lot for the commit, [~eddyxu]!

> PlanCommand#getThrsholdPercentage should not use throughput value.
> --
>
> Key: HDFS-10600
> URL: https://issues.apache.org/jira/browse/HDFS-10600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: diskbalancer
>Affects Versions: 2.9.0, 3.0.0-beta1
>Reporter: Lei (Eddy) Xu
>Assignee: Yiqun Lin
> Fix For: 3.0.0-alpha1
>
> Attachments: HDFS-10600.001.patch, HDFS-10600.002.patch
>
>
> In {{PlanCommand#getThresholdPercentage}}
> {code}
>  private double getThresholdPercentage(CommandLine cmd) {
> 
> if ((value <= 0.0) || (value > 100.0)) {
>   value = getConf().getDouble(
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT,
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT_DEFAULT);
> }
> return value;
>   }
> {code}
> {{DISK_THROUGHPUT}} has the unit of "MB", so it does not make sense to return 
> {{throughput}} as a percentage value.
> Btw, we should use {{THROUGHPUT}} instead of {{THRUPUT}}.
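
For illustration, a sketch of the intended shape of the fix; the option name,
config key, and default below are assumptions, not the committed patch:

{code}
// Sketch only: fall back to a percentage default, not the disk-throughput
// setting (whose unit is MB), when the parsed threshold is out of range.
// 'cmd' is an org.apache.commons.cli.CommandLine, as in the original.
private double getThresholdPercentage(CommandLine cmd) {
  double value = 0.0;
  String opt = cmd.getOptionValue("thresholdPercentage"); // option name assumed
  if (opt != null) {
    value = Double.parseDouble(opt);
  }
  if ((value <= 0.0) || (value > 100.0)) {
    value = getConf().getDouble(
        "dfs.disk.balancer.plan.threshold.percent", // hypothetical key
        10.0);                                      // hypothetical default, in percent
  }
  return value;
}
{code}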






[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10626:
-
Attachment: HDFS-10626.002.patch

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-10626.001.patch, HDFS-10626.002.patch
>
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It is important info that can help us understand why datanode 
> reportBadBlocks failed.






[jira] [Commented] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378755#comment-15378755
 ] 

Yiqun Lin commented on HDFS-10626:
--

Thanks [~shahrs87] and [~yzhangal] for the review. Posted a new patch 
addressing the comments.
The patch changes the code from 
{code}
LOG.warn("Cannot report bad " + block.getBlockId(), e);
{code}
to
{code}
LOG.warn("Cannot report bad " + block, ie);
{code}
It will print out the detailed info of the bad block.
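
For reference, the two-argument {{warn(String, Throwable)}} overload is what
attaches the stack trace; the patched handler then reads (context trimmed):

{code}
try {
  scanner.datanode.reportBadBlocks(block, volume);
} catch (IOException ie) {
  // 'ie' is the exception thrown by reportBadBlocks itself, and 'block'
  // prints its full toString() rather than just the block id.
  LOG.warn("Cannot report bad " + block, ie);
}
{code}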

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-10626.001.patch, HDFS-10626.002.patch
>
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It is important info that can help us understand why datanode 
> reportBadBlocks failed.






[jira] [Updated] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10632:
-
Status: Patch Available  (was: Open)

Attached a simple patch to fix this.

> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Yiqun Lin
>Priority: Trivial
>  Labels: supportability
> Attachments: HDFS-10632.001.patch
>
>
> In DataXceiver#writeBlock:
> {code}
> LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
> + localAddress);
> {code}
> It would be better for this message to also report the size of the block 
> it's receiving.
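
For illustration, one possible shape of the improved message; using
{{getNumBytes()}} as the reported length is my assumption, not necessarily what
the patch does:

{code}
// Sketch only: also report how many bytes of the block are being received.
LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
    + localAddress + " of size " + block.getNumBytes());
{code}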






[jira] [Updated] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10632:
-
Attachment: HDFS-10632.001.patch

> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Yiqun Lin
>Priority: Trivial
>  Labels: supportability
> Attachments: HDFS-10632.001.patch
>
>
> In DataXceiver#writeBlock:
> {code}
> LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
> + localAddress);
> {code}
> It would be better for this message to also report the size of the block 
> it's receiving.






[jira] [Assigned] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin reassigned HDFS-10632:


Assignee: Yiqun Lin

> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Yiqun Lin
>Priority: Trivial
>  Labels: supportability
>
> In DataXceiver#writeBlock:
> {code}
> LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
> + localAddress);
> {code}
> It would be better for this message to also report the size of the block 
> it's receiving.






[jira] [Commented] (HDFS-10614) Appended blocks can be closed even before IBRs from DataNodes

2016-07-14 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378688#comment-15378688
 ] 

Jing Zhao commented on HDFS-10614:
--

Thanks for the fix and review, [~vinayrpet] and [~ajisakaa]. The patch also 
looks good to me. One question: before we remove the block from the 
storageInfo, can we also add an extra check to make sure the reported block's 
GS is greater than the stored block's? That way the logic will be the same as 
{{setGenerationStampAndVerifyReplicas}} in {{updatePipeline}}.
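
In code, the suggested extra check might look roughly like the following
sketch; the variable and method names are illustrative, not from the patch:

{code}
// Sketch only: act on the report only when the reported generation stamp is
// newer than the stored one, mirroring the check performed by
// setGenerationStampAndVerifyReplicas in updatePipeline.
if (reported.getGenerationStamp() > storedBlock.getGenerationStamp()) {
  // ... remove the block from the storageInfo and continue the update ...
}
{code}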

> Appended blocks can be closed even before IBRs from DataNodes
> -
>
> Key: HDFS-10614
> URL: https://issues.apache.org/jira/browse/HDFS-10614
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-10614.01.patch, HDFS-10614.02.patch
>
>
> Scenario:
>    1. Open the file for append().
>    2. Trigger append pipeline setup by adding some data.
>    3. Consider that the RECEIVING IBRs of the DNs reach the NN first.
>    4. The updatePipeline() rpc is sent to the namenode to update the pipeline.
>    5. Now, if complete() is called on the file even before the pipeline is 
> closed, the block will be COMPLETE even before it is actually FINALIZED on 
> the DN side, and the file will be closed.






[jira] [Commented] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378642#comment-15378642
 ] 

Hadoop QA commented on HDFS-10628:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
23s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 12s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 97m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDataTransferKeepalive |
|   | hadoop.hdfs.server.namenode.snapshot.TestSnapshotFileLength |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818039/HDFS-10628.002.patch |
| JIRA Issue | HDFS-10628 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 2358a4540623 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / e549a9a |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16064/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16064/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16064/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.8.0
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch, HDFS-10628.002.patch

[jira] [Commented] (HDFS-8940) Support for large-scale multi-tenant inotify service

2016-07-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378555#comment-15378555
 ] 

Ming Ma commented on HDFS-8940:
---

[~drankye], it might come from the following assumption: only admins can use 
the existing inotify functionality, whereas with this feature, any user should 
be able to use it.

> Support for large-scale multi-tenant inotify service
> 
>
> Key: HDFS-8940
> URL: https://issues.apache.org/jira/browse/HDFS-8940
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
> Attachments: Large-Scale-Multi-Tenant-Inotify-Service.pdf
>
>
> HDFS-6634 provides the core inotify functionality. We would like to extend 
> that to provide a large-scale service that tens of thousands of clients can 
> subscribe to.






[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378546#comment-15378546
 ] 

Wei-Chiu Chuang commented on HDFS-10587:


I see. So for the logs I saw, there are no "Appending to" messages. I think the 
replica was created without being in the FINALIZED state.

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that an incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data was garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it treated whatever was transferred as 
> acknowledged. However, the visible length of the replica differed (rounded 
> up to the next multiple of 512) from that of the source of the transfer. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended on the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was never 
> replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, 
> it wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, though not corrupt. 
> Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()   = /hadoop-g/data/current
>   getBlockFile()= 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816






[jira] [Updated] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-10632:
-
Labels: supportability  (was: )

> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Priority: Trivial
>  Labels: supportability
>
> In DataXceiver#writeBlock:
> {code}
> LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
> + localAddress);
> {code}
> It would be better for this message to also report the size of the block 
> it's receiving.






[jira] [Updated] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-10632:
-
   Priority: Trivial  (was: Major)
Component/s: hdfs
 datanode

> DataXceiver to report the length of the block it's receiving
> 
>
> Key: HDFS-10632
> URL: https://issues.apache.org/jira/browse/HDFS-10632
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Priority: Trivial
>
> In DataXceiver#writeBlock:
> {code}
> LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
> + localAddress);
> {code}
> It would be better for this message to also report the size of the block 
> it's receiving.






[jira] [Commented] (HDFS-8940) Support for large-scale multi-tenant inotify service

2016-07-14 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378541#comment-15378541
 ] 

Kai Zheng commented on HDFS-8940:
-

Thanks [~mingma] for the confirmation. I have another question after reading 
your doc: in what sense does this relate to {{multi-tenant}}? If it sounds 
good, I suggest we change the title a bit to clarify. In my understanding, HDFS 
currently doesn't support multi-tenancy by itself, so inherently this design 
might not be multi-tenant either. The {{large-scale}} part already sounds good 
to have.

> Support for large-scale multi-tenant inotify service
> 
>
> Key: HDFS-8940
> URL: https://issues.apache.org/jira/browse/HDFS-8940
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
> Attachments: Large-Scale-Multi-Tenant-Inotify-Service.pdf
>
>
> HDFS-6634 provides the core inotify functionality. We would like to extend 
> that to provide a large-scale service that tens of thousands of clients can 
> subscribe to.






[jira] [Created] (HDFS-10632) DataXceiver to report the length of the block it's receiving

2016-07-14 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HDFS-10632:


 Summary: DataXceiver to report the length of the block it's 
receiving
 Key: HDFS-10632
 URL: https://issues.apache.org/jira/browse/HDFS-10632
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Yongjun Zhang


In DataXceiver#writeBlock:

{code}
LOG.info("Receiving " + block + " src: " + remoteAddress + " dest: "
+ localAddress);
{code}
It would be better for this message to also report the size of the block it's 
receiving.







[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Zhou updated HDFS-10628:
--
Attachment: HDFS-10628.002.patch

Checkstyle fixed. The failing tests are unrelated, as this is just a simple patch.

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.8.0
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch, HDFS-10628.002.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also logged this to the Balancer log, so that people can figure out why 
> the Balancer aborted.






[jira] [Comment Edited] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-07-14 Thread Mingliang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378500#comment-15378500
 ] 

Mingliang Liu edited comment on HDFS-10326 at 7/14/16 10:18 PM:


The v1 patch rebases from {{trunk}} and resolves conflicts with [HADOOP-13351] 
in unit test.


was (Author: liuml07):
The v1 patch rebases from {{trunk}} and resolve conflicts with [HADOOP-13351] 
in unit test.

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link at a 1ms RTT, 105Mbps at 10ms, 
> and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.
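
The throughput ceilings follow from the bandwidth-delay product: a fixed 128K
buffer allows at most 128K in flight per round trip. A small illustrative check
of the numbers:

{code}
public class TcpBufferMath {
  public static void main(String[] args) {
    long windowBytes = 128 * 1024; // hardcoded DEFAULT_DATA_SOCKET_SIZE
    for (double rttSec : new double[] {0.001, 0.010, 0.100}) {
      double mbps = windowBytes * 8 / rttSec / 1_000_000.0;
      System.out.printf("RTT %.0f ms -> at most %.0f Mbps%n", rttSec * 1000, mbps);
    }
    // ~1049 Mbps at 1 ms, ~105 Mbps at 10 ms, ~10 Mbps at 100 ms: the fixed
    // window saturates 1GbE only at LAN latencies, whereas leaving the socket
    // buffers unset would let the TCP stack auto-tune them.
  }
}
{code}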






[jira] [Commented] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-07-14 Thread Mingliang Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378500#comment-15378500
 ] 

Mingliang Liu commented on HDFS-10326:
--

The v1 patch rebases from {{trunk}} and resolve conflicts with [HADOOP-13351] 
in unit test.

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link at a 1ms RTT, 105Mbps at 10ms, 
> and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.






[jira] [Updated] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-07-14 Thread Mingliang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu updated HDFS-10326:
-
Attachment: HDFS-10326.001.patch

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Mingliang Liu
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link at a 1ms RTT, 105Mbps at 10ms, 
> and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.






[jira] [Assigned] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-07-14 Thread Mingliang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu reassigned HDFS-10326:


Assignee: Mingliang Liu  (was: Daryn Sharp)

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Mingliang Liu
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link at a 1ms RTT, 105Mbps at 10ms, 
> and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.






[jira] [Updated] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-07-14 Thread Mingliang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu updated HDFS-10326:
-
Assignee: Daryn Sharp  (was: Mingliang Liu)

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link at a 1ms RTT, 105Mbps at 10ms, 
> and a paltry 11Mbps over a 100ms long haul. 10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.






[jira] [Commented] (HDFS-10614) Appended blocks can be closed even before IBRs from DataNodes

2016-07-14 Thread Akira Ajisaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378492#comment-15378492
 ] 

Akira Ajisaka commented on HDFS-10614:
--

Looks good to me. +1 pending another committer's review.
cc: [~yzhangal], [~jingzhao]

> Appended blocks can be closed even before IBRs from DataNodes
> -
>
> Key: HDFS-10614
> URL: https://issues.apache.org/jira/browse/HDFS-10614
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-10614.01.patch, HDFS-10614.02.patch
>
>
> Scenario:
>    1. Open the file for append().
>    2. Trigger append pipeline setup by adding some data.
>    3. Consider that the RECEIVING IBRs of the DNs reach the NN first.
>    4. The updatePipeline() rpc is sent to the namenode to update the pipeline.
>    5. Now, if complete() is called on the file even before the pipeline is 
> closed, the block will be COMPLETE even before it is actually FINALIZED on 
> the DN side, and the file will be closed.






[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378483#comment-15378483
 ] 

Yongjun Zhang commented on HDFS-10587:
--

I think this explains why we had new data reaching the new DN after the initial 
block transfer: after adding the new DN to the pipeline and doing the block 
transfer to it, the client resumed writing data. Then, in the process, 
corruption was detected again, repeating the pipeline recovery process. 
Meanwhile, from the client's point of view, it kept getting the following 
exception:

{code}
INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.1.1.1:1110
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
{code}

Wei-Chiu and I discussed, and we think here is a more complete picture:

* 1. pipeline going on DN1 -> DN2 -> DN3
* 2. trouble at DN3, it's gone
* 3. pipeline recovery, new DN DN4 added
* 4. block transfer from DN1 to DN4, DN4's data is now a multiple of chunks.
* 5. DataStreamer resumed writing data to DN1 -> DN4 -> DN3 (this is where new 
data gets in); the first chunk DN4 got is corrupt for some reason that we are 
still searching for
* 6. DN3 detects the corruption and quits, while new data has been written to 
DN1 and DN4
* 7. goes back to step 3, new pipeline recovery starts
DN1 ->DN4 -> DN5
DN1 -> DN4 -> DN6
..

In a corner case, step 3 could be replaced with "DN3 restarted", in which case 
another block transfer would happen and might cause corruption.

Since DN1's visibleLength in step 4 is not a multiple of chunks, that fact 
might somehow be related to the corruption in step 5.

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that an incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data was garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it treated whatever was transferred as 
> acknowledged. However, the visible length of the replica differed (rounded 
> up to the next multiple of 512) from that of the source of the transfer. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended on the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was never 
> replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, 
> it wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, though not corrupt. 
> Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.

[jira] [Commented] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378466#comment-15378466
 ] 

Hadoop QA commented on HDFS-10623:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
32s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
39s{color} | {color:green} branch-2.7.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
10s{color} | {color:green} branch-2.7.3 passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} branch-2.7.3 passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} branch-2.7.3 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} branch-2.7.3 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
15s{color} | {color:green} branch-2.7.3 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
9s{color} | {color:green} branch-2.7.3 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
20s{color} | {color:green} branch-2.7.3 passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
3s{color} | {color:green} branch-2.7.3 passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1382 line(s) that end in whitespace. Use 
git apply --whitespace=fix. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m 
37s{color} | {color:red} The patch has 70 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed with JDK v1.7.0_101 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 54m 27s{color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_101. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
21s{color} | {color:red} The patch generated 3 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}138m 14s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_91 Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.server.blockmanagement.TestComputeInvalidateWork |
| JDK v1.7.0_101 Failed junit tests | 
hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.web.TestWebHdfsTokens |
|   | hadoop.hdfs.TestDataTransferKeepalive |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.namenode.TestFileTruncate |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:c42

[jira] [Commented] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378474#comment-15378474
 ] 

Hadoop QA commented on HDFS-10628:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 23s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 1 new + 38 unchanged - 0 fixed = 39 total (was 38) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 11s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 82m 39s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestFsDatasetCache |
|   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
|   | hadoop.hdfs.TestCrcCorruption |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818020/HDFS-10628.001.patch |
| JIRA Issue | HDFS-10628 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux fe31dfa596fa 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 6cf0175 |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16062/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16062/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16062/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16062/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   htt

[jira] [Commented] (HDFS-10601) Improve log message to include hostname when the NameNode is in safemode

2016-07-14 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378418#comment-15378418
 ] 

Sean Busbey commented on HDFS-10601:


+1 (non-binding)

> Improve log message to include hostname when the NameNode is in safemode
> 
>
> Key: HDFS-10601
> URL: https://issues.apache.org/jira/browse/HDFS-10601
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Minor
> Attachments: HDFS-10601.001.patch, HDFS-10601.002.patch
>
>
> When remote NN operations are involved, it would be nice to have the NameNode 
> hostname in the safemode notification log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378367#comment-15378367
 ] 

Jiayi Zhou commented on HDFS-10628:
---

Versions updated as commented.

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.8.0
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Zhou updated HDFS-10628:
--
Affects Version/s: 2.8.0
 Target Version/s: 2.8.0

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.8.0
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-10628:
---
   Priority: Minor  (was: Major)
Component/s: balancer & mover

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>Priority: Minor
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378365#comment-15378365
 ] 

Andrew Wang commented on HDFS-10628:


+1 pending Jenkins, thanks for finding and fixing this Jiayi! Do you mind 
setting the Affects and Target version appropriately? As this issue is pretty 
minor, I think we can just target 2.8.0.

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10627) Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-07-14 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378353#comment-15378353
 ] 

Daryn Sharp commented on HDFS-10627:


Rushabh and I checked a few random healthy nodes on multiple clusters.
# They are backlogged with thousands of suspect blocks on all storages.  It 
will take days to catch up – assuming no more false positives.
# They have been up for months but haven't started a rescan in over a month 
(available non-archived logs).  So obviously the false positives are trickling 
in faster than the scan rate.
# 1 node reported one bad block in the past month.  The others have not 
reported any bad blocks.
# On a large cluster, only 0.08% of pipeline recovery corruption was detected.

_The scanner is completely negligent in its duty to find and report bad 
blocks_.  Rushabh added the priority scan feature to prevent rack failure from 
causing (hopefully) temporary data loss.  Instead of waiting for up to a week 
for a suspected corrupt block to be reported, it would be reported almost 
immediately.  Well, guess what happened today?  Rack failed.  Lost data.  The 
DN knew the block was bad but was too backlogged to verify and report it.  A 
completely avoidable situation, not worth it for detecting 0.08% of pipeline 
recovery corruptions.

*This is completely broken and must be reverted to be fixed*.


> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> --
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
> reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>   volumeRef.getVolume().getStorageID(),
>   block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message didn't start with Broken pipe or Connection reset.
> But after HDFS-7686, the block is marked as suspect irrespective of the 
> exception message.
> In one of our datanodes, it took approximately a whole day (22 hours) to go 
> through all the suspect blocks to scan one corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10631) Federation State Store ZooKeeper implementation

2016-07-14 Thread Inigo Goiri (JIRA)
Inigo Goiri created HDFS-10631:
--

 Summary: Federation State Store ZooKeeper implementation
 Key: HDFS-10631
 URL: https://issues.apache.org/jira/browse/HDFS-10631
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Inigo Goiri


State Store implementation using ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378343#comment-15378343
 ] 

Hadoop QA commented on HDFS-10477:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 75m  
0s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
18s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 97m 11s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817992/HDFS-10477.005.patch |
| JIRA Issue | HDFS-10477 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux ec06a4fc50cb 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 6cf0175 |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16059/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16059/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: HDFS-10477.002.p

[jira] [Created] (HDFS-10630) Federation State Store

2016-07-14 Thread Inigo Goiri (JIRA)
Inigo Goiri created HDFS-10630:
--

 Summary: Federation State Store
 Key: HDFS-10630
 URL: https://issues.apache.org/jira/browse/HDFS-10630
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: Inigo Goiri


Interface to store the federation shared state across Routers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10629) Federation Router

2016-07-14 Thread Inigo Goiri (JIRA)
Inigo Goiri created HDFS-10629:
--

 Summary: Federation Router
 Key: HDFS-10629
 URL: https://issues.apache.org/jira/browse/HDFS-10629
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: Inigo Goiri


Component that routes calls from the clients to the right Namespace. It 
implements {{ClientProtocol}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9271) Implement basic NN operations

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378333#comment-15378333
 ] 

Hadoop QA commented on HDFS-9271:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 
48s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
47s{color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
10s{color} | {color:green} HDFS-8707 passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m  
8s{color} | {color:green} HDFS-8707 passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
18s{color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} HDFS-8707 passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 
55s{color} | {color:green} the patch passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  5m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  5m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 
22s{color} | {color:green} the patch passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  5m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  5m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  7m 40s{color} 
| {color:red} hadoop-hdfs-native-client in the patch failed with JDK 
v1.7.0_101. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 63m 19s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_91 Failed CTEST tests | 
test_libhdfs_threaded_hdfspp_test_shim_static |
| JDK v1.7.0_101 Failed CTEST tests | 
test_libhdfs_threaded_hdfspp_test_shim_static |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0cf5e66 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818005/HDFS-9271.HDFS-8707.003.patch
 |
| JIRA Issue | HDFS-9271 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 1952abccb1c6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | HDFS-8707 / d18e396 |
| Default Java | 1.7.0_101 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_91 
/usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101 |
| CTEST | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16060/artifact/patchprocess/patch-hadoop-hdfs-project_hadoop-hdfs-native-client-jdk1.8.0_91-ctest.txt
 |
| CTEST | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16060/artifact/patchprocess/patch-hadoop-hdfs-project_hadoop-hdfs-native-client-jdk1.7.0_101-ctest.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16060/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-native-client-jdk1.7.0_101.txt
 |
| JDK v1.7.0_101  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16060/testReport/ |
| modules | C: hadoop-hdfs-project

[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Zhou updated HDFS-10628:
--
Attachment: HDFS-10628.001.patch

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Zhou updated HDFS-10628:
--
Status: Patch Available  (was: Open)

Simply log the exit message in the Balancer log as well.
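
For context, a minimal sketch (illustrative only; not necessarily what the 
attached patch does, and the real Balancer uses Hadoop's own logging framework) 
of emitting the exit message to both stderr and the Balancer's log:

{code:title=ExitLoggingSketch.java|borderStyle=solid}
import java.util.logging.Logger;

// Hypothetical sketch: write the exit message to stderr (existing behavior)
// and also to the Balancer's own log, so the reason for an abort survives.
public class ExitLoggingSketch {
  private static final Logger LOG = Logger.getLogger("Balancer");

  static void logExit(String msg) {
    System.err.println(msg); // existing behavior: stderr only
    LOG.severe(msg);         // proposed: also record it in the Balancer log
  }

  public static void main(String[] args) {
    logExit("Another Balancer is running..  Exiting ...");
  }
}
{code}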

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
> Attachments: HDFS-10628.001.patch
>
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiayi Zhou updated HDFS-10628:
--
Description: Currently, the exit message is logged to stderr. It would be 
more convenient if we also log this to Balancer log when people want to figure 
out why Balancer is aborted.  (was: Currently, the exit message is logged to 
stderr. It would be better if we also log this to Balancer log when people want 
to figure out why Balancer is aborted.)

> Log HDFS Balancer exit message to its own log
> -
>
> Key: HDFS-10628
> URL: https://issues.apache.org/jira/browse/HDFS-10628
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jiayi Zhou
>Assignee: Jiayi Zhou
>
> Currently, the exit message is logged to stderr. It would be more convenient 
> if we also log this to the Balancer log when people want to figure out why 
> the Balancer is aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10628) Log HDFS Balancer exit message to its own log

2016-07-14 Thread Jiayi Zhou (JIRA)
Jiayi Zhou created HDFS-10628:
-

 Summary: Log HDFS Balancer exit message to its own log
 Key: HDFS-10628
 URL: https://issues.apache.org/jira/browse/HDFS-10628
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Jiayi Zhou
Assignee: Jiayi Zhou


Currently, the exit message is logged to stderr. It would be better if we also 
log this to the Balancer log when people want to figure out why the Balancer is 
aborted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378293#comment-15378293
 ] 

Hanisha Koneru commented on HDFS-10623:
---

Thank you [~arpitagarwal] for reviewing and committing the patch.

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Fix For: 2.7.3
>
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-10623:
-
  Resolution: Fixed
Hadoop Flags: Reviewed
   Fix Version/s: 2.7.3
Target Version/s:   (was: 2.7.3)
  Status: Resolved  (was: Patch Available)

Committed to branch-2.7 and branch-2.7.3. Thank you for the contribution 
[~hanishakoneru].

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Fix For: 2.7.3
>
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10627) Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-07-14 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378278#comment-15378278
 ] 

Daryn Sharp commented on HDFS-10627:


I agree it's unfortunate there's no feedback mechanism.  I disagree that the 
serving node should ever assume that losing a client means the block _might_ be 
corrupt.  There are many reasons the client can "unexpectedly" close the 
connection.  Processes shut down unexpectedly, get killed, etc.

The only time the DN should be suspect of its own block is a local IOE.  I 
checked some of our big clusters: a DN self-reporting a corrupt block during 
pipeline/block recovery is a fraction of a percent (which is being generous).

So let's consider...  Is it worth backlogging a DN with so many false positives 
from broken connections that:
# It takes a day to scan a legitimately bad block detected by a local IOE
# A rack failure can then cause temporary data loss
# The scanner isn't doing its primary job of trawling the storage for bad blocks
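
To make the distinction concrete, here is a minimal sketch (my paraphrase, not 
committed Hadoop code) of the guard this argues for around the {{BlockSender}} 
snippet quoted below: mark the replica suspect only when the failure is not a 
client-side disconnect.

{code:title=SuspectGuardSketch.java|borderStyle=solid}
import java.io.IOException;

// Hypothetical sketch of the conditional discussed above; not Hadoop code.
public class SuspectGuardSketch {
  static boolean shouldMarkSuspect(IOException e) {
    String ioem = e.getMessage();
    boolean clientDisconnect = ioem != null
        && (ioem.startsWith("Broken pipe")
            || ioem.startsWith("Connection reset"));
    // A local IOE implicates the replica; a dropped client does not.
    return !clientDisconnect;
  }

  public static void main(String[] args) {
    System.out.println(shouldMarkSuspect(new IOException("Broken pipe")));        // false
    System.out.println(shouldMarkSuspect(new IOException("Input/output error"))); // true
  }
}
{code}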

> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> --
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
> reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>   volumeRef.getVolume().getStorageID(),
>   block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message didn't start with Broken pipe or Connection reset.
> But after HDFS-7686, the block is marked as suspect irrespective of the 
> exception message.
> In one of our datanodes, it took approximately a whole day (22 hours) to go 
> through all the suspect blocks to scan one corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378245#comment-15378245
 ] 

Arpit Agarwal commented on HDFS-10623:
--

+1, this doesn't need a full Jenkins run.

The test builds and passes with the patch so I will commit it shortly.

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9271) Implement basic NN operations

2016-07-14 Thread Anatoli Shein (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anatoli Shein updated HDFS-9271:

Attachment: HDFS-9271.HDFS-8707.003.patch

Please review the new patch.

> Implement basic NN operations
> -
>
> Key: HDFS-9271
> URL: https://issues.apache.org/jira/browse/HDFS-9271
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Bob Hansen
>Assignee: Anatoli Shein
> Attachments: HDFS-9271.HDFS-8707.000.patch, 
> HDFS-9271.HDFS-8707.001.patch, HDFS-9271.HDFS-8707.002.patch, 
> HDFS-9271.HDFS-8707.003.patch
>
>
> Expose via C and C++ API:
> * mkdirs
> * rename
> * delete
> * stat
> * chmod
> * chown
> * getListing
> * setOwner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9271) Implement basic NN operations

2016-07-14 Thread Anatoli Shein (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378193#comment-15378193
 ] 

Anatoli Shein commented on HDFS-9271:
-

Thank you [~bobhansen] for your comments. Here is what I did:

A few comments:
* In GetBlockLocations(hdfspp.h, filesystem.cc), use offset_t or uint64_t 
rather than long. It's less ambiguous.
(/) I used uint64_t and also updated short variables to uint16_t
* In getAbsolutePath (hdfs.cc), how about returning optional(string) rather 
than an empty string on error. It makes the error state explicit and explicitly 
checked.
(/) Done (see the Java sketch after these notes)
* Make a new bug to capture supporting ".." semantics
(/) Bug filed: HDFS-10621.
* It appears the majority of hdfs_ext_test.c has been commented out. Was this 
intentional, or debugging dirt that slipped in?
(/) Debugging dirt removed
* Can we add a test for relative paths for all the functions where we added 
them in?
(/) Added this to hdfs_ext_test.c
* Can we implement hdfsMove and/or hdfsTruncateFile with just metadata 
operations?
(x) As per our conversation today, this cannot be done yet
* Move to libhdfspp implementations in hdfs_shim for 
GetDefaultBlocksize\[AtPath]
(/) Done
* Implement hdfsUnbufferFile as a no-op?
(/) Done
* Do we support single-dot relative paths? e.g. can I call hdfsGetPathInfo(fs, 
".")? Do we have tests over that?
(x) We do not support this yet. It is captured in HDFS-10621.
* Do we have tests that show that libhdfspp's getReadStatistics match libhdfs's 
getReadStatistics?
(x) We do not track which bytes were remote or local. Temporarily we assume 
everything is local.

Minor little nits:
* For the absolute path, I personally prefer abs_path = getAbsolutePath(...) 
rather than abs_path(getAbsolutePath). They both compile to the same thing (see 
https://en.wikipedia.org/wiki/Return_value_optimization); I think the 
whitespace with the assignment makes the separation between the what and the 
content cleaner.
(/) Done
* Refactor CheckSystemAndHandle to use CheckHandle 
(https://en.wikipedia.org/wiki/Don't_repeat_yourself)
(/) Done
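
As a side note on the optional-return point above: the patch itself is C++, so 
the following is only a hypothetical Java analogue of making the error state 
explicit rather than signaling it with an empty string.

{code:title=AbsolutePathSketch.java|borderStyle=solid}
import java.util.Optional;

// Hypothetical Java analogue of returning optional(string) on error;
// the names and path semantics here are illustrative only.
public class AbsolutePathSketch {
  static Optional<String> getAbsolutePath(String cwd, String path) {
    if (path == null || path.isEmpty()) {
      return Optional.empty(); // the error state is explicit and must be checked
    }
    return Optional.of(path.startsWith("/") ? path : cwd + "/" + path);
  }

  public static void main(String[] args) {
    System.out.println(getAbsolutePath("/user/test", "data").orElse("<error>"));
    System.out.println(getAbsolutePath("/user/test", "").orElse("<error>"));
  }
}
{code}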

> Implement basic NN operations
> -
>
> Key: HDFS-9271
> URL: https://issues.apache.org/jira/browse/HDFS-9271
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Bob Hansen
>Assignee: Anatoli Shein
> Attachments: HDFS-9271.HDFS-8707.000.patch, 
> HDFS-9271.HDFS-8707.001.patch, HDFS-9271.HDFS-8707.002.patch
>
>
> Expose via C and C++ API:
> * mkdirs
> * rename
> * delete
> * stat
> * chmod
> * chown
> * getListing
> * setOwner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Status: Patch Available  (was: In Progress)

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Status: In Progress  (was: Patch Available)

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Attachment: HDFS-10623-branch-2.7.3.000.patch

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623-branch-2.7.3.000.patch, HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-8897) Loadbalancer always exits with : java.io.IOException: Another Balancer is running.. Exiting ...

2016-07-14 Thread John Zhuge (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-8897 started by John Zhuge.

> Loadbalancer always exits with : java.io.IOException: Another Balancer is 
> running..  Exiting ...
> 
>
> Key: HDFS-8897
> URL: https://issues.apache.org/jira/browse/HDFS-8897
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.7.1
> Environment: Centos 6.6
>Reporter: LINTE
>Assignee: John Zhuge
>
> When the balancer is launched, it should test whether there is already a 
> /system/balancer.id file in HDFS.
> Even when the file doesn't exist, the balancer refuses to run: 
> 15/08/14 16:35:12 INFO balancer.Balancer: namenodes  = [hdfs://sandbox/, 
> hdfs://sandbox]
> 15/08/14 16:35:12 INFO balancer.Balancer: parameters = 
> Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration 
> = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
> Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> java.io.IOException: Another Balancer is running..  Exiting ...
> Aug 14, 2015 4:35:14 PM  Balancing took 2.408 seconds
> Looking at the audit log file when trying to run the balancer, the balancer 
> create the /system/balancer.id and then delete it on exiting ... 
> 2015-08-14 16:37:45,844 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:45,900 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=create  
> src=/system/balancer.id dst=null perm=hdfs:hadoop:rw-r-  
> proto=rpc
> 2015-08-14 16:37:45,919 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,090 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,112 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,117 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=delete  
> src=/system/balancer.id dst=null perm=null   proto=rpc
> The error seems to be located in 
> org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java. 
> The function checkAndMarkRunning returns null even if /system/balancer.id 
> doesn't exist before entering this function; if it exists, it is deleted 
> and the balancer exits with the same error.
> 
>   private OutputStream checkAndMarkRunning() throws IOException {
> try {
>   if (fs.exists(idPath)) {
> // try appending to it so that it will fail fast if another balancer 
> is
> // running.
> IOUtils.closeStream(fs.append(idPath));
> fs.delete(idPath, true);
>   }
>   final FSDataOutputStream fsout = fs.create(idPath, false);
>   // mark balancer idPath to be deleted during filesystem closure
>   fs.deleteOnExit(idPath);
>   if (write2IdFile) {
> fsout.writeBytes(InetAddress.getLocalHost().getHostName());
> fsout.hflush();
>   }
>   return fsout;
> } catch(RemoteException e) {
>   
> if(AlreadyBeingCreatedException.class.getName().equals(e.getClassName())){
> return null;
>   } else {
> throw e;
>   }
> }
>   }
> 
> Regards
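
One plausible reading of the log above: the namenodes list contains both 
{{hdfs://sandbox/}} and {{hdfs://sandbox}}, so if those are treated as two 
distinct namespaces, two connectors would race on the same 
{{/system/balancer.id}}, and the second {{create(idPath, false)}} would fail. 
Below is a toy sketch of that race (a plain-Java stand-in, not the real 
{{NameNodeConnector}}):

{code:title=BalancerIdRaceSketch.java|borderStyle=solid}
import java.util.HashSet;
import java.util.Set;

// Toy model: the Set stands in for HDFS, and add() plays the role of
// create(idPath, false), which fails if the path already exists.
public class BalancerIdRaceSketch {
  static final Set<String> fs = new HashSet<>();

  static boolean checkAndMarkRunning(String idPath) {
    return fs.add(idPath); // false == "Another Balancer is running"
  }

  public static void main(String[] args) {
    // Two connectors created for what is really the same namespace:
    System.out.println(checkAndMarkRunning("/system/balancer.id")); // true
    System.out.println(checkAndMarkRunning("/system/balancer.id")); // false
  }
}
{code}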



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-

[jira] [Assigned] (HDFS-8897) Loadbalancer always exits with : java.io.IOException: Another Balancer is running.. Exiting ...

2016-07-14 Thread John Zhuge (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge reassigned HDFS-8897:


Assignee: John Zhuge

> Loadbalancer always exits with : java.io.IOException: Another Balancer is 
> running..  Exiting ...
> 
>
> Key: HDFS-8897
> URL: https://issues.apache.org/jira/browse/HDFS-8897
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.7.1
> Environment: Centos 6.6
>Reporter: LINTE
>Assignee: John Zhuge
>
> When the balancer is launched, it should test whether there is already a 
> /system/balancer.id file in HDFS.
> Even when the file doesn't exist, the balancer refuses to run: 
> 15/08/14 16:35:12 INFO balancer.Balancer: namenodes  = [hdfs://sandbox/, 
> hdfs://sandbox]
> 15/08/14 16:35:12 INFO balancer.Balancer: parameters = 
> Balancer.Parameters[BalancingPolicy.Node, threshold=10.0, max idle iteration 
> = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
> Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/08/14 16:35:14 INFO block.BlockTokenSecretManager: Setting block keys
> 15/08/14 16:35:14 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> java.io.IOException: Another Balancer is running..  Exiting ...
> Aug 14, 2015 4:35:14 PM  Balancing took 2.408 seconds
> Looking at the audit log file when trying to run the balancer, the balancer 
> create the /system/balancer.id and then delete it on exiting ... 
> 2015-08-14 16:37:45,844 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:45,900 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=create  
> src=/system/balancer.id dst=null perm=hdfs:hadoop:rw-r-  
> proto=rpc
> 2015-08-14 16:37:45,919 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,090 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,112 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=getfileinfo 
> src=/system/balancer.id dst=null perm=null   proto=rpc
> 2015-08-14 16:37:46,117 INFO FSNamesystem.audit: allowed=true   
> ugi=hdfs@SANDBOX.HADOOP (auth:KERBEROS) ip=/x.x.x.x   cmd=delete  
> src=/system/balancer.id dst=null perm=null   proto=rpc
> The error seems to be located in 
> org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java. 
> The function checkAndMarkRunning returns null even if /system/balancer.id 
> doesn't exist before entering this function; if it exists, it is deleted 
> and the balancer exits with the same error.
> 
>   private OutputStream checkAndMarkRunning() throws IOException {
> try {
>   if (fs.exists(idPath)) {
> // try appending to it so that it will fail fast if another balancer 
> is
> // running.
> IOUtils.closeStream(fs.append(idPath));
> fs.delete(idPath, true);
>   }
>   final FSDataOutputStream fsout = fs.create(idPath, false);
>   // mark balancer idPath to be deleted during filesystem closure
>   fs.deleteOnExit(idPath);
>   if (write2IdFile) {
> fsout.writeBytes(InetAddress.getLocalHost().getHostName());
> fsout.hflush();
>   }
>   return fsout;
> } catch(RemoteException e) {
>   
> if(AlreadyBeingCreatedException.class.getName().equals(e.getClassName())){
> return null;
>   } else {
> throw e;
>   }
> }
>   }
> 
> Regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: h

[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2016-07-14 Thread yunjiong zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yunjiong zhao updated HDFS-10477:
-
Attachment: HDFS-10477.005.patch

Updated the patch to fix the unit test.
When called by tests like 
TestDefaultBlockPlacementPolicy.testPlacementWithLocalRackNodesDecommissioned, 
it might not hold the write lock.
Thanks [~rakeshr] for the suggestion on the InterruptedException. The handler 
thread is a daemon thread, but you are right, it's better to call 
Thread.currentThread().interrupt() to preserve the interrupt status.
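
For reference, a minimal sketch (illustrative, not the patch itself) of the 
pattern being discussed: restore the interrupt flag instead of swallowing the 
exception, so callers further up the stack can still observe it.

{code:title=InterruptStatusSketch.java|borderStyle=solid}
// Hypothetical sketch of re-asserting the interrupt status after catching
// InterruptedException; not code from the attached patch.
public class InterruptStatusSketch {
  public static void main(String[] args) throws Exception {
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(10_000);
      } catch (InterruptedException e) {
        // sleep() clears the flag when it throws; set it again so the
        // interrupt is visible to whoever inspects this thread later.
        Thread.currentThread().interrupt();
      }
      System.out.println("interrupted = "
          + Thread.currentThread().isInterrupted()); // prints true
    });
    worker.start();
    worker.interrupt();
    worker.join();
  }
}
{code}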

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.patch
>
>
> In our cluster, when we stop decommissioning a rack which has 46 DataNodes, 
> it locked the Namesystem for about 7 minutes, as the log below shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 

[jira] [Commented] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378101#comment-15378101
 ] 

Hadoop QA commented on HDFS-10623:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  4s{color} 
| {color:red} HDFS-10623 does not apply to trunk. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817991/HDFS-10623.000.patch |
| JIRA Issue | HDFS-10623 |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16058/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Status: Patch Available  (was: Open)

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Attachment: HDFS-10623.000.patch

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
> Attachments: HDFS-10623.000.patch
>
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10623) Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.

2016-07-14 Thread Hanisha Koneru (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDFS-10623:
--
Target Version/s: 2.7.3

> Remove unused import of httpclient.HttpConnection from TestWebHdfsTokens.
> -
>
> Key: HDFS-10623
> URL: https://issues.apache.org/jira/browse/HDFS-10623
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Jitendra Nath Pandey
>Assignee: Hanisha Koneru
>
> TestWebHdfsTokens imports httpclient.HttpConnection, and causes an unnecessary 
> reference to httpclient. This can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7686) Re-add rapid rescan of possibly corrupt block feature to the block scanner

2016-07-14 Thread Rakesh R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh R updated HDFS-7686:
---
Assignee: Colin P. McCabe  (was: Rakesh R)

> Re-add rapid rescan of possibly corrupt block feature to the block scanner
> --
>
> Key: HDFS-7686
> URL: https://issues.apache.org/jira/browse/HDFS-7686
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Colin P. McCabe
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: HDFS-7686.002.patch, HDFS-7686.003.patch, 
> HDFS-7686.004.patch
>
>
> When doing a transferTo (aka sendfile operation) from the DataNode to a 
> client, we may hit an I/O error from the disk.  If we believe this is the 
> case, we should be able to tell the block scanner to rescan that block soon.  
> The feature was originally implemented in HDFS-7548 but was removed by 
> HDFS-7430.  We should re-add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-7686) Re-add rapid rescan of possibly corrupt block feature to the block scanner

2016-07-14 Thread Rakesh R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh R reassigned HDFS-7686:
--

Assignee: Rakesh R  (was: Colin P. McCabe)

> Re-add rapid rescan of possibly corrupt block feature to the block scanner
> --
>
> Key: HDFS-7686
> URL: https://issues.apache.org/jira/browse/HDFS-7686
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: HDFS-7686.002.patch, HDFS-7686.003.patch, 
> HDFS-7686.004.patch
>
>
> When doing a transferTo (aka sendfile operation) from the DataNode to a 
> client, we may hit an I/O error from the disk.  If we believe this is the 
> case, we should be able to tell the block scanner to rescan that block soon.  
> The feature was originally implemented in HDFS-7548 but was removed by 
> HDFS-7430.  We should re-add it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10600) PlanCommand#getThrsholdPercentage should not use throughput value.

2016-07-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377889#comment-15377889
 ] 

Hudson commented on HDFS-10600:
---

SUCCESS: Integrated in Hadoop-trunk-Commit #10100 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/10100/])
HDFS-10600. PlanCommand#getThrsholdPercentage should not use throughput (lei: 
rev 382dff74751b745de28a212df4897f525111d228)
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/diskbalancer/command/PlanCommand.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DiskBalancer.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java


> PlanCommand#getThrsholdPercentage should not use throughput value.
> --
>
> Key: HDFS-10600
> URL: https://issues.apache.org/jira/browse/HDFS-10600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: diskbalancer
>Affects Versions: 2.9.0, 3.0.0-beta1
>Reporter: Lei (Eddy) Xu
>Assignee: Yiqun Lin
> Fix For: 3.0.0-alpha1
>
> Attachments: HDFS-10600.001.patch, HDFS-10600.002.patch
>
>
> In {{PlanCommand#getThresholdPercentage}}
> {code}
>  private double getThresholdPercentage(CommandLine cmd) {
> 
> if ((value <= 0.0) || (value > 100.0)) {
>   value = getConf().getDouble(
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT,
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT_DEFAULT);
> }
> return value;
>   }
> {code}
> {{DISK_THROUGHPUT}} has the unit of "MB", so it does not make sense to return 
> {{throughput}} as a percentage value.
> Btw, we should use {{THROUGHPUT}} instead of {{THRUPUT}}.
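
For context, a hedged sketch of the corrected fallback; the option and threshold key names below are assumptions for illustration, not necessarily what the committed patch uses:

{code}
// Sketch: fall back to a percentage default, not the disk throughput
// value, which is in MB and is not a percentage.
private double getThresholdPercentage(CommandLine cmd) {
  double value = 0.0;
  if (cmd.hasOption(PLAN_THRESHOLD_OPTION)) {        // option name assumed
    value = Double.parseDouble(cmd.getOptionValue(PLAN_THRESHOLD_OPTION));
  }
  if ((value <= 0.0) || (value > 100.0)) {
    value = getConf().getDouble(
        DFSConfigKeys.DFS_DISK_BALANCER_PLAN_THRESHOLD,          // key assumed
        DFSConfigKeys.DFS_DISK_BALANCER_PLAN_THRESHOLD_DEFAULT); // key assumed
  }
  return value;
}
{code}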



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10467) Router-based HDFS federation

2016-07-14 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377362#comment-15377362
 ] 

Jing Zhao commented on HDFS-10467:
--

Assigned the jira to [~elgoiri]. 

bq. Probably, it's a good idea to create a new branch for this effort.

+1. I've created the feature branch HDFS-10467. Please feel free to use it for 
the next steps of development.

> Router-based HDFS federation
> 
>
> Key: HDFS-10467
> URL: https://issues.apache.org/jira/browse/HDFS-10467
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Affects Versions: 2.7.2
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
> Attachments: HDFS Router Federation.pdf, HDFS-10467.PoC.001.patch, 
> HDFS-10467.PoC.patch, HDFS-Router-Federation-Prototype.patch
>
>
> Add a Router to provide a federated view of multiple HDFS clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10600) PlanCommand#getThrsholdPercentage should not use throughput value.

2016-07-14 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu updated HDFS-10600:
-
  Resolution: Fixed
   Fix Version/s: (was: 2.9.0)
  3.0.0-alpha1
Target Version/s: 3.0.0-beta1  (was: 2.9.0, 3.0.0-beta1)
  Status: Resolved  (was: Patch Available)

+1. Thanks for the fix, [~linyiqun]. 

Committed to trunk.

> PlanCommand#getThrsholdPercentage should not use throughput value.
> --
>
> Key: HDFS-10600
> URL: https://issues.apache.org/jira/browse/HDFS-10600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: diskbalancer
>Affects Versions: 2.9.0, 3.0.0-beta1
>Reporter: Lei (Eddy) Xu
>Assignee: Yiqun Lin
> Fix For: 3.0.0-alpha1
>
> Attachments: HDFS-10600.001.patch, HDFS-10600.002.patch
>
>
> In {{PlanCommand#getThresholdPercentage}}
> {code}
>  private double getThresholdPercentage(CommandLine cmd) {
> 
> if ((value <= 0.0) || (value > 100.0)) {
>   value = getConf().getDouble(
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT,
>   DFSConfigKeys.DFS_DISK_BALANCER_MAX_DISK_THRUPUT_DEFAULT);
> }
> return value;
>   }
> {code}
> {{DISK_THROUGHPUT}} has the unit of "MB", so it does not make sense to return 
> {{throughput}} as a percentage value.
> Btw, we should use {{THROUGHPUT}} instead of {{THRUPUT}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10467) Router-based HDFS federation

2016-07-14 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-10467:
-
Assignee: Inigo Goiri

> Router-based HDFS federation
> 
>
> Key: HDFS-10467
> URL: https://issues.apache.org/jira/browse/HDFS-10467
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Affects Versions: 2.7.2
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
> Attachments: HDFS Router Federation.pdf, HDFS-10467.PoC.001.patch, 
> HDFS-10467.PoC.patch, HDFS-Router-Federation-Prototype.patch
>
>
> Add a Router to provide a federated view of multiple HDFS clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-10626:
-
Labels: supportability  (was: )

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner logs an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, 
> which was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It will be an important piece of info that can help us know why datanode 
> reportBadBlocks failed.
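
The one-line fix implied above, sketched:

{code}
} catch (IOException ie) {
  // Log the exception thrown by reportBadBlocks itself (ie), not the
  // scan failure (e) that was passed into handle().
  LOG.warn("Cannot report bad " + block.getBlockId(), ie);
}
{code}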



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377289#comment-15377289
 ] 

Yongjun Zhang commented on HDFS-10626:
--

Hi [~linyiqun],

Thanks for reporting and working on the issue. I made a similar comment here

https://issues.apache.org/jira/browse/HDFS-10625?focusedCommentId=15377247&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15377247

I think it's helpful to report more info about the replica when there is any 
issue.

Thanks.


> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner logs an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, 
> which was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It will be an important piece of info that can help us know why datanode 
> reportBadBlocks failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10627) Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-07-14 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377268#comment-15377268
 ] 

Wei-Chiu Chuang commented on HDFS-10627:


[~shahrs87] this piece of code can be useful during a block transfer due to a prior 
pipeline recovery.
In that case, if the receiver detects corruption in the replica, it 
immediately resets the connection.

If the block transfer is not initiated due to pipeline recovery, the receiver also 
notifies the NameNode that the source's replica is corrupt (this is actually not 
accurate, because the corruption may be due to other issues, which causes the 
bug described in HDFS-6804).

In short, I think this code is still necessary, because of the lack of a feedback 
mechanism in block transfer during pipeline recovery.

> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> --
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
> reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>   volumeRef.getVolume().getStorageID(),
>   block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message doesn't start with Broken pipe or Connection reset.
> But after HDFS-7686, the block is marked as corrupt irrespective of the 
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to go 
> through all the suspect blocks in order to scan one corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377247#comment-15377247
 ] 

Yongjun Zhang edited comment on HDFS-10625 at 7/14/16 4:44 PM:
---

Thanks [~shahrs87] for the patch and [~jojochuang] for the comment.

It'd be nice to include the length of the replica, the visible length, the on-disk 
length, etc. in the report too. I suggest using the same format as used here (see 
HDFS-10587),
 
{code}
2016-04-15 22:03:05,066 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
getNumBytes() = 41381376
getBytesOnDisk() = 41381376
getVisibleLength()= 41186444
getVolume() = /hadoop-i/data/current
getBlockFile() = 
/hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
bytesAcked=41186444
bytesOnDisk=41381376
{code}
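
As a sketch, the warning could carry those fields directly; the accessors match the replica dump above, while the log call itself is illustrative rather than the patch:

{code}
// Illustrative: enrich the "Reporting bad block" warning with replica state.
LOG.warn("Reporting bad " + block + " on " + volume.getBasePath()
    + "; numBytes=" + replica.getNumBytes()
    + ", bytesOnDisk=" + replica.getBytesOnDisk()
    + ", visibleLength=" + replica.getVisibleLength(), e);
{code}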




was (Author: yzhangal):
Thanks [~shahrs87] for the patch and [~jojochuang] for the comment.

It'd be nice to include the length of the replica in the report too.


>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10627) Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-07-14 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-10627:
--
Description: 
In the BlockSender code,
{code:title=BlockSender.java|borderStyle=solid}
if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
reset")) {
  LOG.error("BlockSender.sendChunks() exception: ", e);
}
datanode.getBlockScanner().markSuspectBlock(
  volumeRef.getVolume().getStorageID(),
  block);
{code}

Before HDFS-7686, the block was marked as suspect only if the exception message 
doesn't start with Broken pipe or Connection reset.
But after HDFS-7686, the block is marked as corrupt irrespective of the 
exception message.

On one of our datanodes, it took approximately a whole day (22 hours) to go 
through all the suspect blocks in order to scan one corrupt block.


  was:
In the BlockSender code,
{code:title=BlockSender.java|borderStyle=solid}
if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
reset")) {
  LOG.error("BlockSender.sendChunks() exception: ", e);
}
datanode.getBlockScanner().markSuspectBlock(
  volumeRef.getVolume().getStorageID(),
  block);
{code}

Before HDFS-7686, the block was marked as suspect only if the exception message 
doesn't start with Broken pipe or Connection reset.
But after HDFS-7686, the block is marked as corrupt irrespectively of the 
exception message.

On one of our datanodes, it took approximately a whole day (22 hours) to go 
through all the suspect blocks in order to scan one corrupt block.



> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> --
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
> reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>   volumeRef.getVolume().getStorageID(),
>   block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message doesn't start with Broken pipe or Connection reset.
> But after HDFS-7686, the block is marked as corrupt irrespective of the 
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to go 
> through all the suspect blocks in order to scan one corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10627) Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-07-14 Thread Rushabh S Shah (JIRA)
Rushabh S Shah created HDFS-10627:
-

 Summary: Volume Scanner mark a block as "suspect" even if the 
block sender encounters 'Broken pipe' or 'Connection reset by peer' exception
 Key: HDFS-10627
 URL: https://issues.apache.org/jira/browse/HDFS-10627
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 2.7.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah


In the BlockSender code,
{code:title=BlockSender.java|borderStyle=solid}
if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
reset")) {
  LOG.error("BlockSender.sendChunks() exception: ", e);
}
datanode.getBlockScanner().markSuspectBlock(
  volumeRef.getVolume().getStorageID(),
  block);
{code}

Before HDFS-7686, the block was marked as suspect only if the exception message 
doesn't start with Broken pipe or Connection reset.
But after HDFS-7686, the block is marked as corrupt irrespectively of the 
exception message.

On one of our datanodes, it took approximately a whole day (22 hours) to go 
through all the suspect blocks in order to scan one corrupt block.
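
A sketch of the pre-HDFS-7686 guard this report asks to restore; the point is that markSuspectBlock sits inside the condition (surrounding code abbreviated):

{code}
String ioem = e.getMessage();
if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
  LOG.error("BlockSender.sendChunks() exception: ", e);
  // Only mark the replica suspect when the failure is not a benign
  // client disconnect.
  datanode.getBlockScanner().markSuspectBlock(
      volumeRef.getVolume().getStorageID(), block);
}
{code}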




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377247#comment-15377247
 ] 

Yongjun Zhang commented on HDFS-10625:
--

Thanks [~shahrs87] for the patch and [~jojochuang] for the comment.

It'd be nice to include the length of the replica in the report too.


>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377228#comment-15377228
 ] 

Wei-Chiu Chuang commented on HDFS-10625:


Looks good to me.
VolumeScanner uses BlockSender and a null stream to verify the integrity of the 
block. If the block is corrupt, {{BlockSender#verifyChecksum}} throws a 
ChecksumException, which details the location of the corruption.

{code}
throw new ChecksumException("Checksum failed at " + failedPos,
failedPos);
{code}
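
So a handler could surface that offset when reporting, along these lines (a hedged sketch; the handler shape is illustrative, though {{ChecksumException#getPos}} is a real accessor):

{code}
public void handle(ExtendedBlock block, IOException e) {
  if (e instanceof ChecksumException) {
    ChecksumException ce = (ChecksumException) e;
    // getPos() carries the offset at which checksum verification failed.
    LOG.warn("Reporting bad " + block + ": checksum failed at offset "
        + ce.getPos(), ce);
  } else {
    LOG.warn("Reporting bad " + block, e);
  }
}
{code}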

>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10596) libhdfs++: Implement hdfsFileIsEncrypted

2016-07-14 Thread James Clampffer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377213#comment-15377213
 ] 

James Clampffer commented on HDFS-10596:


{code}
16/07/05 17:12:27 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
{code}

Usually this warning isn't really a big deal; it just means it couldn't find 
libhadoop.so, which provides some functionality that needs to be done in 
native code, e.g. short circuit reads, optimized crc implementations (crc may be 
back in java now), etc.  I don't know much about the security stuff, but this 
doesn't look related to the main issue.

Regarding the RemoteException - have you tried looking at the namenode logs 
(turn the logging level to debug if possible, as shown below) to see if it's 
actually getting the right info about the key service?  Possibly also look at 
the key service logs, if applicable.  There might be some useful warnings in 
there that could help point you in the right direction.
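
For example, one way to raise the NameNode log level is a log4j.properties line like the following; adjust the logger name to the component you are chasing:

{noformat}
# Illustrative: enable DEBUG for the NameNode in log4j.properties
log4j.logger.org.apache.hadoop.hdfs.server.namenode=DEBUG
{noformat}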

> libhdfs++: Implement hdfsFileIsEncrypted
> 
>
> Key: HDFS-10596
> URL: https://issues.apache.org/jira/browse/HDFS-10596
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Anatoli Shein
> Attachments: HDFS-10596.HDFS-8707.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377192#comment-15377192
 ] 

Hadoop QA commented on HDFS-10625:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
32s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 74m  
7s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 96m 43s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817952/HDFS-10625.patch |
| JIRA Issue | HDFS-10625 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 8fc773df0d4f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 54bf14f |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16057/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16057/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may repor

[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377160#comment-15377160
 ] 

Yongjun Zhang commented on HDFS-10587:
--

Thanks a lot [~vinayrpet] and [~xupeng]!

As Vinay pointed out, the case Xupeng described looks similar, but the corruption 
position is not like in this case. I think HDFS-6937 will help with Xupeng's case.

Vinay: About {{recoverRbw}}, since the data the destination DN (the new DN) 
received is valid data, does not truncating at the new DN hurt? 

We actually allow a different visibleLength at different replicas, see 
https://issues.apache.org/jira/browse/HDFS-10587?focusedCommentId=15374480&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15374480

Though I originally hoped that the block transfer would preserve the visibleLength, 
so that in the block transfer, the target DN can have the same visibleLength as 
the source DN.

Assuming it's ok to have a different visibleLength at the new DN, the block 
transfer seems to have a side effect, such that the new chunk after the block 
transfer at the new DN appears corrupted.

Another thing is, if the pipeline recovery is failing, see

https://issues.apache.org/jira/browse/HDFS-10587?focusedCommentId=15376467&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15376467

why do we have more data reaching the new DN (I mean the chunk after block 
transfer)?

Thanks.


 


> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery may 
> cause block corruption and result in missing blocks under a very unfortunate 
> scenario. 
> (1) A client established pipeline and started writing data to the pipeline.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data were unacknowledged.
> (3) Client replaced the failed data node with a new one, initiating block 
> transfer to copy existing data in the block to the new datanode.
> (4) The block is transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer, and wrote the entire buffer to 
> disk, therefore some written data is garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which made its visible length as the length of 
> bytes on disk. That is to say, it thought whatever was transferred was 
> acknowledged. However, the visible length of the replica is different (round 
> up to the next multiple of 512) than the source of transfer. [1]
> (7) Client then truncated the block in the attempt to remove unacknowledged 
> data. However, because the visible length is equivalent of the bytes on disk, 
> it did not truncate unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, it 
> wouldn’t tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally the DN that had the only healthy replica was restarted. NameNode 
> then updated the pipeline to only contain the corrupt replica.
> (11) Client continued to write to the corrupt replica, because neither the client 
> nor the data node itself knows the replica is corrupt. When the restarted 
> datanodes come back, their replicas are stale, even though they are not corrupt. 
> Therefore, none of the replicas is good and up to date.
> The sequence of events was reconstructed based on DataNode/NameNode log and 
> my understanding of code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the re

[jira] [Commented] (HDFS-10328) Add per-cache-pool default replication num configuration

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377045#comment-15377045
 ] 

xupeng commented on HDFS-10328:
---

[~cmccabe]

Thanks a lot for the review :)

> Add per-cache-pool default replication num configuration
> 
>
> Key: HDFS-10328
> URL: https://issues.apache.org/jira/browse/HDFS-10328
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: caching
>Reporter: xupeng
>Assignee: xupeng
>Priority: Minor
> Fix For: 2.9.0
>
> Attachments: HDFS-10328.001.patch, HDFS-10328.002.patch, 
> HDFS-10328.003.patch, HDFS-10328.004.patch
>
>
> For now, hdfs cacheadmin cannot set a default replication num for cache 
> directives in the same cachepool. Each cache directive added in the same cache 
> pool has to set its own replication num individually. 
> Consider this situation: we add daily hive tables into the cache pool "hive". Each 
> time I have to set the same replication num for every table directive in the 
> same cache pool.  
> I think we should enable setting a default replication num for a cachepool, so 
> that every cache directive in the pool can inherit the replication configuration 
> from the pool. A cache directive can still override the replication configuration 
> explicitly by calling the "add & modify directive -replication" command from the 
> cli.
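
An illustrative cacheadmin session under the proposed behavior; the pool-level default flag is the new piece and its exact name is an assumption, while the directive flags follow existing cacheadmin conventions:

{noformat}
# Set a default replication for the pool (flag name assumed):
hdfs cacheadmin -addPool hive -defaultReplication 3
# Directives added without -replication would inherit the pool default:
hdfs cacheadmin -addDirective -path /user/hive/warehouse/t1 -pool hive
# An explicit -replication still overrides the pool default:
hdfs cacheadmin -modifyDirective -id 12 -replication 2
{noformat}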



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah reassigned HDFS-10625:
-

Assignee: Rushabh S Shah

>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-10625:
--
Attachment: HDFS-10625.patch

>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10625) VolumeScanner to report why a block is found bad

2016-07-14 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HDFS-10625:
--
Status: Patch Available  (was: Open)

Please review.

>  VolumeScanner to report why a block is found bad
> -
>
> Key: HDFS-10625
> URL: https://issues.apache.org/jira/browse/HDFS-10625
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Reporter: Yongjun Zhang
>Assignee: Rushabh S Shah
>  Labels: supportability
> Attachments: HDFS-10625.patch
>
>
> VolumeScanner may report:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> blk_1170125248_96458336 on /d/dfs/dn
> {code}
> It would be helpful to report the reason why the block is bad, especially 
> when the block is corrupt, where is the first corrupted chunk in the block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376964#comment-15376964
 ] 

Rushabh S Shah commented on HDFS-10626:
---

lgtm 
+1 (non-binding)

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner logs an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, 
> which was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It will be an important piece of info that can help us know why datanode 
> reportBadBlocks failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376940#comment-15376940
 ] 

Vinayakumar B commented on HDFS-10587:
--

bq. org.apache.hadoop.fs.ChecksumException: Checksum error: 
DFSClient_NONMAPREDUCE_2019484565_1 at 81920 exp: 1352119728 got: -1012279895
Here it says the checksum error is at 81920, which is at the very beginning itself. 
Maybe the 229 disk has some problem, or some corruption happened due to the 
network card during the transfer to 77. 
It is not exactly the same as the current case.

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery may 
> cause block corruption and result in missing blocks under a very unfortunate 
> scenario. 
> (1) A client established pipeline and started writing data to the pipeline.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data were unacknowledged.
> (3) Client replaced the failed data node with a new one, initiating block 
> transfer to copy existing data in the block to the new datanode.
> (4) The block is transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer, and wrote the entire buffer to 
> disk, therefore some written data is garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which made its visible length as the length of 
> bytes on disk. That is to say, it thought whatever was transferred was 
> acknowledged. However, the visible length of the replica is different (round 
> up to the next multiple of 512) than the source of transfer. [1]
> (7) Client then truncated the block in the attempt to remove unacknowledged 
> data. However, because the visible length is equivalent of the bytes on disk, 
> it did not truncate unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, it 
> wouldn’t tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally the DN that had the only healthy replica was restarted. NameNode 
> then updated the pipeline to only contain the corrupt replica.
> (11) Client continued to write to the corrupt replica, because neither the client 
> nor the data node itself knows the replica is corrupt. When the restarted 
> datanodes come back, their replicas are stale, even though they are not corrupt. 
> Therefore, none of the replicas is good and up to date.
> The sequence of events was reconstructed based on DataNode/NameNode log and 
> my understanding of code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()   = /hadoop-g/data/current
>   getBlockFile()= 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376941#comment-15376941
 ] 

Hadoop QA commented on HDFS-10626:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 60m 
44s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 80m 35s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:9560f25 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817934/HDFS-10626.001.patch |
| JIRA Issue | HDFS-10626 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux f07ba19e828c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / be26c1b |
| Default Java | 1.8.0_91 |
| findbugs | v3.0.0 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16056/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16056/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.



> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner prints an incorrect IOEx

[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376897#comment-15376897
 ] 

xupeng commented on HDFS-10587:
---

Hi [~vinayrpet],

The related logs are listed below.

134.228
{noformat}
2016-07-13 11:48:29,528 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataTransfer: Transmitted blk_1116167880_42905642 (numBytes=9911790) to 
/10.6.134.229:5080
2016-07-13 11:48:29,552 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.130.44:26319 dest: 
/10.6.134.228:5080
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42905642
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes() = 9912487
  getBytesOnDisk()  = 9912487
  getVisibleLength()= 9911790
  getVolume()   = /current
  getBlockFile()= /current/current/rbw/blk_1116167880
  bytesAcked=9911790
  bytesOnDisk=9912487
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: blockFile=/current/current/rbw/blk_1116167880, 
metaFile=/current/current/rbw/blk_1116167880_42905642.meta, oldlen=9912487, 
newlen=9911790
2016-07-13 11:49:01,566 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42906656 src: /10.6.130.44:26617 dest: 
/10.6.134.228:5080
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42906656
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42906656, RBW
  getNumBytes() = 15104963
  getBytesOnDisk()  = 15104963
  getVisibleLength()= 15102415
  getVolume()   = /current
  getBlockFile()= /current/current/rbw/blk_1116167880
  bytesAcked=15102415
  bytesOnDisk=15104963
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: blockFile=/current/rbw/blk_1116167880, 
metaFile=/current/rbw/blk_1116167880_42906656.meta, oldlen=15104963, 
newlen=15102415
2016-07-13 11:49:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Datanode 2 got response for connect ack  from downstream datanode with 
firstbadlink as 10.6.129.77:5080
2016-07-13 11:49:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Datanode 2 forwarding connect ack to upstream firstbadlink is 10.6.129.77:5080
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: blk_1116167880_42907145, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2225)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1179)
at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for blk_1116167880_42907145
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
{noformat}


134.229
{noformat}
2016-07-13 11:48:29,488 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.134.228:24286 dest: 
/10.6.134.229:5080
2016-07-13 11:48:29,516 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Convert 
blk_1116167880_42905642 from Temporary to RBW, visible length=9912320
2016-07-13 11:48:29,552 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.134.228:24321 dest: 
/10.6.134.229:5080
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42905642
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes() = 9912320
  getBytesOnDisk()  = 9912320
  getVisibleLength()= 9912320
  getVolume()   = /current
  getBlockFile()= /current/rbw/blk_1116167880
  bytesAcked=9912320
  bytesOnDisk=9912320

2016-07-13 11:49:01,501 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: blk_1116167880_42906656, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.IOException: Connection reset by peer
2016-07-13 11:49:01,505 IN

[jira] [Issue Comment Deleted] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xupeng updated HDFS-10587:
--
Comment: was deleted

(was: And here are the logs:

HBase log
{noformat}
 2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD],
 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]

2016-07-13 11:49:01,499 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 from datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:49:01,500 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.134.229:5080,DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]

2016-07-13 11:49:01,566 INFO  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.6.129.77:5080
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
{noformat}

10.6.128.215 log
{noformat}
2016-07-13 11:48:29,555 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 src: 
/10.6.134.229:19009 dest: /10.6.128.215:5080
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes() = 9912487
  getBytesOnDisk()  = 9912487
  getVisibleLength()= 9911790
  getVolume()   = /data12/yarn/dndata/current
  getBlockFile()= 
/data12/yarn/dndata/current/BP-448958278-10.6.130.96-1457941856632/current/rbw/blk_1116167880
  bytesAcked=9911790
  bytesOnDisk=9912487
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: 
blockFile=/data12/yarn/dndata/current/BP-448958278-10.6.130.96-1457941856632/current/rbw/blk_1116167880,
 
metaFile=/data12/yarn/dndata/current/BP-448958278-10.6.130.96-1457941856632/current/rbw/blk_1116167880_42905642.meta,
 oldlen=99124

[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10626:
-
Description: 
VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
The related code:
{code}
public void handle(ExtendedBlock block, IOException e) {
  FsVolumeSpi volume = scanner.volume;
  ...
  try {
scanner.datanode.reportBadBlocks(block, volume);
  } catch (IOException ie) {
// This is bad, but not bad enough to shut down the scanner.
LOG.warn("Cannot report bad " + block.getBlockId(), e);
  }
}
{code}
The IOException printed in the log should be {{ie}} rather than {{e}}, which 
was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
It will be important info that can help us understand why the datanode's 
reportBadBlocks failed.

  was:
VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
The related code:
{code}
public void handle(ExtendedBlock block, IOException e) {
  FsVolumeSpi volume = scanner.volume;
  ...
  try {
scanner.datanode.reportBadBlocks(block, volume);
  } catch (IOException ie) {
// This is bad, but not bad enough to shut down the scanner.
LOG.warn("Cannot report bad " + block.getBlockId(), e);
  }
}
{code}
The IOException printed in the log should be {{ie}} rather than {{e}}, which 
was passed into the method {{handle(ExtendedBlock block, IOException e)}}.


> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.
> It will be important info that can help us understand why the datanode's 
> reportBadBlocks failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10626:
-
Attachment: HDFS-10626.001.patch

Attaching a simple patch with a minor change.
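
For reference, a minimal sketch of the one-line change the description implies, 
i.e. passing the caught {{ie}} to the logger instead of the handler argument 
{{e}} (an illustration of the fix based on the snippet in the description, not 
necessarily the exact contents of the attached patch):
{code}
public void handle(ExtendedBlock block, IOException e) {
  FsVolumeSpi volume = scanner.volume;
  // ... (unchanged lines elided, as in the snippet quoted above)
  try {
    scanner.datanode.reportBadBlocks(block, volume);
  } catch (IOException ie) {
    // This is bad, but not bad enough to shut down the scanner.
    // Log the exception thrown by reportBadBlocks itself ("ie"),
    // not the one the handler was invoked with ("e").
    LOG.warn("Cannot report bad " + block.getBlockId(), ie);
  }
}
{code}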

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
> Attachments: HDFS-10626.001.patch
>
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10626:
-
Status: Patch Available  (was: Open)

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDFS-10626:
-
Priority: Minor  (was: Major)

> VolumeScanner prints incorrect IOException in reportBadBlocks operation
> ---
>
> Key: HDFS-10626
> URL: https://issues.apache.org/jira/browse/HDFS-10626
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Minor
>
> VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
> The related code:
> {code}
> public void handle(ExtendedBlock block, IOException e) {
>   FsVolumeSpi volume = scanner.volume;
>   ...
>   try {
> scanner.datanode.reportBadBlocks(block, volume);
>   } catch (IOException ie) {
> // This is bad, but not bad enough to shut down the scanner.
> LOG.warn("Cannot report bad " + block.getBlockId(), e);
>   }
> }
> {code}
> The IOException printed in the log should be {{ie}} rather than {{e}}, which 
> was passed into the method {{handle(ExtendedBlock block, IOException e)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-10626) VolumeScanner prints incorrect IOException in reportBadBlocks operation

2016-07-14 Thread Yiqun Lin (JIRA)
Yiqun Lin created HDFS-10626:


 Summary: VolumeScanner prints incorrect IOException in 
reportBadBlocks operation
 Key: HDFS-10626
 URL: https://issues.apache.org/jira/browse/HDFS-10626
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Yiqun Lin
Assignee: Yiqun Lin


VolumeScanner prints an incorrect IOException in {{datanode.reportBadBlocks}}. 
The related code:
{code}
public void handle(ExtendedBlock block, IOException e) {
  FsVolumeSpi volume = scanner.volume;
  ...
  try {
scanner.datanode.reportBadBlocks(block, volume);
  } catch (IOException ie) {
// This is bad, but not bad enough to shut down the scanner.
LOG.warn("Cannot report bad " + block.getBlockId(), e);
  }
}
{code}
The IOException printed in the log should be {{ie}} rather than {{e}}, which 
was passed into the method {{handle(ExtendedBlock block, IOException e)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376794#comment-15376794
 ] 

Vinayakumar B commented on HDFS-10587:
--

Hi [~xupeng], can you add the 229 and 77 logs for this block as well, 
including the transfer-related logs?

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data is garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it assumed whatever was transferred had 
> been acknowledged. However, the visible length of the replica is different 
> (rounded up to the next multiple of 512) from that of the transfer source. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it 
> wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, despite not being 
> corrupt. Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two 
> independent clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()   = /hadoop-g/data/current
>   getBlockFile()= 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816
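
To make steps (5) through (7) of the description concrete, here is a small 
illustrative sketch (not code from the patch) that works through the 
chunk-boundary arithmetic with the bytesAcked values quoted above, assuming 
the default 512-byte checksum chunk (bytesPerChecksum):
{code}
public class ChunkPaddingExample {
  public static void main(String[] args) {
    final long bytesPerChecksum = 512;    // default checksum chunk size
    final long senderAcked = 41186444L;   // sender bytesAcked / visible length

    // Valid bytes in the partial last chunk: 41186444 % 512 = 140.
    long validInLastChunk = senderAcked % bytesPerChecksum;

    // The destination writes the whole reserved chunk, so its replica is
    // rounded up to the next chunk boundary: 80443 * 512 = 41186816, which
    // matches the receiver's bytesAcked/bytesOnDisk above.
    long receiverLength =
        (senderAcked + bytesPerChecksum - 1) / bytesPerChecksum * bytesPerChecksum;

    // The temporary-to-rbw conversion treats bytesOnDisk as acknowledged, so
    // the 512 - 140 = 372 padding bytes are never truncated by the client.
    long garbageBytes = receiverLength - senderAcked;

    System.out.println(validInLastChunk + ", " + receiverLength + ", " + garbageBytes);
  }
}
{code}
Running it prints 140, 41186816, 372: the receiver's length matches its logged 
bytesAcked/bytesOnDisk, and the 372 padding bytes are the garbage that later 
appends never replace.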



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376739#comment-15376739
 ] 

xupeng edited comment on HDFS-10587 at 7/14/16 10:54 AM:
-

Hi all,

I encountered the same issue. Here is the scenario:

- HBase was writing a block to the pipeline 10.6.134.228, 10.6.128.215, 
10.6.128.208
- DN 10.6.128.208 restarted
- pipeline recovery added a new datanode, 10.6.134.229, to the pipeline
- the client sent the transfer_block command, and 10.6.134.228 copied the 
block file to the new datanode 10.6.134.229
- the client continued writing data
- datanode 10.6.128.215 restarted
- pipeline recovery added a new datanode, 10.6.129.77, to the pipeline
- the client sent the transfer_block command, and 10.6.134.229 copied the 
block file to the new datanode 10.6.129.77
- 129.77 threw "java.io.IOException: Unexpected checksum mismatch"


was (Author: xupener):
Hi all,

I encountered the same issue. Here is the scenario:

a. HBase was writing a block to the pipeline 10.6.134.228, 10.6.128.215, 
10.6.128.208
b. DN 10.6.128.208 restarted
c. pipeline recovery added a new datanode, 10.6.134.229, to the pipeline
d. the client sent the transfer_block command, and 10.6.134.228 copied the 
block file to the new datanode 10.6.134.229
e. the client continued writing data
f. datanode 10.6.128.215 restarted
g. pipeline recovery added a new datanode, 10.6.129.77, to the pipeline
h. 129.77 threw "java.io.IOException: Unexpected checksum mismatch"

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data is garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it assumed whatever was transferred had 
> been acknowledged. However, the visible length of the replica is different 
> (rounded up to the next multiple of 512) from that of the transfer source. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it 
> wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, despite not being 
> corrupt. Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two 
> independent clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver

[jira] [Comment Edited] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376750#comment-15376750
 ] 

xupeng edited comment on HDFS-10587 at 7/14/16 10:52 AM:
-

And here are the logs:

HBase log
{noformat}
 2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD],
 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]

2016-07-13 11:49:01,499 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 from datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:49:01,500 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.134.229:5080,DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]

2016-07-13 11:49:01,566 INFO  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.6.129.77:5080
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
{noformat}

10.6.128.215 log
{noformat}
2016-07-13 11:48:29,555 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 src: 
/10.6.134.229:19009 dest: /10.6.128.215:5080
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes() = 9912487
  getBytesOnDisk()  = 9912487
  getVisibleLength()= 9911790
  getVolume()   = /data12/yarn/dndata/current
  getBlockFile()= 
/data12/yarn/dndata/current/BP-448958278-10.6.130.96-1457941856632/current/rbw/blk_1116167880
  bytesAcked=9911790
  bytesOnDisk=9912487
2016-07-13 11:48:29,555 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: 
blockFile=/data12/yarn/dndata/current/BP-448958278-10.6.130.96-1457941856632/current/rbw/blk_1116167880,
 
metaFile=/data12/yarn/dndata/current/BP-448958278-10.6

[jira] [Comment Edited] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376750#comment-15376750
 ] 

xupeng edited comment on HDFS-10587 at 7/14/16 10:49 AM:
-

And here are the logs:

HBase log
{noformat}
 2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD],
 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]

2016-07-13 11:49:01,499 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 from datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:49:01,500 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.134.229:5080,DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]

2016-07-13 11:49:01,566 INFO  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.6.129.77:5080
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
{noformat}


was (Author: xupener):
And here are the logs:

HBase log
--
 2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:508

[jira] [Comment Edited] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376750#comment-15376750
 ] 

xupeng edited comment on HDFS-10587 at 7/14/16 10:47 AM:
-

And here are the logs:

HBase log
--
 2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD],
 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]

2016-07-13 11:49:01,499 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 from datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)

2016-07-13 11:49:01,500 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.134.229:5080,DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]

2016-07-13 11:49:01,566 INFO  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.6.129.77:5080
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)


was (Author: xupener):
And here are the logs:

HBase log
--
2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)
2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880

[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376750#comment-15376750
 ] 

xupeng commented on HDFS-10587:
---

And here are the logs:

HBase log
--
2016-07-13 11:48:29,475 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 from datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)
2016-07-13 11:48:29,476 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42905642 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD],
 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.208:5080,DS-b20d6263-ef6b-46ba-9613-faf6d24231da,SSD]
2016-07-13 11:49:01,499 WARN  [ResponseProcessor for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] hdfs.DFSClient: 
DFSOutputStream ResponseProcessor exception  for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
java.io.IOException: Bad response ERROR for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 from datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:909)
2016-07-13 11:49:01,500 WARN  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Error Recovery for block 
BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656 in pipeline 
DatanodeInfoWithStorage[10.6.134.228:5080,DS-ad10b254-5803-4109-a550-e07444a129c9,SSD],
 
DatanodeInfoWithStorage[10.6.134.229:5080,DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a,SSD],
 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]:
 bad datanode 
DatanodeInfoWithStorage[10.6.128.215:5080,DS-0f4dfb1f-225c-44cd-928a-f7420bcd96b9,SSD]
2016-07-13 11:49:01,566 INFO  [DataStreamer for file 
/ssd2/hbase_tsdb22/WALs/n6-130-044.byted.org,31356,1468326625039/n6-130-044.byted.org%2C31356%2C1468326625039.null1.1468381657104
 block BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656] 
hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 10.6.129.77:5080
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1293)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
>

[jira] [Commented] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption

2016-07-14 Thread xupeng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376739#comment-15376739
 ] 

xupeng commented on HDFS-10587:
---

Hi all,

I encountered the same issue. Here is the scenario:

a. HBase was writing a block to the pipeline 10.6.134.228, 10.6.128.215, 
10.6.128.208
b. DN 10.6.128.208 restarted
c. pipeline recovery added a new datanode, 10.6.134.229, to the pipeline
d. the client sent the transfer_block command, and 10.6.134.228 copied the 
block file to the new datanode 10.6.134.229
e. the client continued writing data
f. datanode 10.6.128.215 restarted
g. pipeline recovery added a new datanode, 10.6.129.77, to the pipeline
h. 129.77 threw "java.io.IOException: Unexpected checksum mismatch"

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> 
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and 
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk, so some of the written data is garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which set its visible length to the length of 
> the bytes on disk. That is to say, it assumed whatever was transferred had 
> been acknowledged. However, the visible length of the replica is different 
> (rounded up to the next multiple of 512) from that of the transfer source. [1]
> (7) The client then truncated the block in an attempt to remove 
> unacknowledged data. However, because the visible length equals the bytes on 
> disk, it did not truncate the unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it 
> wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, despite not being 
> corrupt. Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and 
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two 
> independent clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()   = /hadoop-i/data/current
>   getBlockFile()= 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes() = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()   = /hadoop-g/data/current
>   getBlockFile()= 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, 

[jira] [Commented] (HDFS-10570) Remove classpath conflicts of netty-all jar in hadoop-hdfs-client

2016-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376706#comment-15376706
 ] 

Hadoop QA commented on HDFS-10570:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
55s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green} branch-2.8 passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
30s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
18s{color} | {color:green} branch-2.8 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} branch-2.8 passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} branch-2.8 passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
0s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed with JDK v1.8.0_91 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed with JDK v1.7.0_101 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  
5s{color} | {color:green} hadoop-hdfs-client in the patch passed with JDK 
v1.7.0_101. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 33m  3s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:5af2af1 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817915/HDFS-10570-branch-2.8-02.patch
 |
| JIRA Issue | HDFS-10570 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  xml  |
| uname | Linux 817aa2a2a808 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | branch-2.8 / cbd885b |
| Default Java | 1.7.0_101 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_91 
/usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101 |
| JDK v1.7.0_101  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16055/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-client U: 
hadoop-hdfs-project/hadoop-hdfs-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/16055/console |
| Powered by | Apache Yetus   http://yetus.apache.org |

[jira] [Commented] (HDFS-10570) Remove classpath conflicts of netty-all jar in hadoop-hdfs-client

2016-07-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376679#comment-15376679
 ] 

Hudson commented on HDFS-10570:
---

SUCCESS: Integrated in Hadoop-trunk-Commit #10097 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/10097/])
HDFS-10570. Remove classpath conflicts of netty-all jar in (stevel: rev 
1fa1fab695355fadb7898efb6d0d03fc88513466)
* hadoop-hdfs-project/hadoop-hdfs-client/pom.xml


> Remove classpath conflicts of netty-all jar in hadoop-hdfs-client
> -
>
> Key: HDFS-10570
> URL: https://issues.apache.org/jira/browse/HDFS-10570
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
>Priority: Minor
> Attachments: HDFS-10570-01.patch, HDFS-10570-02.patch, 
> HDFS-10570-branch-2.8-02.patch
>
>
> While debugging tests in Eclipse, the DN HTTP URL cannot be accessed. 
> WebHdfs tests also cannot run in Eclipse because classes are loaded from an 
> old version of the netty jars instead of the netty-all jar.
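
For anyone hitting a similar conflict, a quick way to see which jar actually 
supplies a netty class is to ask the class loader. This is a diagnostic 
sketch only (the class name and the choice of probe class are illustrative, 
not part of the HDFS-10570 patch):

{code:java}
// Diagnostic sketch, not part of the patch: print which jar a netty class is
// loaded from, to spot old netty jars shadowing netty-all on the classpath.
public class NettyClasspathCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // Any class expected to come from netty-all works as a probe here.
        Class<?> c = Class.forName("io.netty.handler.codec.http.HttpRequest");
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}
{code}

If the printed location points at an older netty artifact rather than the 
netty-all jar, the conflict described above is present.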



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10570) Remove classpath conflicts of netty-all jar in hadoop-hdfs-client

2016-07-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376650#comment-15376650
 ] 

Steve Loughran commented on HDFS-10570:
---

+1, committed to trunk.

I haven't cherry-picked it to branch-2 yet; adding a patch for Yetus to test 
there before doing that.

> Remove classpath conflicts of netty-all jar in hadoop-hdfs-client
> -
>
> Key: HDFS-10570
> URL: https://issues.apache.org/jira/browse/HDFS-10570
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.8.0
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
>Priority: Minor
> Attachments: HDFS-10570-01.patch, HDFS-10570-02.patch, 
> HDFS-10570-branch-2.8-02.patch
>
>
> While debugging tests in Eclipse, the DN HTTP URL cannot be accessed. 
> WebHdfs tests also cannot run in Eclipse because classes are loaded from an 
> old version of the netty jars instead of the netty-all jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


