[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442880#comment-15442880
 ] 

Hudson commented on HDFS-4660:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10363 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/10363/])
HDFS-10652. Add a unit test for HDFS-4660. Contributed by Vinayakumar (yzhang: 
rev c25817159af17753b398956cfe6ff14984801b01)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java


> Block corruption can happen during pipeline recovery
> 
>
> Key: HDFS-4660
> URL: https://issues.apache.org/jira/browse/HDFS-4660
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.0.3-alpha, 3.0.0-alpha1
>Reporter: Peng Zhang
>Assignee: Kihwal Lee
>Priority: Blocker
> Fix For: 2.7.1, 2.6.4
>
> Attachments: HDFS-4660.br26.patch, HDFS-4660.patch, HDFS-4660.patch, 
> HDFS-4660.v2.patch, periodic_hflush.patch
>
>
> pipeline DN1  DN2  DN3
> stop DN2
> pipeline added node DN4 located at 2nd position
> DN1  DN4  DN3
> recover RBW
> DN4 after recover rbw
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134144
>   getBytesOnDisk() = 134144
>   getVisibleLength()= 134144
> end at chunk (134144/512=262)
> DN3 after recover rbw
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134028 
>   getBytesOnDisk() = 134028
>   getVisibleLength()= 134028
> client send packet after recover pipeline
> offset=133632  len=1008
> DN4 after flush 
> 2013-04-01 21:02:31,779 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1063
> // meta end position should be ceil(134640/512)*4 + 7 == 1059, but now it is 
> 1063.
> DN3 after flush
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, 
> lastPacketInBlock=false, offsetInBlock=134640, 
> ackEnqueueNanoTime=8817026136871545)
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing 
> meta file offset of block 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 
> 1055 to 1051
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1059
> After checking the meta file on DN4, I found that the checksum of chunk 262 is 
> duplicated, but the data is not.
> Later, after the block was finalized, DN4's scanner detected the bad block and 
> reported it to the NameNode. The NameNode sent a command to delete this block 
> and to re-replicate it from another DN in the pipeline to satisfy the 
> replication factor.
> I think this is because BlockReceiver skips data bytes that are already 
> written, but does not skip the checksum bytes that are already written. The 
> function adjustCrcFilePosition is only used for the last incomplete chunk, 
> not for this situation.
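
The offsets in the report can be rechecked with a little arithmetic. The sketch below is illustrative only and is not Hadoop code: the class and method names are hypothetical, and the constants (512-byte chunks, 4 checksum bytes per chunk, a 7-byte meta header) are taken from the numbers in the report. It assumes the on-disk length and the packet offset both fall on chunk boundaries, as they do for DN4 here (134144 = 262 * 512, 133632 = 261 * 512). The expected length rounds the chunk count up, because a trailing partial chunk also carries a checksum.

```java
// Hypothetical helper, not Hadoop code: reproduces the meta-file arithmetic from the report.
final class MetaOffsetSketch {
    static final int BYTES_PER_CHECKSUM = 512; // chunk size implied by "134144/512"
    static final int CHECKSUM_SIZE = 4;        // checksum bytes per chunk (the "*4" in the report)
    static final int HEADER_SIZE = 7;          // meta-file header bytes (the "+ 7" in the report)

    /** Number of chunks needed to cover len bytes, counting a trailing partial chunk. */
    static long chunksCovering(long len) {
        return (len + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    }

    /** Expected meta-file length for a replica holding dataLen bytes of block data. */
    static long expectedMetaLength(long dataLen) {
        return HEADER_SIZE + chunksCovering(dataLen) * CHECKSUM_SIZE;
    }

    /**
     * Meta length if every checksum in the packet is appended, even for chunks whose
     * data is already on disk (the behaviour described in the report).
     * Assumes onDiskLen is chunk-aligned.
     */
    static long metaLengthWithoutSkip(long onDiskLen, int packetLen) {
        return expectedMetaLength(onDiskLen) + chunksCovering(packetLen) * CHECKSUM_SIZE;
    }

    /** Meta length if checksums for chunks already on disk are skipped along with their data. */
    static long metaLengthWithSkip(long onDiskLen, long packetOffset, int packetLen) {
        long newEnd = Math.max(onDiskLen, packetOffset + packetLen);
        return expectedMetaLength(newEnd);
    }

    public static void main(String[] args) {
        long onDisk = 134144L;       // DN4 after recovering the RBW replica
        long packetOffset = 133632L; // packet resent by the client
        int packetLen = 1008;
        System.out.println("meta before packet: " + expectedMetaLength(onDisk));
        System.out.println("without skip:       " + metaLengthWithoutSkip(onDisk, packetLen));
        System.out.println("with skip:          "
            + metaLengthWithSkip(onDisk, packetOffset, packetLen));
    }
}
```

With the values from the log, the first helper gives 1055 for both DN4 (134144 bytes) and DN3 (134028 bytes) before the packet, and the two variants give 1063 and 1059 after it. The 4-byte difference is the one duplicated checksum.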



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-25 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15437454#comment-15437454
 ] 

Yongjun Zhang commented on HDFS-4660:
-

Thank you very much [~nroberts]!




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-25 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15437036#comment-15437036
 ] 

Nathan Roberts commented on HDFS-4660:
--

Hi [~yzhangal]. Had to go back to an old git stash, but I'll attach a sample 
patch to TeraOutputFormat.
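
The patch itself is not reproduced in this thread. As a rough sketch of the idea only, the hypothetical wrapper below flushes after every N records; it is not the attached patch and not TeraOutputFormat code. The class name, the `writeRecord` method and the `interval` parameter are assumptions made for illustration. On HDFS the flush action would presumably be `FSDataOutputStream.hflush()`, passed as a method reference; here a plain `java.io.Flushable` stands in so the example has no Hadoop dependency.

```java
import java.io.ByteArrayOutputStream;
import java.io.Flushable;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper: writes records and triggers a flush action every `interval` records.
final class PeriodicFlushWriter {
    private final OutputStream out;
    private final Flushable flushAction; // e.g. fsOut::hflush on an HDFS output stream (assumption)
    private final int interval;
    private int recordsSinceFlush = 0;

    PeriodicFlushWriter(OutputStream out, Flushable flushAction, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.out = out;
        this.flushAction = flushAction;
        this.interval = interval;
    }

    /** Writes one record and flushes once `interval` records have accumulated. */
    void writeRecord(byte[] record) throws IOException {
        out.write(record);
        recordsSinceFlush++;
        if (recordsSinceFlush >= interval) {
            flushAction.flush();
            recordsSinceFlush = 0;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // interval = 1 flushes after every record, the most aggressive setting.
        PeriodicFlushWriter writer = new PeriodicFlushWriter(sink, sink::flush, 1);
        writer.writeRecord("record-0\n".getBytes());
        writer.writeRecord("record-1\n".getBytes());
        System.out.println("bytes written: " + sink.size());
    }
}
```

Frequent flushes keep many small packets in flight, which makes it more likely that a pipeline recovery lands in the middle of a block.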




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-24 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435861#comment-15435861
 ] 

Yongjun Zhang commented on HDFS-4660:
-

Hi [~nroberts],

Thanks for your earlier work here. Would you please explain how you did the 
first step,

"Modify teragen to hflush() every 1 records"

in the following comment?

https://issues.apache.org/jira/browse/HDFS-4660?focusedCommentId=14542862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14542862

Thanks much.






[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-07-12 Thread Wei-Chiu Chuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373488#comment-15373488
 ] 

Wei-Chiu Chuang commented on HDFS-4660:
---

Hello [~kihwal], we are seeing a similar bug on a CDH5.5 cluster that already 
has this fix (HDFS-4660), so it may be a different bug. Would you please take 
a look at HDFS-10587? We've analyzed the log and reconstructed the sequence of 
events, and we are in the process of creating a unit test.

Thanks!




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-02-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128690#comment-15128690
 ] 

Junping Du commented on HDFS-4660:
--

I have committed the 2.6 patch to branch-2.6. Thanks [~kihwal] for updating the 
patch.



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-01-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080542#comment-15080542
 ] 

Junping Du commented on HDFS-4660:
--

Hi [~kihwal], shall we backport this patch to 2.6.x branch?



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590364#comment-14590364
 ] 

Hudson commented on HDFS-4660:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590334#comment-14590334
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java


> Block corruption can happen during pipeline recovery
> 
>
> Key: HDFS-4660
> URL: https://issues.apache.org/jira/browse/HDFS-4660
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.0.3-alpha
>Reporter: Peng Zhang
>Assignee: Kihwal Lee
>Priority: Blocker
> Fix For: 2.7.1
>
> Attachments: HDFS-4660.patch, HDFS-4660.patch, HDFS-4660.v2.patch
>
>
> pipeline DN1  DN2  DN3
> stop DN2
> pipeline added node DN4 located at 2nd position
> DN1  DN4  DN3
> recover RBW
> DN4 after recover rbw
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134144
>   getBytesOnDisk() = 134144
>   getVisibleLength()= 134144
> end at chunk (134144/512=262)
> DN3 after recover rbw
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134028 
>   getBytesOnDisk() = 134028
>   getVisibleLength()= 134028
> client sends a packet after pipeline recovery
> offset=133632  len=1008
> DN4 after flush 
> 2013-04-01 21:02:31,779 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1063
> // meta end position should be ceil(134640/512)*4 + 7 == 1059, but now it is 
> 1063.
> DN3 after flush
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, 
> lastPacketInBlock=false, offsetInBlock=134640, 
> ackEnqueueNanoTime=8817026136871545)
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing 
> meta file offset of block 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 
> 1055 to 1051
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1059
> After checking the meta file on DN4, I found that the checksum of chunk 262 
> is duplicated, but the data is not.
> Later, after the block was finalized, DN4's scanner detected the bad block and 
> reported it to the NN. The NN sent a command to delete this block and to 
> re-replicate it from another DN in the pipeline to satisfy the replication 
> factor.
> I think this is because BlockReceiver skips data bytes that were already 
> written, but does not skip checksum bytes that were already written. The 
> function adjustCrcFilePosition is only used for the last incomplete chunk, 
> not for this situation.
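
The meta offsets in the log above can be checked with a small model. The sketch below is an illustration only and is not the BlockReceiver code: the class and method names are hypothetical, and it assumes a 7-byte meta header, 512-byte chunks and 4-byte checksums (the values used in the report), a packet that starts on a chunk boundary, and a packet that ends at or after the data already on disk.

```java
// Illustrative model only; the names here are hypothetical and not part of HDFS.
final class MetaOffsetSketch {
    static final int HEADER_SIZE = 7;          // meta file header bytes (value from the report)
    static final int BYTES_PER_CHECKSUM = 512; // data bytes covered by one checksum
    static final int CHECKSUM_SIZE = 4;        // bytes per stored checksum

    /** Meta file length for a replica holding dataLength bytes: one checksum per started chunk. */
    static long expectedMetaLength(long dataLength) {
        long chunks = (dataLength + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        return HEADER_SIZE + chunks * CHECKSUM_SIZE;
    }

    /**
     * Meta file length after a packet is applied to a replica that already holds
     * onDiskLen bytes. Assumes packetOffset is chunk-aligned and the packet ends
     * at or beyond onDiskLen.
     */
    static long metaLengthAfterPacket(long onDiskLen, long packetOffset,
                                      long packetLen, boolean skipWrittenChecksums) {
        long meta = expectedMetaLength(onDiskLen);
        // A trailing partial chunk on disk has its checksum rewritten, so the
        // write position is first moved back by one checksum.
        if (onDiskLen % BYTES_PER_CHECKSUM != 0) {
            meta -= CHECKSUM_SIZE;
        }
        long packetChecksums = (packetLen + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        // Full chunks in the packet that are already on disk.
        long overlap = Math.max(0L, onDiskLen - packetOffset);
        long fullChunksOnDisk = overlap / BYTES_PER_CHECKSUM;
        long skipped = skipWrittenChecksums ? fullChunksOnDisk * CHECKSUM_SIZE : 0;
        return meta + packetChecksums * CHECKSUM_SIZE - skipped;
    }

    public static void main(String[] args) {
        long offset = 133632, len = 1008; // packet sent after pipeline recovery
        System.out.println("expected:         " + expectedMetaLength(offset + len));
        System.out.println("DN4 without skip: " + metaLengthAfterPacket(134144, offset, len, false));
        System.out.println("DN4 with skip:    " + metaLengthAfterPacket(134144, offset, len, true));
        System.out.println("DN3:              " + metaLengthAfterPacket(134028, offset, len, true));
    }
}
```

With the numbers from the log, this model gives 1063 for DN4 when the checksum of the already-written full chunk is not skipped and 1059 when it is. DN3 ends at 1059 either way, because its overlap with the packet is only a partial chunk.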





[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590228#comment-14590228
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590017#comment-14590017
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #220 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/220/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589830#comment-14589830
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #231 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/231/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589816#comment-14589816
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #961 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/961/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-16 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589235#comment-14589235
 ] 

Vinayakumar B commented on HDFS-4660:
-

bq. After stress testing using the setup mentioned above, we have deployed the 
fix to the production cluster that generated checksum errors frequently. We 
have not seen any corruption so far. We are confident that it fixes the issue.
Thanks for the info and contribution [~kihwal].



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588940#comment-14588940
 ] 

Hudson commented on HDFS-4660:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8028 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8028/])
HDFS-4660. Block corruption can happen during pipeline recovery. Contributed by 
Kihwal Lee. (kihwal: rev c74517c46bf00af408ed866b6577623cdec02de1)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-16 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588723#comment-14588723
 ] 

Kihwal Lee commented on HDFS-4660:
--

Thanks, Nathan.  With Vinay's binding +1 and Nathan's review, I will commit this.



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588695#comment-14588695
 ] 

Nathan Roberts commented on HDFS-4660:
--

+1 on the patch. I have reviewed the patch previously and it is currently 
running in production at scale. 

The stress test we ran against this in 
https://issues.apache.org/jira/browse/HDFS-4660?focusedCommentId=14542862&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14542862
 heavily exercised this path. 

