[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-08-10 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15416357#comment-15416357
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

[~daryn] That is a good suggestion. Zombies should be handled by the 
heartbeat's pruning of excess storages. Why do we need to wait until the block 
reports for all the storages in the heartbeat are processed? 
Do you want to submit a patch for this?

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-08-18 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427460#comment-15427460
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Thanks [~shv] for summarizing how zombies can be detected and appropriately 
handled using the existing heartbeat mechanism. I am working on a patch that 
implements this. 

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10809) getNumEncryptionZones causes NPE in branch-2.7

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10809:
-
Attachment: HDFS-10809.001.patch

> getNumEncryptionZones causes NPE in branch-2.7
> --
>
> Key: HDFS-10809
> URL: https://issues.apache.org/jira/browse/HDFS-10809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.7.4
>Reporter: Zhe Zhang
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10809.001.patch
>
>
> This bug was caused by the fact that HDFS-10458 was committed from trunk all 
> the way down to branch-2.7, while HDFS-8721 initially went in only down to 
> branch-2.8. So from branch-2.8 and up, the commit order is HDFS-8721 -> 
> HDFS-10458, but branch-2.7 has the reverse order. Hence the inconsistency.






[jira] [Commented] (HDFS-10809) getNumEncryptionZones causes NPE in branch-2.7

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447449#comment-15447449
 ] 

Vinitha Reddy Gankidi commented on HDFS-10809:
--

[~zhz] I have uploaded a patch. Please take a look. 

> getNumEncryptionZones causes NPE in branch-2.7
> --
>
> Key: HDFS-10809
> URL: https://issues.apache.org/jira/browse/HDFS-10809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.7.4
>Reporter: Zhe Zhang
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10809.001.patch
>
>
> This bug was caused by the fact that HDFS-10458 was committed from trunk all 
> the way down to branch-2.7, while HDFS-8721 initially went in only down to 
> branch-2.8. So from branch-2.8 and up, the commit order is HDFS-8721 -> 
> HDFS-10458, but branch-2.7 has the reverse order. Hence the inconsistency.






[jira] [Created] (HDFS-10814) Add assertion for getNumEncryptionZones when no EZ is created

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-10814:


 Summary: Add assertion for getNumEncryptionZones when no EZ is 
created
 Key: HDFS-10814
 URL: https://issues.apache.org/jira/browse/HDFS-10814
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Reporter: Vinitha Reddy Gankidi
Priority: Minor


HDFS-10809 adds an additional assertion to TestEncryptionZones to validate that 
getNumEncryptionZones returns 0 if there is no EZ. This is a useful check to 
add to trunk as well. 






[jira] [Assigned] (HDFS-10814) Add assertion for getNumEncryptionZones when no EZ is created

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-10814:


Assignee: Vinitha Reddy Gankidi

> Add assertion for getNumEncryptionZones when no EZ is created
> -
>
> Key: HDFS-10814
> URL: https://issues.apache.org/jira/browse/HDFS-10814
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
>
> HDFS-10809 adds an additional assertion to TestEncryptionZones to validate 
> that getNumEncryptionZones returns 0 if there is no EZ. This is a useful 
> check to add to trunk as well. 






[jira] [Updated] (HDFS-10814) Add assertion for getNumEncryptionZones when no EZ is created

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10814:
-
Attachment: HDFS-10814.001.patch

> Add assertion for getNumEncryptionZones when no EZ is created
> -
>
> Key: HDFS-10814
> URL: https://issues.apache.org/jira/browse/HDFS-10814
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Attachments: HDFS-10814.001.patch
>
>
> HDFS-10809 adds an additional assertion to TestEncryptionZones to validate 
> that getNumEncryptionZones returns 0 if there is no EZ. This is a useful 
> check to add to trunk as well. 






[jira] [Updated] (HDFS-10809) getNumEncryptionZones causes NPE in branch-2.7

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10809:
-
Attachment: (was: HDFS-10809.001.patch)

> getNumEncryptionZones causes NPE in branch-2.7
> --
>
> Key: HDFS-10809
> URL: https://issues.apache.org/jira/browse/HDFS-10809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.7.4
>Reporter: Zhe Zhang
>Assignee: Vinitha Reddy Gankidi
>
> This bug was caused by the fact that HDFS-10458 was committed from trunk all 
> the way down to branch-2.7, while HDFS-8721 initially went in only down to 
> branch-2.8. So from branch-2.8 and up, the commit order is HDFS-8721 -> 
> HDFS-10458, but branch-2.7 has the reverse order. Hence the inconsistency.






[jira] [Updated] (HDFS-10809) getNumEncryptionZones causes NPE in branch-2.7

2016-08-29 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10809:
-
Attachment: HDFS-10809-branch-2.7.001.patch

> getNumEncryptionZones causes NPE in branch-2.7
> --
>
> Key: HDFS-10809
> URL: https://issues.apache.org/jira/browse/HDFS-10809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.7.4
>Reporter: Zhe Zhang
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10809-branch-2.7.001.patch
>
>
> This bug was caused by the fact that HDFS-10458 was committed from trunk all 
> the way down to branch-2.7, while HDFS-8721 initially went in only down to 
> branch-2.8. So from branch-2.8 and up, the commit order is HDFS-8721 -> 
> HDFS-10458, but branch-2.7 has the reverse order. Hence the inconsistency.






[jira] [Commented] (HDFS-10814) Add assertion for getNumEncryptionZones when no EZ is created

2016-08-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449598#comment-15449598
 ] 

Vinitha Reddy Gankidi commented on HDFS-10814:
--

Thanks Zhe and Andrew!

> Add assertion for getNumEncryptionZones when no EZ is created
> -
>
> Key: HDFS-10814
> URL: https://issues.apache.org/jira/browse/HDFS-10814
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HDFS-10814.001.patch
>
>
> HDFS-10809 adds an additional assertion to TestEncryptionZones to validate 
> that getNumEncryptionZones returns 0 if there is no EZ. This is a useful 
> check to add to trunk as well. 






[jira] [Commented] (HDFS-10809) getNumEncryptionZones causes NPE in branch-2.7

2016-08-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15449841#comment-15449841
 ] 

Vinitha Reddy Gankidi commented on HDFS-10809:
--

Thanks [~zhz]. I could not reproduce the test failures locally either. 

> getNumEncryptionZones causes NPE in branch-2.7
> --
>
> Key: HDFS-10809
> URL: https://issues.apache.org/jira/browse/HDFS-10809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.7.4
>Reporter: Zhe Zhang
>Assignee: Vinitha Reddy Gankidi
> Fix For: 2.7.4
>
> Attachments: HDFS-10809-branch-2.7.001.patch
>
>
> This bug was caused by the fact that HDFS-10458 was committed from trunk all 
> the way down to branch-2.7, while HDFS-8721 initially went in only down to 
> branch-2.8. So from branch-2.8 and up, the commit order is HDFS-8721 -> 
> HDFS-10458, but branch-2.7 has the reverse order. Hence the inconsistency.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-12 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485884#comment-15485884
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Upon thorough investigation of the heartbeat logic, I have verified that 
unreported storages do get removed without any code change. The attached patch 
014 eliminates the state and the zombie storage removal logic introduced in 
HDFS-7960. 
I have added a unit test that verifies that when a DN storage with blocks is 
removed, the storage is removed from the DatanodeDescriptor as well and does 
not linger forever. Unreported storages are marked as FAILED in the 
{{updateHeartbeatState}} method when {{checkFailedStorages}} is true. Thus, 
when a DN storage is removed, it will be marked as FAILED in the next 
heartbeat. The storage removal happens in two steps after that (refer to 
steps 2 & 3 in 
https://issues.apache.org/jira/browse/HDFS-10301?focusedCommentId=15427387&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15427387; 
a simplified sketch follows below).
The test {{testRemovingStorageDoesNotProduceZombies}} introduced in HDFS-7960 
passes after reducing the heartbeat recheck interval so that the test doesn't 
time out. By default, the HeartbeatManager removes blocks associated with 
failed storages every 5 minutes.
I have ignored {{testProcessOverReplicatedAndMissingStripedBlock}} in this 
patch. Please refer to HDFS-10854 for more details.
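
Here is a rough, self-contained sketch of the three-step flow above 
(illustrative only, not the actual patch; the class, field, and method names 
are simplified, hypothetical stand-ins for the NameNode-side structures such 
as DatanodeDescriptor, {{updateHeartbeatState}} and {{pruneStorageMap}}):
{code}
import java.util.*;

// Illustrative sketch only: simplified stand-ins for the NameNode-side
// structures; names and shapes here are hypothetical.
class StoragePruningSketch {
  enum State { NORMAL, FAILED }
  static class Storage {
    final String id; State state = State.NORMAL; int blockCount;
    Storage(String id, int blockCount) { this.id = id; this.blockCount = blockCount; }
  }
  static final Map<String, Storage> storageMap = new HashMap<>();

  // Step 1 (heartbeat): storages absent from the heartbeat's storage
  // reports are marked FAILED, mirroring updateHeartbeatState().
  static void updateHeartbeatState(Set<String> reportedIds) {
    for (Storage s : storageMap.values())
      if (!reportedIds.contains(s.id)) s.state = State.FAILED;
  }

  // Step 2 (background, every 5 minutes by default): the HeartbeatManager
  // removes the replicas that belong to FAILED storages.
  static void removeBlocksOfFailedStorages() {
    for (Storage s : storageMap.values())
      if (s.state == State.FAILED) s.blockCount = 0;
  }

  // Step 3 (next heartbeat): FAILED storages with no blocks left are
  // pruned from the storage map, mirroring pruneStorageMap().
  static void pruneStorageMap() {
    storageMap.values().removeIf(s -> s.state == State.FAILED && s.blockCount == 0);
  }

  public static void main(String[] args) {
    storageMap.put("DS-1", new Storage("DS-1", 100));
    storageMap.put("DS-2", new Storage("DS-2", 50));
    updateHeartbeatState(Set.of("DS-1"));     // DS-2 is no longer reported
    removeBlocksOfFailedStorages();
    pruneStorageMap();
    System.out.println(storageMap.keySet());  // prints [DS-1]
  }
}
{code}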


> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-12 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.014.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-12 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485956#comment-15485956
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

[~arpiagariu] In the latest patch, the BR lease is removed when 
{{context.getTotalRpcs() == context.getCurRpc() + 1}}. If BRs are processed 
out of order or interleaved, the BR lease for the DN will be removed before 
all the BRs from the DN are processed. So, I have modified the {{checkLease}} 
method in {{BlockReportLeaseManager}} to return true when 
{{node.leaseId == 0}}; a sketch follows below. Please let me know if you see 
any issues with this approach.
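
A minimal, self-contained sketch of the relaxed check (illustrative only; 
{{Node}} below is a hypothetical stand-in for the per-DataNode lease state 
kept by {{BlockReportLeaseManager}}, which carries more bookkeeping than 
this):
{code}
class CheckLeaseSketch {
  static class Node { long leaseId; }  // hypothetical stand-in

  static boolean checkLease(Node node, long requestLeaseId) {
    if (node.leaseId == 0) {
      // The lease was already removed because an earlier RPC satisfied
      // curRpc + 1 == totalRpcs; accept the remaining storage reports
      // instead of dropping them.
      return true;
    }
    return node.leaseId == requestLeaseId;   // normal path: IDs must match
  }

  public static void main(String[] args) {
    Node dn = new Node();
    dn.leaseId = 42;
    System.out.println(checkLease(dn, 42));  // true: lease matches
    dn.leaseId = 0;                          // lease removed out of order
    System.out.println(checkLease(dn, 42));  // true: report still processed
  }
}
{code}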

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-14 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491142#comment-15491142
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

[~arpiagariu] Storage reports are sent in heartbeats anyway, and these reports 
have the information required to prune zombie storages. These storages are 
only marked as FAILED in the heartbeat; the replicas are removed in the 
background by the HeartbeatManager. Why exactly do you think zombie removal in 
heartbeats is not safe? Why do we need to wait for all storage block reports 
from an FBR?

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-15 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495211#comment-15495211
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

[~jingzhao] 
??Then can this cover DN hotswap case??
Yes, I will explain how it does below.

??For DN hotswap, I think the DN only sends FBR to notify NN about the change??
That is right.

During hotswap, {{DataNode.reconfigurePropertyImpl()}} is invoked, which 
identifies the newly added and removed volumes. For each volume to be removed, 
{{FsDatasetImpl.removeVolumes()}} is called. This also removes the block infos 
from the FsDataset: it adds these blocks to the {{blkToInvalidate}} map, and 
then the {{FsDatasetImpl.invalidate()}} method is invoked for all the blocks 
in the map.
{code}
  /**
   * Invalidate a block but does not delete the actual on-disk block file.
   *
   * It should only be used when deactivating disks.
   *
   * @param bpid the block pool ID.
   * @param block The block to be invalidated.
   */
  public void invalidate(String bpid, ReplicaInfo block) {
    // If a DFSClient has the replica in its cache of short-circuit file
    // descriptors (and the client is using ShortCircuitShm), invalidate it.
    datanode.getShortCircuitRegistry().processBlockInvalidation(
        new ExtendedBlockId(block.getBlockId(), bpid));

    // If the block is cached, start uncaching it.
    cacheManager.uncacheBlock(bpid, block.getBlockId());

    datanode.notifyNamenodeDeletedBlock(new ExtendedBlock(bpid, block),
        block.getStorageUuid());
  }
{code}

As you can see, these blocks are reported to the NN as deleted. So, the NN 
eventually removes all the blocks associated with this volume. Once this is 
done, the volume is actually pruned by {{DatanodeDescriptor.pruneStorageMap()}} 
in the subsequent heartbeat.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-19 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504561#comment-15504561
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

[~arpiagariu] I understand that we may bypass the leaseId check if the storage 
report processing happens out of order. Are there any issues with this 
workaround? What needs to be modified?
We do not need to detect the last storage report in this implementation, as 
the pruning of storages happens in the heartbeat. 

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517992#comment-15517992
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Why do we need to detect the last report? I don't see any potential problems 
with the checkLease change. Like Konstantin mentioned, what exactly do you 
mean by the last report? It would be helpful if you could give a scenario 
where this particular change can cause problems.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-09-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518131#comment-15518131
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

i) When BRs are split into multiple RPCs: Say two BRs from the same DN are 
processed at the same time. If we process the last storage report of the 
second BR before processing all the storage reports in the first BR, then the 
remaining storage reports in the first BR will be ignored, as checkLease would 
return false.
{code}
if (context != null) {
  if (context.getTotalRpcs() == context.getCurRpc() + 1) {
    long leaseId = this.getBlockReportLeaseManager().removeLease(node);
    BlockManagerFaultInjector.getInstance().
        removeBlockReportLease(node, leaseId);
  }
}
{code}
ii) For single-RPC BRs: As all storage reports in a single-RPC BR satisfy the 
condition that triggers removal of the lease, all storage reports after the 
first one would be ignored without the change (see the simulation below).
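
To make scenario (i) concrete, here is a small self-contained simulation 
(illustrative only; it models just the lease bookkeeping, and the names are 
hypothetical):
{code}
// Two identical 2-RPC block reports (a timed-out one and its retransmission)
// interleave; without the relaxed check the tail of the first BR is dropped.
class InterleaveSketch {
  static long leaseId = 42;             // lease currently held by the DN
  static boolean relaxedCheck = false;  // flip to true to model the fix

  static boolean processRpc(int curRpc, int totalRpcs) {
    boolean accepted = (leaseId != 0) || relaxedCheck;    // checkLease()
    if (accepted && curRpc + 1 == totalRpcs) leaseId = 0; // removeLease()
    return accepted;
  }

  public static void main(String[] args) {
    // Order: BR#1 rpc0, BR#2 rpc0, BR#2 rpc1 (last), BR#1 rpc1 (last)
    int[][] rpcs = { {0, 2}, {0, 2}, {1, 2}, {1, 2} };
    for (int[] r : rpcs)
      System.out.println("curRpc=" + r[0] + " accepted=" + processRpc(r[0], r[1]));
    // With relaxedCheck = false the last line prints accepted=false: the
    // remaining storage reports of the first BR are ignored.
  }
}
{code}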


> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-04 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546390#comment-15546390
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Patch 15 has the changes mentioned in 
https://issues.apache.org/jira/browse/HDFS-10301?focusedCommentId=15536676&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15536676.
 Kindly review.

??It does not solve the race between a timed out BR and the repeating BR in 
multi-RPC BR case.??
When there is a race, the per-storage BRs that arrive after the removal of the 
node lease would not be processed. I think that is okay. BR retransmissions 
are handled by the underlying RPC layer: the same RPC request is retried as 
per the specified retry policy. Since these retransmitted BRs are identical, 
it is sufficient to process all the per-storage BRs once. It seems okay to 
ignore the subsequent retransmitted BRs from the same node once {{curRpc + 1 
== totalRpcs}} is satisfied. Does that sound reasonable?

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-04 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.015.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-04 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546645#comment-15546645
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

The test failure seems unrelated. It passes locally.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-8028) TestNNHandlesBlockReportPerStorage/TestNNHandlesCombinedBlockReport Failed after patched HDFS-7704

2016-10-07 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556846#comment-15556846
 ] 

Vinitha Reddy Gankidi commented on HDFS-8028:
-

These tests fail on branch-2.7 after HDFS-7704, but they pass after HDFS-7430. 
This doesn't need to be fixed in branch-2.7. The initialization values for 
DN_RESCAN_INTERVAL and DN_RESCAN_EXTRA_WAIT need to be modified as per 
HDFS-7430 to fix it temporarily.

> TestNNHandlesBlockReportPerStorage/TestNNHandlesCombinedBlockReport Failed 
> after patched HDFS-7704
> --
>
> Key: HDFS-8028
> URL: https://issues.apache.org/jira/browse/HDFS-8028
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.7.0
>Reporter: hongyu bi
>Assignee: hongyu bi
>Priority: Minor
> Attachments: HDFS-8028-v0.patch
>
>
> HDFS-7704 makes bad block reporting asynchronous; however, 
> BlockReportTestBase#blockreport_02 doesn't wait for a while after the block 
> report.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-13 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.016.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.016.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-13 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572936#comment-15572936
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Updated the patch. The conflict was due to a recent patch pushed upstream.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.016.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-13 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: (was: HDFS-10301.015.patch)

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.016.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-13 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: (was: HDFS-10301.016.patch)

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-13 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.015.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10712) Fix TestDataNodeVolumeFailure on 2.* branches.

2016-10-14 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575943#comment-15575943
 ] 

Vinitha Reddy Gankidi commented on HDFS-10712:
--

[~shv] Somehow lost track of this one. Can you commit it?

> Fix TestDataNodeVolumeFailure on 2.* branches.
> --
>
> Key: HDFS-10712
> URL: https://issues.apache.org/jira/browse/HDFS-10712
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10712.branch-2.7.patch, HDFS-10712.branch-2.patch
>
>
> {{TestDataNodeVolumeFailure.testVolumeFailure()}} should pass a non-null 
> {{BlockReportContext}}.
> This has been fixed on trunk.






[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-17 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.branch-2.7.015.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.015.patch, 
> HDFS-10301.branch-2.7.015.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-10-17 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583534#comment-15583534
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Attached the patch for branch-2.7.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.014.patch, 
> HDFS-10301.015.patch, HDFS-10301.branch-2.015.patch, 
> HDFS-10301.branch-2.7.015.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report 
> and then sends the report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from 
> different reports. This corrupts the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.






[jira] [Assigned] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2016-10-17 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-10733:


Assignee: Vinitha Reddy Gankidi

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutException}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-11313) Segmented Block Reports

2017-03-24 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-11313:


Assignee: Vinitha Reddy Gankidi

> Segmented Block Reports
> ---
>
> Key: HDFS-11313
> URL: https://issues.apache.org/jira/browse/HDFS-11313
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.2
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>
> Block reports from a single DataNode can currently be split into multiple 
> RPCs, each reporting a single DataNode storage (disk). The reports are still 
> large since disks are getting bigger. Splitting blockReport RPCs into 
> multiple smaller calls would improve NameNode performance and overall HDFS 
> stability.
> This was discussed in multiple JIRAs. Here the approach is to let the NameNode 
> divide the block ID space into segments and then ask DataNodes to report 
> replicas in a particular range of IDs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11313) Segmented Block Reports

2017-03-24 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941294#comment-15941294
 ] 

Vinitha Reddy Gankidi commented on HDFS-11313:
--

Assigning it to myself. Will attach a design doc soon.

> Segmented Block Reports
> ---
>
> Key: HDFS-11313
> URL: https://issues.apache.org/jira/browse/HDFS-11313
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.2
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>
> Block reports from a single DataNode can currently be split into multiple 
> RPCs, each reporting a single DataNode storage (disk). The reports are still 
> large since disks are getting bigger. Splitting blockReport RPCs into 
> multiple smaller calls would improve NameNode performance and overall HDFS 
> stability.
> This was discussed in multiple JIRAs. Here the approach is to let the NameNode 
> divide the block ID space into segments and then ask DataNodes to report 
> replicas in a particular range of IDs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949593#comment-15949593
 ] 

Vinitha Reddy Gankidi commented on HDFS-11384:
--

Two other approaches to fix this:

1. In {{getBlockList()}} the Dispatcher fetches the blocks belonging to a 
particular DN from the NN and then moves those blocks from the source DN to 
the target DN. The Dispatcher could instead get the blocks directly from the 
DN itself. This makes {{getBlockList()}} a distributed operation that doesn't 
put load on any single node. (A rough sketch of this follows below.)

2. The Dispatcher can fetch the blocks from the Standby NN instead of the 
Active. The Balancer should be able to tolerate a reasonable degree of 
staleness.
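To make approach (1) concrete, a purely illustrative sketch: 
{{getBlocksFromDatanode()}} and the {{fetchFromDatanode}} option below are 
assumptions, not existing APIs; only {{NameNodeConnector#getBlocks}} exists 
today.
{code}
// Hypothetical sketch of approach (1). getBlocksFromDatanode() is an
// assumed new DN-side RPC and fetchFromDatanode an assumed new balancer
// option; neither exists in the current code base.
private BlocksWithLocations getBlockList(DDatanode datanode, long size)
    throws IOException {
  if (fetchFromDatanode) {
    // Each DN serves its own block list: the load is spread across the
    // cluster instead of hitting the single active NN.
    return datanode.getBlocksFromDatanode(size);
  }
  // Current behavior: every getBlocks call lands on the NameNode.
  return nnc.getBlocks(datanode.getDatanodeInfo(), size);
}
{code}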

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949757#comment-15949757
 ] 

Vinitha Reddy Gankidi commented on HDFS-11384:
--

If we were to offload the calls to the DNs, dispersing the calls wouldn't be a 
pressing issue. I would like to get some feedback on the various approaches 
discussed. [~benoyantony] [~daryn] [~liuml07] I would love to hear your opinions.

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949757#comment-15949757
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 8:36 PM:
---

If we were to offload the calls to the DNs, dispersing the calls wouldn't be a 
pressing issue. I would like to get some feedback on the various approaches 
discussed. [~benoyantony] [~daryn] [~liuml07] [~zhaoyunjiong] I would love to 
hear your opinions.


was (Author: redvine):
If we were to offload the calls to the DNs, dispersing the calls wouldn't be a 
pressing issue. I would like to get some feedback on the various approaches 
discussed. [~benoyantony] [~daryn] [~liuml07] I would love to hear your opinions.

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949757#comment-15949757
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 8:36 PM:
---

If we were to offload the calls to the DNs, dispersing the calls wouldn't be a 
pressing issue. I would like to get some feedback on the various approaches 
discussed. [~benoyantony], [~daryn], [~liuml07] and [~zhaoyunjiong] I would love 
to hear your opinions.


was (Author: redvine):
If we were to offload the calls to the DNs, dispersing the calls wouldn't be a 
pressing issue. I would like to get some feedback on the various approaches 
discussed. [~benoyantony] [~daryn] [~liuml07] [~zhaoyunjiong] I would love to 
hear your opinions.

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949978#comment-15949978
 ] 

Vinitha Reddy Gankidi commented on HDFS-11384:
--

[~shv] I'm leaning towards reading from (4) instead of (3).
{{isGoodBlockCandidate}} needs a global view of the block replicas. There is 
also some additional logic to deal with erasure-coded (EC) blocks, and this may 
be a blocker for reading from DNs. [~zhz] you probably have more context 
regarding the EC blocks.
{code}
 /**
   * Decide if the block/blockGroup is a good candidate to be moved from source
   * to target. A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica/internalBlock on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
  StorageType targetStorageType, DBlock block) {
{code}

I agree that (2) and (4) are complementary. 

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-03-30 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949978#comment-15949978
 ] 

Vinitha Reddy Gankidi edited comment on HDFS-11384 at 3/30/17 10:29 PM:


[~shv] I'm leaning towards (4) instead of (3).
{{isGoodBlockCandidate}} needs a global view of the block replicas. There is 
also some additional logic to deal with erasure-coded (EC) blocks, and this may 
be a blocker for reading from DNs. [~zhz] you probably have more context 
regarding the EC blocks.
{code}
 /**
   * Decide if the block/blockGroup is a good candidate to be moved from source
   * to target. A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica/internalBlock on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
  StorageType targetStorageType, DBlock block) {
{code}

I agree that (2) and (4) are complementary. 


was (Author: redvine):
[~shv] I'm leaning towards reading from (4) instead of (3).
{{isGoodBlockCandidate}} needs a global view of the block replicas. There is 
also some additional logic to deal with erasure-coded (EC) blocks, and this may 
be a blocker for reading from DNs. [~zhz] you probably have more context 
regarding the EC blocks.
{code}
 /**
   * Decide if the block/blockGroup is a good candidate to be moved from source
   * to target. A block is a good candidate if
   * 1. the block is not in the process of being moved/has not been moved;
   * 2. the block does not have a replica/internalBlock on the target;
   * 3. doing the move does not reduce the number of racks that the block has
   */
  private boolean isGoodBlockCandidate(StorageGroup source, StorageGroup target,
  StorageType targetStorageType, DBlock block) {
{code}

I agree that (2) and (4) are complementary. 

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike

2017-04-10 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963707#comment-15963707
 ] 

Vinitha Reddy Gankidi commented on HDFS-11384:
--

[~shv] The delay logic looks good to me. It would be great if we could make 
{{BALANCER_NUM_RPC_PER_SEC}} configurable with a default value of 20. The test 
does not ensure that there are indeed 20 getBlocks calls per second, and it 
probably is not straightforward to ensure that, so I would like the ability to 
configure {{BALANCER_NUM_RPC_PER_SEC}}. A minimal sketch of the throttling I 
have in mind is below.
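For illustration only; the config key, default, and method names below are my 
assumptions, not the committed patch:
{code}
// Minimal sketch of throttling getBlocks RPCs to a fixed per-second rate.
// All names here are illustrative and not part of the committed patch.
static final String BALANCER_GETBLOCKS_RPCS_PER_SEC_KEY =
    "dfs.balancer.getBlocks.rpcs-per-sec";          // assumed key
static final int BALANCER_GETBLOCKS_RPCS_PER_SEC_DEFAULT = 20;

private final int maxRpcsPerSec = conf.getInt(
    BALANCER_GETBLOCKS_RPCS_PER_SEC_KEY,
    BALANCER_GETBLOCKS_RPCS_PER_SEC_DEFAULT);
private long windowStartMs = Time.monotonicNow();
private int rpcsInWindow = 0;

/** Block until another getBlocks RPC may be issued in this window. */
private synchronized void acquireGetBlocksPermit()
    throws InterruptedException {
  long now = Time.monotonicNow();
  if (now - windowStartMs >= 1000) {   // start a fresh one-second window
    windowStartMs = now;
    rpcsInWindow = 0;
  }
  if (++rpcsInWindow > maxRpcsPerSec) {
    // Window exhausted: sleep out the remainder, then start a new one.
    Thread.sleep(1000 - (now - windowStartMs));
    windowStartMs = Time.monotonicNow();
    rpcsInWindow = 1;
  }
}
{code}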

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.7.3
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes 
> the NameNode's rpc.CallQueueLength to spike. We observed that this situation 
> could cause HBase cluster failures due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11313) Segmented Block Reports

2017-04-12 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11313:
-
Attachment: SegmentedBlockReports.pdf

> Segmented Block Reports
> ---
>
> Key: HDFS-11313
> URL: https://issues.apache.org/jira/browse/HDFS-11313
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.2
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: SegmentedBlockReports.pdf
>
>
> Block reports from a single DataNode can currently be split into multiple 
> RPCs, each reporting a single DataNode storage (disk). The reports are still 
> large since disks are getting bigger. Splitting blockReport RPCs into 
> multiple smaller calls would improve NameNode performance and overall HDFS 
> stability.
> This was discussed in multiple JIRAs. Here the approach is to let the NameNode 
> divide the block ID space into segments and then ask DataNodes to report 
> replicas in a particular range of IDs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11313) Segmented Block Reports

2017-04-12 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965511#comment-15965511
 ] 

Vinitha Reddy Gankidi commented on HDFS-11313:
--

Attached the design doc. Please take a look; I would appreciate any feedback on 
the design. Once we finalize it, I'll create subtasks for the implementation. A 
minimal sketch of the segmentation idea is below.
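All names in this sketch are assumed from the design discussion, not existing 
code: the NN hands a DN a block ID range, and the DN reports only the replicas 
whose IDs fall inside it.
{code}
// Illustrative only; BlockReportSegment and replicasInSegment() are
// assumed names, not an existing API.
class BlockReportSegment {
  final long startBlockId;   // inclusive lower bound of the segment
  final long endBlockId;     // exclusive upper bound

  BlockReportSegment(long startBlockId, long endBlockId) {
    this.startBlockId = startBlockId;
    this.endBlockId = endBlockId;
  }

  boolean contains(long blockId) {
    return blockId >= startBlockId && blockId < endBlockId;
  }
}

// On the DataNode: report only the replicas inside the requested segment.
static List<Block> replicasInSegment(Iterable<Block> replicas,
                                     BlockReportSegment segment) {
  List<Block> result = new ArrayList<>();
  for (Block b : replicas) {
    if (segment.contains(b.getBlockId())) {
      result.add(b);
    }
  }
  return result;
}
{code}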

> Segmented Block Reports
> ---
>
> Key: HDFS-11313
> URL: https://issues.apache.org/jira/browse/HDFS-11313
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.2
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: SegmentedBlockReports.pdf
>
>
> Block reports from a single DataNode can currently be split into multiple 
> RPCs, each reporting a single DataNode storage (disk). The reports are still 
> large since disks are getting bigger. Splitting blockReport RPCs into 
> multiple smaller calls would improve NameNode performance and overall HDFS 
> stability.
> This was discussed in multiple JIRAs. Here the approach is to let the NameNode 
> divide the block ID space into segments and then ask DataNodes to report 
> replicas in a particular range of IDs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11634) Optimize BlockIterator when iterating starts in the middle.

2017-04-12 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966967#comment-15966967
 ] 

Vinitha Reddy Gankidi commented on HDFS-11634:
--

It's a good improvement. One minor nit: 
{{index}} is initialized to zero twice.

[~zhz] raised a good point. It seems like we don't need the iterators for the 
skipped storages; a sketch of the storage-skipping idea is below.
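Something along these lines, simplified from the patch and with variable names 
assumed:
{code}
// Skip whole storages whose blocks all lie before startBlock, instead of
// advancing the iterator one block at a time. Names are illustrative.
int remaining = startBlock;
int index = 0;
for (DatanodeStorageInfo storage : storages) {
  if (remaining >= storage.numBlocks()) {
    remaining -= storage.numBlocks();  // entire storage precedes startBlock
    index++;                           // skip it without building an iterator
  } else {
    break;                             // startBlock falls inside this storage
  }
}
// Iterate from storages[index], skipping 'remaining' blocks within it.
{code}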

> Optimize BlockIterator when iterating starts in the middle.
> 
>
> Key: HDFS-11634
> URL: https://issues.apache.org/jira/browse/HDFS-11634
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.6.5
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Attachments: HDFS-11634.001.patch, HDFS-11634.002.patch, 
> HDFS-11634.003.patch, HDFS-11634.004.patch
>
>
> {{BlockManager.getBlocksWithLocations()}} needs to iterate blocks from a 
> randomly selected {{startBlock}} index. It creates an iterator that points 
> to the first block and then skips all blocks until {{startBlock}}. This is 
> inefficient when the DN has multiple storages. Instead of skipping blocks one 
> by one, we can skip entire storages, which should be more efficient on average.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11808) Backport HDFS-8549 to branch-2.7: Abort the balancer if an upgrade is in progress

2017-05-11 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11808:


 Summary: Backport HDFS-8549 to branch-2.7: Abort the balancer if 
an upgrade is in progress
 Key: HDFS-11808
 URL: https://issues.apache.org/jira/browse/HDFS-11808
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi


As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-8549 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-11808) Backport HDFS-8549 to branch-2.7: Abort the balancer if an upgrade is in progress

2017-05-12 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-11808:


Assignee: (was: Vinitha Reddy Gankidi)

> Backport HDFS-8549 to branch-2.7: Abort the balancer if an upgrade is in 
> progress
> -
>
> Key: HDFS-11808
> URL: https://issues.apache.org/jira/browse/HDFS-11808
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-8549 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11837:


 Summary: Backport HDFS-9710 to branch-2.7: Change DN to send block 
receipt IBRs in batches
 Key: HDFS-11837
 URL: https://issues.apache.org/jira/browse/HDFS-11837
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11837:
-
Description: As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-9710 to branch-2.7.

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013604#comment-16013604
 ] 

Vinitha Reddy Gankidi commented on HDFS-11837:
--

This patch depends on two other patches that aren't in branch-2.7: HDFS-7990 
and HDFS-9726. Will create separate JIRAs to track these two backports.

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11838:


 Summary: Backport HDFS-7990 to branch-2.7: IBR delete ack should 
not be delayed
 Key: HDFS-11838
 URL: https://issues.apache.org/jira/browse/HDFS-11838
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi


As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11839) Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11839:


 Summary: Backport HDFS-9726 to branch-2.7: Refactor IBR code to a 
new class
 Key: HDFS-11839
 URL: https://issues.apache.org/jira/browse/HDFS-11839
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi
Priority: Minor


As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-9726 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11838:
-
Attachment: HDFS-7990-branch-2.7.00.patch

> Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed
> --
>
> Key: HDFS-11838
> URL: https://issues.apache.org/jira/browse/HDFS-11838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-7990-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013623#comment-16013623
 ] 

Vinitha Reddy Gankidi commented on HDFS-11838:
--

[~shv] Please review the patch.

> Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed
> --
>
> Key: HDFS-11838
> URL: https://issues.apache.org/jira/browse/HDFS-11838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-7990-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-16 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11838:
-
Status: Patch Available  (was: Open)

> Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed
> --
>
> Key: HDFS-11838
> URL: https://issues.apache.org/jira/browse/HDFS-11838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-7990-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-17 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015059#comment-16015059
 ] 

Vinitha Reddy Gankidi commented on HDFS-11838:
--

Good catch. Thanks Konstantin. Attached a new patch removing {{startTime}}.

> Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed
> --
>
> Key: HDFS-11838
> URL: https://issues.apache.org/jira/browse/HDFS-11838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-7990-branch-2.7.00.patch, 
> HDFS-7990-branch-2.7.01.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11838) Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed

2017-05-17 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11838:
-
Attachment: HDFS-7990-branch-2.7.01.patch

> Backport HDFS-7990 to branch-2.7: IBR delete ack should not be delayed
> --
>
> Key: HDFS-11838
> URL: https://issues.apache.org/jira/browse/HDFS-11838
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-7990-branch-2.7.00.patch, 
> HDFS-7990-branch-2.7.01.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-7990 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11839) Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11839:
-
Status: Patch Available  (was: Open)

> Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class
> --
>
> Key: HDFS-11839
> URL: https://issues.apache.org/jira/browse/HDFS-11839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Attachments: HDFS-9726.branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9726 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11839) Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11839:
-
Attachment: HDFS-9726.branch-2.7.00.patch

> Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class
> --
>
> Key: HDFS-11839
> URL: https://issues.apache.org/jira/browse/HDFS-11839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Attachments: HDFS-9726.branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9726 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11839) Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016678#comment-16016678
 ] 

Vinitha Reddy Gankidi commented on HDFS-11839:
--

[~shv] Can you please review the patch? Regarding the checkstyle issues, other 
than the unused import, the remaining ones should be present in the original 
patch as well. 

> Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class
> --
>
> Key: HDFS-11839
> URL: https://issues.apache.org/jira/browse/HDFS-11839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Attachments: HDFS-9726.branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9726 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11839) Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11839:
-
Attachment: HDFS-9726-branch-2.7.01.patch

> Backport HDFS-9726 to branch-2.7: Refactor IBR code to a new class
> --
>
> Key: HDFS-11839
> URL: https://issues.apache.org/jira/browse/HDFS-11839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>Priority: Minor
> Attachments: HDFS-9726.branch-2.7.00.patch, 
> HDFS-9726-branch-2.7.01.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9726 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9726) Refactor IBR code to a new class

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-9726:

Attachment: HDFS-9726-branch-2.7.01.patch

> Refactor IBR code to a new class
> 
>
> Key: HDFS-9726
> URL: https://issues.apache.org/jira/browse/HDFS-9726
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Minor
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: h9726_20160131.patch, h9726_20160201.patch, 
> h9726_20160203.patch, h9726_20160204.patch, HDFS-9726-branch-2.7.01.patch
>
>
> The IBR code currently is mainly in BPServiceActor.  The JIRA is to refactor 
> it to a new class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11855) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11855:


 Summary: Backport HDFS-9412 to branch-2.7: getBlocks occupies 
FSLock and takes too long to complete
 Key: HDFS-11855
 URL: https://issues.apache.org/jira/browse/HDFS-11855
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi


As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11854) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)
Vinitha Reddy Gankidi created HDFS-11854:


 Summary: Backport HDFS-9412 to branch-2.7: getBlocks occupies 
FSLock and takes too long to complete
 Key: HDFS-11854
 URL: https://issues.apache.org/jira/browse/HDFS-11854
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinitha Reddy Gankidi
Assignee: Vinitha Reddy Gankidi


As per discussion on the [mailing 
list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
 backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11855) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11855:
-
Attachment: HDFS-9412-branch-2.7.00.patch

> Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too 
> long to complete
> --
>
> Key: HDFS-11855
> URL: https://issues.apache.org/jira/browse/HDFS-11855
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9412-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11855) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11855:
-
Status: Patch Available  (was: Open)

> Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too 
> long to complete
> --
>
> Key: HDFS-11855
> URL: https://issues.apache.org/jira/browse/HDFS-11855
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9412-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11855) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-18 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016928#comment-16016928
 ] 

Vinitha Reddy Gankidi commented on HDFS-11855:
--

[~shv] Please take a look.

> Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too 
> long to complete
> --
>
> Key: HDFS-11855
> URL: https://issues.apache.org/jira/browse/HDFS-11855
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9412-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11854) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-19 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018042#comment-16018042
 ] 

Vinitha Reddy Gankidi commented on HDFS-11854:
--

[~arpiagariu] Yes, let me resolve it. Thanks for that. The first time I tried 
to create the JIRA I got an error, but it looks like the creation was actually 
successful.

> Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too 
> long to complete
> --
>
> Key: HDFS-11854
> URL: https://issues.apache.org/jira/browse/HDFS-11854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-11854) Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too long to complete

2017-05-19 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi resolved HDFS-11854.
--
Resolution: Duplicate

> Backport HDFS-9412 to branch-2.7: getBlocks occupies FSLock and takes too 
> long to complete
> --
>
> Key: HDFS-11854
> URL: https://issues.apache.org/jira/browse/HDFS-11854
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9412 to branch-2.7. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-22 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11837:
-
Attachment: HDFS-9710-branch-2.7.00.patch

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-22 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11837:
-
Status: Patch Available  (was: Open)

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021617#comment-16021617
 ] 

Vinitha Reddy Gankidi commented on HDFS-11837:
--

[~shv] Please take a look. I've verified that all these tests pass locally.

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022199#comment-16022199
 ] 

Vinitha Reddy Gankidi commented on HDFS-11837:
--

[~shv] {{ReplaceDatanodeOnFailure}} is used in {{TestBatchIbr}}:
{{conf.setBoolean(ReplaceDatanodeOnFailure.BEST_EFFORT_KEY, true);}}
Is there something I'm missing?

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-24 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-11837:
-
Attachment: HDFS-9710-branch-2.7.01.patch

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch, 
> HDFS-9710-branch-2.7.01.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11837) Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in batches

2017-05-24 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16023350#comment-16023350
 ] 

Vinitha Reddy Gankidi commented on HDFS-11837:
--

You are right. I was looking at a different branch. I have uploaded a new patch 
removing the unused import.

> Backport HDFS-9710 to branch-2.7: Change DN to send block receipt IBRs in 
> batches
> -
>
> Key: HDFS-11837
> URL: https://issues.apache.org/jira/browse/HDFS-11837
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Vinitha Reddy Gankidi
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-9710-branch-2.7.00.patch, 
> HDFS-9710-branch-2.7.01.patch
>
>
> As per discussion on the [mailing 
> list|http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201705.mbox/browser],
>  backport HDFS-9710 to branch-2.7.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2017-01-09 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812538#comment-15812538
 ] 

Vinitha Reddy Gankidi commented on HDFS-10733:
--

[~kihwal] Thanks for the great suggestion. 

I have attached a patch that extends the end time/timeout when there is a long 
pause due to a full GC in the NN. The included unit test asserts that a timeout 
exception is still thrown, rather than the timeout being extended as in the 
full-GC case, when there really are no responses from the JournalNodes. Please 
take a look; a simplified sketch of the idea is below. 
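Roughly (variable names here are illustrative, not the exact patch):
{code}
// Inside a QuorumCall.waitFor()-style loop: if the thread slept far
// longer than requested (e.g. a stop-the-world full GC), extend the
// deadline instead of declaring the quorum unresponsive.
// All variable names are illustrative.
long requested = Math.min(remainingMs, checkEveryMs);
long before = Time.monotonicNow();
wait(requested);
long overslept = Time.monotonicNow() - before - requested;
if (overslept > pauseThresholdMs) {   // likely a long GC pause
  LOG.warn("Detected pause of " + overslept
      + " ms; extending quorum timeout");
  endTimeMs += overslept;             // don't blame the JournalNodes
}
{code}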

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10733.001.patch
>
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutException}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2017-01-09 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10733:
-
Attachment: HDFS-10733.001.patch

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10733.001.patch
>
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutException}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2017-01-10 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10733:
-
Attachment: HDFS-10733.002.patch

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10733.001.patch, HDFS-10733.002.patch
>
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutException}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2017-01-10 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816705#comment-15816705
 ] 

Vinitha Reddy Gankidi commented on HDFS-10733:
--

[~shv] I agree. Attached a new patch with this change.

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10733.001.patch, HDFS-10733.002.patch
>
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutExcpetion}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10733) NameNode terminated after full GC thinking QJM is unresponsive.

2017-01-10 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10733:
-
Status: Patch Available  (was: Open)

> NameNode terminated after full GC thinking QJM is unresponsive.
> ---
>
> Key: HDFS-10733
> URL: https://issues.apache.org/jira/browse/HDFS-10733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.6.4
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
> Attachments: HDFS-10733.001.patch, HDFS-10733.002.patch
>
>
> NameNode went into full GC while in {{AsyncLoggerSet.waitForWriteQuorum()}}. 
> After completing GC it checks if the timeout for quorum is reached. If the GC 
> was long enough the timeout can expire, and {{QuorumCall.waitFor()}} will 
> throw {{TimeoutException}}. Finally {{FSEditLog.logSync()}} catches the 
> exception and terminates NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11313) Segmented Block Reports

2017-01-10 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816933#comment-15816933
 ] 

Vinitha Reddy Gankidi commented on HDFS-11313:
--

[~shv] This idea seems promising. I would like to work on it. I wanted to note 
that HDFS-7923 is related, in the sense that the block reports are sent by the 
DN only when the NN gives the signal. Even with that patch, the issue of 
processing large DN reports under a global namespace lock still remains.  
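
As a rough illustration of the segmentation, the NN could derive the ID range 
for each segment as below (a sketch under the simplifying assumption of 
non-negative block IDs; real clusters may also have legacy negative random IDs):

{code}
public class BlockIdSegments {
  // Inclusive bounds of segment i out of n equal segments of the
  // non-negative long space; purely illustrative, not a proposed API.
  static long[] segment(int n, int i) {
    long width = Long.MAX_VALUE / n;
    long start = width * (long) i;
    long end = (i == n - 1) ? Long.MAX_VALUE : start + width - 1;
    return new long[] { start, end };
  }
}
{code}

The NN would then ask a DN to report only replicas whose block IDs fall within 
one such range per RPC.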

> Segmented Block Reports
> ---
>
> Key: HDFS-11313
> URL: https://issues.apache.org/jira/browse/HDFS-11313
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.2
>Reporter: Konstantin Shvachko
>
> Block reports from a single DataNode can be currently split into multiple 
> RPCs each reporting a single DataNode storage (disk). The reports are still 
> large since disks are getting bigger. Splitting blockReport RPCs into 
> multiple smaller calls would improve NameNode performance and overall HDFS 
> stability.
> This was discussed in multiple jiras. Here the approach is to let NameNode 
> divide blockID space into segments and then ask DataNodes to report replicas 
> in a particular range of IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-23 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-10301:


Assignee: Vinitha Reddy Gankidi  (was: Colin Patrick McCabe)

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-23 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.004.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.01.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297526#comment-15297526
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Assigning the ticket to myself so that I can upload a patch. Please review.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297563#comment-15297563
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

I uploaded the patch HDFS-10301.004.patch. I have implemented the idea that 
Konstantin suggested, i.e., DNs explicitly report the storages that they have. 
This eliminates the NN guessing which storage is the last in the block report 
RPC. In the case of a FBR, NameNodeRpcServer can retrieve the list of storages 
from the storage block report array. In the case that block reports are split, 
DNs send an additional StorageReportOnly RPC after sending the block reports 
for each individual storage. This StorageReportOnly RPC is sent as a FBR and 
lists all of the storages that the DN has, each with a block count of -1. A new 
enum STORAGE_REPORT_ONLY is introduced in BlockListAsLongs for this purpose.

Zombie storage removal is triggered from NameNodeRpcServer instead of the 
BlockManager, since the RPC server now has all the information required to 
construct the list of storages that the DN is reporting. After the block 
reports are processed as usual, zombie storages are removed by comparing the 
list of storages in the block report with the list of storages that the NN is 
aware of for that DN.
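
The comparison itself is just a set difference. A minimal sketch (illustrative 
names, not the patch code):

{code}
import java.util.HashSet;
import java.util.Set;

public class ZombieStorages {
  // Storages the NN tracks for a DN that are absent from the DN's own
  // full report are zombie candidates and can be pruned.
  static Set<String> findZombies(Set<String> trackedByNameNode,
                                 Set<String> reportedByDataNode) {
    Set<String> zombies = new HashSet<>(trackedByNameNode);
    zombies.removeAll(reportedByDataNode);
    return zombies;
  }
}
{code}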



> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.01.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-24 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299255#comment-15299255
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Thanks for your review [~cmccabe]. By legacy reports do you mean block reports 
from DNs before the concept of leases was introduced for block reports? 

{code}
public synchronized boolean checkLease(DatanodeDescriptor dn,
    long monotonicNowMs, long id) {
  if (id == 0) {
    LOG.debug("Datanode {} is using BR lease id 0x0 to bypass " +
        "rate-limiting.", dn.getDatanodeUuid());
    return true;
  }
  NodeData node = nodes.get(dn.getDatanodeUuid());
  if (node == null) {
    LOG.info("BR lease 0x{} is not valid for unknown datanode {}",
        Long.toHexString(id), dn.getDatanodeUuid());
    return false;
  }
  if (node.leaseId == 0) {
    LOG.warn("BR lease 0x{} is not valid for DN {}, because the DN " +
        "is not in the pending set.",
        Long.toHexString(id), dn.getDatanodeUuid());
    return false;
  }
{code}

Isn't {{id}} equal to 0 for legacy block reports and for manually triggered 
block reports? My understanding is that {{node.leaseId}} is set to zero only 
when the lease is removed. In my patch, the lease is removed by looking at the 
current RPC index in the block report context.

{code}
if (context != null) {
  if (context.getTotalRpcs() == context.getCurRpc() + 1) {
    long leaseId = this.getBlockReportLeaseManager().removeLease(node);
    BlockManagerFaultInjector.getInstance().removeBlockReportLease(
        node, leaseId);
  }
}
{code}

When storage reports are processed out of order, we may set 
{{node.leaseId=0}} before all of the DN's storage reports are processed. 
Therefore, we log a message and continue to process the storage report even if 
{{node.leaseId=0}}. Please let me know if you see any issue with this approach.
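
In other words, the lease check becomes tolerant of stragglers; roughly (a 
sketch with illustrative names, not the patch code):

{code}
public class LeaseCheck {
  // With reordered RPCs, the last report RPC may remove the lease before
  // earlier ones arrive, so leaseId == 0 is logged and accepted rather
  // than rejected.
  static boolean checkLeaseTolerant(long nodeLeaseId, long reportedLeaseId) {
    if (nodeLeaseId == 0) {
      System.out.println("Lease already removed; accepting straggler RPC");
      return true;
    }
    return nodeLeaseId == reportedLeaseId;
  }
}
{code}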

During upgrades, we do not remove zombie storages. Once the upgrade is 
finalized, we go ahead and remove the zombie storages. 
{code}
if (nn.getFSImage().isUpgradeFinalized() && noStaleStorages) {
  Set<String> storageIDsInBlockReport = new HashSet<>();
  if (context.getTotalRpcs() == 1) {
    for (StorageBlockReport report : reports) {
      storageIDsInBlockReport.add(report.getStorage().getStorageID());
    }
    bm.removeZombieStorages(nodeReg, context, storageIDsInBlockReport);
  }
}
{code}

Can you please elaborate on what you meant by "In general, your solution 
doesn't fix the problem during upgrade"? What problems do you foresee?

I am currently investigating why the test 
{{TestAddOverReplicatedStripedBlocks}} failed.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-26 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303272#comment-15303272
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

I looked into why the test {{TestAddOverReplicatedStripedBlocks}} fails with 
patch 004. I don't completely understand why the test relies on zombie storages 
being removed while the DN has stale storages; the test probably needs to be 
modified. Here are my findings:

With the patch, the test fails with the following error:
{code}
java.lang.AssertionError: expected:<10> but was:<11>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks.testProcessOverReplicatedAndMissingStripedBlock(TestAddOverReplicatedStripedBlocks.java:281)
{code}

In the test, {{DFSUtil.createStripedFile}} is invoked in the beginning.
{code}
/**
 * Creates the metadata of a file in striped layout. This method only
 * manipulates the NameNode state without injecting data to DataNode.
 * You should disable periodical heartbeat before using this.
 * @param file Path of the file to create
 * @param dir Parent path of the file
 * @param numBlocks Number of striped block groups to add to the file
 * @param numStripesPerBlk Number of striped cells in each block
 * @param toMkdir
 */
public static void createStripedFile(MiniDFSCluster cluster, Path file,
    Path dir, int numBlocks, int numStripesPerBlk, boolean toMkdir)
    throws Exception {
{code}

This internally calls {{DFSUtil.addBlockToFile}}, which mimics block reports. 
While processing these incremental reports, the NN updates the datanode 
storages. In the test output, you can see the storages being added.
{code}
2016-05-26 17:10:03,330 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
9505a2ad-78f4-45d7-9c13-2ecd92a06866 for DN 127.0.0.1:60835
2016-05-26 17:10:03,331 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
d4bb2f70-4a1e-451f-9d47-a2967f819130 for DN 127.0.0.1:60839
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
841fc92f-fa15-4ced-8487-96ca4e6996d0 for DN 127.0.0.1:60844
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
304aaeeb-e2d0-4427-81c6-c79e4d0b6a4e for DN 127.0.0.1:60849
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
2d046d66-26fc-448f-938c-04dda2ecf34a for DN 127.0.0.1:60853
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
381d3151-e75e-434a-86f8-da5c83f22b19 for DN 127.0.0.1:60857
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
71f72bc9-9c66-478f-a0d7-3f0c7fc23964 for DN 127.0.0.1:60861
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
4dc539f3-b7a9-4145-a313-fa99ca1dd779 for DN 127.0.0.1:60865
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
734ea366-e635-4715-97d5-196bfcdccb18 for DN 127.0.0.1:60869
2016-05-26 17:10:03,334 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
c639de06-e85c-4e93-92d2-506a49d4e41c for DN 127.0.0.1:60835
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
a82ff231-d630-4799-907d-f0a72ff06b38 for DN 127.0.0.1:60839
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
328c3467-0507-45fd-9aac-73a38165f741 for DN 127.0.0.1:60844
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
0b2a3b7f-e065-4e9a-9908-024091393738 for DN 127.0.0.1:60849
2016-05-26 17:10:03,344 [Thread-0] INFO  blockmanagement.DatanodeDescriptor 
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 
3654a0ce-8389-40bf-b8d3-08cc49895a7d for DN 127.0.0.1:60853
2016-05-26 17:

[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-05-26 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303287#comment-15303287
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

If we do keep the check for stale storages before zombie storage removal, 
{{noStaleStorages}} in NameNodeRpcServer should be set to true when 
{{isStorageReport}} is true.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-15 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.006.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-15 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333146#comment-15333146
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

I uploaded another patch (006) that is similar to 005 but doesn't add any new 
RPCs. Please review it. 
In the case that block reports are split, the information about the other 
storages on the DN is sent along with the BR RPC for the last storage, as 
sketched below. The {{TestAddOverReplicatedStripedBlocks}} test passes with 
this patch, since zombie storages are removed even if there are stale storages. 
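
Schematically, the DN side looks like this (a sketch under assumed names, not 
the actual patch, which piggybacks the storage list on the existing report 
structures):

{code}
import java.util.List;

public class SplitReportSender {
  // Stand-in for the per-storage block report RPC; a real DataNode goes
  // through DatanodeProtocol.blockReport.
  static void sendBlockReportRpc(String storageId, List<String> allStorages) {
    System.out.println("report " + storageId
        + (allStorages == null ? "" : ", full storage list: " + allStorages));
  }

  // Attach the DN's complete storage list only to the last RPC of a split
  // report, so the NN can prune zombies without an extra round trip.
  static void sendSplitReport(List<String> storageIds) {
    for (int i = 0; i < storageIds.size(); i++) {
      boolean last = (i == storageIds.size() - 1);
      sendBlockReportRpc(storageIds.get(i), last ? storageIds : null);
    }
  }
}
{code}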

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Colin Patrick McCabe
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-15 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi reassigned HDFS-10301:


Assignee: Vinitha Reddy Gankidi  (was: Colin Patrick McCabe)

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-16 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334862#comment-15334862
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

The failed tests don't seem to be caused by the patch; they pass locally with 
the patch. 

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-23 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347427#comment-15347427
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Thanks for the review, Colin. I have addressed your comments below:
{quote}
This won't work in the presence of reordered RPCs. If the RPCs are reordered so 
that curRpc 1 arrives before curRpc 0, the lease will be removed and RPC 0 will 
be rejected.
{quote}

If curRpc 1 arrives before curRpc 0, the lease will be removed and 
{{node.leaseId}} will be set to zero. I have modified BlockReportLeaseManager 
to return true when {{node.leaseId == 0}}. I explained the same in my previous 
comment:
https://issues.apache.org/jira/browse/HDFS-10301?focusedCommentId=15299255&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15299255
Please let me know if you see any issues with this approach.

{quote}
Using object equality to compare two BlockListAsLongs objects is very 
surprising to anyone reading the code.
{quote}
 
I uploaded a new patch (007) to address this issue. I have added a method 
{{isStorageReportOnly()}} to BlockListAsLongs that returns true only for a 
STORAGE_REPORT_ONLY BlockListAsLongs.

In the upgrade case, there is no way to detect the zombie storages, since old 
DNs do not send the information about their storages in the last BR RPC. In 
practice, hot-swapping of DN drives and upgrading the DN may not happen at the 
same time. 



> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-06-23 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.007.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.01.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-13 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.008.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-13 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375518#comment-15375518
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

I apologize for attaching a wrong patch. Thanks for pointing it out [~cmccabe]. 
I have uploaded the correct patch now (008), which calls the isStorageReport 
method. Adding an optional list of storage ID strings in the .proto file would 
add more overhead, since these optional parameters would have to be sent with 
default values in all other block report RPCs in addition to the last RPC of 
the block report. I can add more comments in the code to explain what's going 
on. Thoughts?

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.009.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-18 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382959#comment-15382959
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

Attached a new patch (009) addressing Konstantin's comments. I cannot make 
STORAGE_REPORT final since it needs to be referenced from a static context. 
Instead, I renamed it to 'Storage_Report'. 

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-18 Thread Vinitha Reddy Gankidi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinitha Reddy Gankidi updated HDFS-10301:
-
Attachment: HDFS-10301.010.patch

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-18 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383224#comment-15383224
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

I have made STORAGE_REPORT {{static final}} in the 010 patch.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

2016-07-19 Thread Vinitha Reddy Gankidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384611#comment-15384611
 ] 

Vinitha Reddy Gankidi commented on HDFS-10301:
--

> For example, an RPC could have gotten duplicated by something in the network. 
[~cmccabe] Doesn't TCP ignore duplicate packets? Can you explain how this can 
happen? If the RPC does get duplicated, then we shouldn't return true when 
{{node.leaseId == 0}}, right?

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> 
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.1
>Reporter: Konstantin Shvachko
>Assignee: Vinitha Reddy Gankidi
>Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.sample.patch, 
> zombieStorageLogs.rtf
>
>
> When NameNode is busy a DataNode can timeout sending a block report. Then it 
> sends the block report again. Then NameNode, while processing these two reports 
> at the same time can interleave processing storages from different reports. 
> This screws up the blockReportId field, which makes NameNode think that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


