[jira] [Created] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
Xiaobo Peng created HDFS-4222:
---------------------------------

             Summary: NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
                 Key: HDFS-4222
                 URL: https://issues.apache.org/jira/browse/HDFS-4222
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 0.23.3
            Reporter: Xiaobo Peng
            Priority: Minor

For Hadoop clusters configured to access directory information by LDAP, FSNamesystem calls on behalf of DFS clients might hang due to LDAP issues (including LDAP access issues caused by networking problems) while holding the single FSNamesystem lock. That makes the NN unresponsive and causes loss of the heartbeats from DNs.

The place LDAP gets accessed by FSNamesystem calls is the instantiation of FSPermissionChecker, which could be moved out of the lock scope since the instantiation does not need the FSNamesystem lock. After the move, a hung DFS client will not affect other threads by hogging the single lock. This is especially helpful when we use separate RPC servers for ClientProtocol and DatanodeProtocol, since DatanodeProtocol calls do not need to access LDAP. So even if DFS clients hang due to LDAP issues, the NN will still be able to process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502504#comment-13502504 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

The following code snippets show a simple way to change FSNamesystem::renameTo in branch-0.23.4. Changes to other methods are similar.

Existing code:

/** Rename src to dst */
void renameTo(String src, String dst, Options.Rename... options)
    throws IOException, UnresolvedLinkException {
  ...
  writeLock();
  try {
    renameToInternal(src, dst, options);
    if (auditLog.isInfoEnabled() && isExternalInvocation()) {
      resultingStat = dir.getFileInfo(dst, false);
    }
  } finally {
    writeUnlock();
  }
  ...
}

private void renameToInternal(String src, String dst,
    Options.Rename... options) throws IOException {
  ...
  if (isPermissionEnabled) {
    checkParentAccess(src, FsAction.WRITE);
    checkAncestorAccess(dst, FsAction.WRITE);
  }
  ...
}

private FSPermissionChecker checkParentAccess(String path, FsAction access)
    throws AccessControlException, UnresolvedLinkException {
  return checkPermission(path, false, null, access, null, null);
}

private FSPermissionChecker checkPermission(String path, boolean doCheckOwner,
    FsAction ancestorAccess, FsAction parentAccess, FsAction access,
    FsAction subAccess) throws AccessControlException, UnresolvedLinkException {
  FSPermissionChecker pc = new FSPermissionChecker(
      fsOwner.getShortUserName(), supergroup);
  if (!pc.isSuper) {
    dir.waitForReady();
    readLock();
    try {
      pc.checkPermission(path, dir.rootDir, doCheckOwner, ancestorAccess,
          parentAccess, access, subAccess);
    } finally {
      readUnlock();
    }
  }
  return pc;
}

Proposed changes:

/** Rename src to dst */
void renameTo(String src, String dst, Options.Rename... options)
    throws IOException, UnresolvedLinkException {
  ...
  FSPermissionChecker pc = new FSPermissionChecker(
      fsOwner.getShortUserName(), supergroup);
  writeLock();
  try {
    renameToInternal(pc, src, dst, options);
    if (auditLog.isInfoEnabled() && isExternalInvocation()) {
      resultingStat = dir.getFileInfo(dst, false);
    }
  } finally {
    writeUnlock();
  }
  ...
}

private void renameToInternal(FSPermissionChecker pc, String src, String dst,
    Options.Rename... options) throws IOException {
  ...
  if (isPermissionEnabled) {
    checkParentAccess(pc, src, FsAction.WRITE);
    checkAncestorAccess(pc, dst, FsAction.WRITE);
  }
  ...
}

private FSPermissionChecker checkParentAccess(FSPermissionChecker pc,
    String path, FsAction access)
    throws AccessControlException, UnresolvedLinkException {
  return checkPermission(pc, path, false, null, access, null, null);
}

private FSPermissionChecker checkPermission(FSPermissionChecker pc,
    String path, boolean doCheckOwner, FsAction ancestorAccess,
    FsAction parentAccess, FsAction access, FsAction subAccess)
    throws AccessControlException, UnresolvedLinkException {
  if (!pc.isSuper) {
    dir.waitForReady();
    readLock();
    try {
      pc.checkPermission(path, dir.rootDir, doCheckOwner, ancestorAccess,
          parentAccess, access, subAccess);
    } finally {
      readUnlock();
    }
  }
  return pc;
}
[jira] [Updated] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4222:
------------------------------

    Assignee: Xiaobo Peng
[jira] [Commented] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502521#comment-13502521 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

Sorry, the former comment did not format well. I'm reformatting it here. The following code snippets show a simple way to change FSNamesystem::renameTo in branch-0.23.4. Changes to other methods are similar.

Existing code:

{code:borderStyle=solid}
/** Rename src to dst */
void renameTo(String src, String dst, Options.Rename... options)
    throws IOException, UnresolvedLinkException {
  ...
  writeLock();
  try {
    renameToInternal(src, dst, options);
    if (auditLog.isInfoEnabled() && isExternalInvocation()) {
      resultingStat = dir.getFileInfo(dst, false);
    }
  } finally {
    writeUnlock();
  }
  ...
}

private void renameToInternal(String src, String dst,
    Options.Rename... options) throws IOException {
  ...
  if (isPermissionEnabled) {
    checkParentAccess(src, FsAction.WRITE);
    checkAncestorAccess(dst, FsAction.WRITE);
  }
  ...
}

private FSPermissionChecker checkParentAccess(String path, FsAction access)
    throws AccessControlException, UnresolvedLinkException {
  return checkPermission(path, false, null, access, null, null);
}

private FSPermissionChecker checkPermission(String path, boolean doCheckOwner,
    FsAction ancestorAccess, FsAction parentAccess, FsAction access,
    FsAction subAccess) throws AccessControlException, UnresolvedLinkException {
  FSPermissionChecker pc = new FSPermissionChecker(
      fsOwner.getShortUserName(), supergroup);
  if (!pc.isSuper) {
    dir.waitForReady();
    readLock();
    try {
      pc.checkPermission(path, dir.rootDir, doCheckOwner, ancestorAccess,
          parentAccess, access, subAccess);
    } finally {
      readUnlock();
    }
  }
  return pc;
}
{code}

Proposed changes:

{code:borderStyle=solid}
/** Rename src to dst */
void renameTo(String src, String dst, Options.Rename... options)
    throws IOException, UnresolvedLinkException {
  ...
  FSPermissionChecker pc = new FSPermissionChecker(
      fsOwner.getShortUserName(), supergroup);
  writeLock();
  try {
    renameToInternal(pc, src, dst, options);
    if (auditLog.isInfoEnabled() && isExternalInvocation()) {
      resultingStat = dir.getFileInfo(dst, false);
    }
  } finally {
    writeUnlock();
  }
  ...
}

private void renameToInternal(FSPermissionChecker pc, String src, String dst,
    Options.Rename... options) throws IOException {
  ...
  if (isPermissionEnabled) {
    checkParentAccess(pc, src, FsAction.WRITE);
    checkAncestorAccess(pc, dst, FsAction.WRITE);
  }
  ...
}

private FSPermissionChecker checkParentAccess(FSPermissionChecker pc,
    String path, FsAction access)
    throws AccessControlException, UnresolvedLinkException {
  return checkPermission(pc, path, false, null, access, null, null);
}

private FSPermissionChecker checkPermission(FSPermissionChecker pc,
    String path, boolean doCheckOwner, FsAction ancestorAccess,
    FsAction parentAccess, FsAction access, FsAction subAccess)
    throws AccessControlException, UnresolvedLinkException {
  if (!pc.isSuper) {
    dir.waitForReady();
    readLock();
    try {
      pc.checkPermission(path, dir.rootDir, doCheckOwner, ancestorAccess,
          parentAccess, access, subAccess);
    } finally {
      readUnlock();
    }
  }
  return pc;
}
{code}
[jira] [Updated] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4222:
------------------------------

    Attachment: hdfs-4222-release-1.0.3.patch
                hdfs-4222-branch-0.23.3.patch

I manually tested the patch on release-1.0.3. It looks like we also get some performance gain. I will test hdfs-4222-branch-0.23.3.patch soon.
[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576388#comment-13576388 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

Please review my patches. hdfs-4222-branch-0.23.3.patch should be easier to review; hdfs-4222-release-1.0.3.patch is messy since moving synchronized scopes results in reformatting.

I think we should enclose all FSPermissionChecker instantiations in "if (isPermissionEnabled)" blocks as follows:

FSPermissionChecker pc = null;
if (isPermissionEnabled) {
  pc = new FSPermissionChecker(fsOwner.getShortUserName(), supergroup);
}

But I have not done so because I want to limit the impact of these changes. If you know it is totally safe to make such changes, please let me know. Thanks a lot.
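The idea discussed in this thread, building the (potentially LDAP-backed, and therefore slow) permission checker before taking the namesystem lock, and only when permission checking is enabled, can be sketched as follows. This is a minimal illustration only; the class and method names here are stand-ins, not the actual FSNamesystem API.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedCheckerSketch {
    private static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    static boolean isPermissionEnabled = true;

    // Stand-in for FSPermissionChecker; in the real code its constructor
    // is where a group lookup could block on LDAP.
    static class PermissionChecker {
        final String user;
        PermissionChecker(String user) { this.user = user; }
    }

    static String renameTo(String src, String dst) {
        // Potentially slow construction happens OUTSIDE the lock,
        // and only if permissions are enabled.
        PermissionChecker pc =
            isPermissionEnabled ? new PermissionChecker("hdfs") : null;
        fsLock.writeLock().lock();
        try {
            if (pc != null) {
                // cheap, in-memory permission checks against src/dst go here
            }
            return src + " -> " + dst;  // placeholder for the actual rename
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(renameTo("/a", "/b"));
    }
}
```

With this shape, a hung constructor stalls only the calling thread; threads serving other protocols never wait on it while queued for the lock.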
[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579683#comment-13579683 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

What branches should my patches be prepared for? trunk, branch-1, branch-2, or branch-0.23.6? Thanks.
[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581637#comment-13581637 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

Suresh, thanks a lot. I should finish what you suggested within 2 days.
[jira] [Updated] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4222:
------------------------------

    Attachment: HDFS-4222-branch-1.patch

Many thanks to Suresh, Kihwal, and Daryn for the help. I am uploading the patch for branch-1; please review. I had to move the synchronized keyword from the method level to the block level, so many lines got reformatted. I reviewed my patch several times; it was time consuming. Thanks a lot.
[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585654#comment-13585654 ] Xiaobo Peng commented on HDFS-4222:
-----------------------------------

Thank you very much for the quick action, Suresh.
[jira] [Created] (HDFS-4535) port HDFS-1457 to branch-1.1
Xiaobo Peng created HDFS-4535:
---------------------------------

             Summary: port HDFS-1457 to branch-1.1
                 Key: HDFS-4535
                 URL: https://issues.apache.org/jira/browse/HDFS-4535
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Xiaobo Peng
            Assignee: Xiaobo Peng
            Priority: Minor

port HDFS-1457 (configuration option to enable limiting the transfer rate used when sending the image and edits for checkpointing) to branch-1.1
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

          Component/s: namenode
    Affects Version/s: 1.1.2
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

    Attachment: HDFS-4535-branch-1.1.patch
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

    Attachment: (was: HDFS-4535-branch-1.1.patch)
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

    Attachment: HDFS-4535-branch-1.1.patch
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

    Attachment: (was: HDFS-4535-branch-1.1.patch)
[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4535:
------------------------------

    Attachment: HDFS-4535-branch-1.1.patch

I deleted the old patch since it missed the removed BlockTransferThrottler.java. I performed manual tests similar to those for HDFS-1457 and HDFS-3515 on the patch: I verified that the throttle method was entered, that the throttling period was correct, and that the fsimage and edits log files were throttled at the rate bandwidthPerSec. In short, the patch works.
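The transfer throttling being ported works, broadly, by budgeting a number of bytes per short period and sleeping once the budget is spent. The following is a hedged sketch of that idea only, not the actual throttler class from the patch; the period length and field names are illustrative.

```java
public class ThrottlerSketch {
    private final long bytesPerPeriod;   // byte budget per period
    private final long periodMillis;     // period length
    private long curReserve;             // remaining budget this period
    private long periodStart;

    public ThrottlerSketch(long bandwidthPerSec) {
        this.periodMillis = 500;         // illustrative period
        this.bytesPerPeriod = bandwidthPerSec * periodMillis / 1000;
        this.curReserve = bytesPerPeriod;
        this.periodStart = System.currentTimeMillis();
    }

    // Call after sending numBytes; blocks until the observed rate
    // falls back under bandwidthPerSec.
    public synchronized void throttle(long numBytes) {
        curReserve -= numBytes;
        while (curReserve <= 0) {
            long elapsed = System.currentTimeMillis() - periodStart;
            if (elapsed >= periodMillis) {
                // A full period has passed: roll the window forward
                // and refill the budget.
                periodStart = System.currentTimeMillis();
                curReserve += bytesPerPeriod;
            } else {
                try {
                    Thread.sleep(periodMillis - elapsed); // wait out the period
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

A sender would interleave `write(buf)` with `throttle(buf.length)`; exceeding the per-period budget makes the call sleep for the remainder of the period, which is the behavior the manual test above checks for fsimage and edits transfers.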
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981:
------------------------------

    Attachment: HDFS-3981-trunk.patch
                HDFS-3981-branch-2.patch
                HDFS-3981-branch-0.23.patch

> access time is set without holding writelock in FSNamesystem
>
>                 Key: HDFS-3981
>                 URL: https://issues.apache.org/jira/browse/HDFS-3981
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.3
>            Reporter: Xiaobo Peng
>            Assignee: Xiaobo Peng
>            Priority: Minor
>         Attachments: HDFS-3981-branch-0.23.4.patch, HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
> An incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to updating times without the write lock. In most cases this condition will force {{FSNamesystem.getBlockLocations()}} to hold the write lock even if times do not need to be updated.
[jira] [Commented] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597834#comment-13597834 ] Xiaobo Peng commented on HDFS-3981: --- Attached the patches for trunk, branch-2 and 0.23. Note: my previous comment "Also, seems we need to release readlock before trying to acquire writelock..." is invalid.
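The two-attempt locking pattern that HDFS-3981 fixes can be sketched generically as follows: first try under the read lock, and if the access time must be updated, restart the whole operation under the write lock. This is a simplified stand-in with hypothetical class and field names, not the actual FSNamesystem code; the `finally` block is also why the earlier "release readlock before acquiring writelock" concern was invalid, since the read lock is always released before the second attempt.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the read-then-write retry pattern (not FSNamesystem).
public class AccessTimeDemo {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long accessTime = 0;
    private final long precision = 1000;  // like dfs.access.time.precision (ms)

    public long readAndMaybeUpdateAccessTime() {
        for (int attempt = 0; attempt < 2; attempt++) {
            if (attempt == 0) {
                lock.readLock().lock();   // first attempt: read lock only
            } else {
                lock.writeLock().lock();  // second attempt: write lock
            }
            try {
                long now = System.currentTimeMillis();
                if (now > accessTime + precision) {
                    // We need to mutate state. Under only the read lock,
                    // restart the entire operation with the write lock
                    // instead of mutating here -- mutating here is the bug.
                    if (attempt == 0) {
                        continue;  // finally below releases the read lock
                    }
                    accessTime = now;  // safe: write lock is held
                }
                return accessTime;
            } finally {
                if (attempt == 0) {
                    lock.readLock().unlock();
                } else {
                    lock.writeLock().unlock();
                }
            }
        }
        throw new AssertionError("unreachable: second attempt always returns");
    }
}
```

Note that `continue` inside the `try` still runs the `finally` block, so the read lock is released before the write-lock attempt begins.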
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Affects Version/s: 0.23.5 Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Priority: Major (was: Minor)
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Affects Version/s: 2.0.3-alpha
[jira] [Commented] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599342#comment-13599342 ] Xiaobo Peng commented on HDFS-4535: --- Sure, Suresh. I thought you knew HDFS-3515 already ported it to branch-1 (I included that link when creating this one) and still wanted a patch for branch-1.1.
[jira] [Commented] (HDFS-4535) port HDFS-1457 to branch-1.1
[ https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599618#comment-13599618 ] Xiaobo Peng commented on HDFS-4535: --- No problem. I need to know those practices. Thanks, Suresh.
[jira] [Updated] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
[ https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-4222: -- Summary: NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues (was: NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues) > NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to > use LDAP and LDAP has issues > > > Key: HDFS-4222 > URL: https://issues.apache.org/jira/browse/HDFS-4222 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 1.0.0, 0.23.3, 2.0.0-alpha >Reporter: Xiaobo Peng >Assignee: Xiaobo Peng >Priority: Minor > Fix For: 1.2.0, 0.23.7, 2.0.5-beta > > Attachments: HDFS-4222.23.patch, hdfs-4222-branch-0.23.3.patch, > HDFS-4222-branch-1.patch, HDFS-4222.patch, HDFS-4222.patch, > hdfs-4222-release-1.0.3.patch > > > For Hadoop clusters configured to access directory information via LDAP, > FSNamesystem calls made on behalf of DFS clients might hang on LDAP issues > (including LDAP access failures caused by networking problems) while holding the > single FSNamesystem lock. That leaves the NN unresponsive and causes it to lose > the heartbeats from DNs. > The place where FSNamesystem calls access LDAP is the instantiation of > FSPermissionChecker, which could be moved out of the lock scope since the > instantiation does not need the FSNamesystem lock. After the move, a hung DFS > client will not affect other threads by hogging the single lock. This is > especially helpful when we use separate RPC servers for ClientProtocol and > DatanodeProtocol, since the calls for DatanodeProtocol do not need to access > LDAP. So even if DFS clients hang due to LDAP issues, the NN will still be > able to process the requests (including heartbeats) from DNs.
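The restructuring proposed in HDFS-4222, building the FSPermissionChecker before taking the namesystem lock, can be sketched as follows. All class and method names here are simplified stand-ins for the real HDFS types, and the permissive `check` body is a placeholder.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of moving a potentially slow construction out of the lock scope
// (illustrative stand-ins, not the real FSNamesystem/FSPermissionChecker).
public class LockScopeDemo {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    /** Stand-in for FSPermissionChecker; its constructor may block. */
    static class PermissionChecker {
        final String user;
        PermissionChecker(String user) {
            // In real HDFS this resolves the user's groups, which can hang
            // when group mapping is backed by a slow or unreachable LDAP
            // server. Here it is instantaneous.
            this.user = user;
        }
        boolean check(String path) {
            return true;  // placeholder: a real checker inspects the path
        }
    }

    // BEFORE (problematic): checker built while holding the namesystem lock,
    // so a hung LDAP lookup stalls every other lock waiter (incl. heartbeats).
    public boolean checkUnderLock(String user, String path) {
        fsLock.writeLock().lock();
        try {
            PermissionChecker pc = new PermissionChecker(user);  // may hang
            return pc.check(path);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    // AFTER (proposed): checker built outside the lock; only the actual
    // namespace check runs under the lock, so LDAP stalls hurt one client.
    public boolean checkOutsideLock(String user, String path) {
        PermissionChecker pc = new PermissionChecker(user);  // no lock held
        fsLock.readLock().lock();
        try {
            return pc.check(path);
        } finally {
            fsLock.readLock().unlock();
        }
    }
}
```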
[jira] [Updated] (HDFS-3981) access time is set without holding FSNamesystem write lock
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Summary: access time is set without holding FSNamesystem write lock (was: access time is set without holding writelock in FSNamesystem)
[jira] [Updated] (HDFS-3655) datanode recoverRbw could hang sometimes
[ https://issues.apache.org/jira/browse/HDFS-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3655: -- Attachment: HDFS-3655-0.22-use-join-instead-of-wait.patch HDFS-3655-0.22.patch keeps the code that waits for the old writer to terminate in the stopWriter method of class ReplicaInPipeline; the price is passing FSDataset into it so that the monitor can be released while waiting. HDFS-3655-0.22-use-join-instead-of-wait.patch moves the join/wait code from ReplicaInPipeline to FSDataset. It looks cleaner, but some FSDataset code has to be refactored to avoid duplicating it. > datanode recoverRbw could hang sometimes > --- > > Key: HDFS-3655 > URL: https://issues.apache.org/jira/browse/HDFS-3655 > Project: Hadoop HDFS > Issue Type: Bug > Components: data-node >Affects Versions: 0.22.0, 1.0.3, 2.0.1-alpha >Reporter: Ming Ma > Fix For: 0.22.1 > > Attachments: HDFS-3655-0.22-use-join-instead-of-wait.patch, > HDFS-3655-0.22.patch > > > This bug seems to apply to 0.22 and Hadoop 2.0. I will upload the initial fix > done by my colleague Xiaobo Peng shortly (there is a logistics issue > being worked out so that he can upload patches himself later). > recoverRbw tries to kill the old writer thread, but it holds the lock (the FSDataset > monitor object) that the old writer thread is waiting on (for example, in the call to data.getTmpInputStreams).
> "DataXceiver for client /10.110.3.43:40193 [Receiving block blk_-3037542385914640638_57111747 client=DFSClient_attempt_201206021424_0001_m_000401_0]" daemon prio=10 tid=0x7facf8111800 nid=0x6b64 in Object.wait() [0x7facd1ddb000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1186)
>         - locked <0x0007856c1200> (a org.apache.hadoop.util.Daemon)
>         at java.lang.Thread.join(Thread.java:1239)
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:158)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.recoverRbw(FSDataset.java:1347)
>         - locked <0x0007838398c0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlockInternal(DataXceiver.java:391)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:327)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:405)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:344)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
>         at java.lang.Thread.run(Thread.java:662)
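The hang in the stack trace above comes from joining the old writer thread while holding the FSDataset monitor, which the old writer itself needs in order to finish. A minimal sketch of the deadlock-prone shape and the safer ordering follows; the class and method names are illustrative, not the real datanode code.

```java
// Sketch of the HDFS-3655 hang and its fix (illustrative names only).
public class StopWriterDemo {
    private final Object dataset = new Object();  // stands in for FSDataset

    // Deadlock-prone: join() while holding the dataset monitor. If the old
    // writer is blocked waiting to enter a synchronized(dataset) section
    // (e.g. getTmpInputStreams), it can never exit, and join() never returns.
    public void stopWriterDeadlockProne(Thread oldWriter)
            throws InterruptedException {
        synchronized (dataset) {
            oldWriter.interrupt();
            oldWriter.join();  // hangs if oldWriter needs `dataset`
        }
    }

    // Safer: interrupt under the monitor if needed, but join only after
    // releasing it, so the old writer can acquire `dataset` and terminate.
    public void stopWriterSafe(Thread oldWriter) throws InterruptedException {
        synchronized (dataset) {
            oldWriter.interrupt();
        }
        oldWriter.join();  // old writer is free to make progress now
    }
}
```

This mirrors the two attached patches: one releases the monitor while waiting, the other moves the join out of the monitor-holding code entirely.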
[jira] [Created] (HDFS-3981) access time is set without holding writelock in FSNamesystem
Xiaobo Peng created HDFS-3981: - Summary: access time is set without holding writelock in FSNamesystem Key: HDFS-3981 URL: https://issues.apache.org/jira/browse/HDFS-3981 Project: Hadoop HDFS Issue Type: Bug Components: name-node Affects Versions: 2.0.1-alpha Reporter: Xiaobo Peng Priority: Minor
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Description: If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks like we need to change the code to

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
    // if we have to set access time but we only have the readlock, then
    // restart this entire operation with the writeLock.
    if (attempt == 0) {
      continue;
    }
    dir.setTimes(src, inode, -1, now, false);
  }
}

Also, seems we need to release readlock before trying to acquire writelock. Otherwise, we might end up with still holding readlock after the function call. The code in branch-2.0.1-alpha

private LocatedBlocks getBlockLocationsUpdateTimes(String src,
    long offset, long length, boolean doAccessTime, boolean needBlockToken)
    throws FileNotFoundException, UnresolvedLinkException, IOException {
  for (int attempt = 0; attempt < 2; attempt++) {
    if (attempt == 0) { // first attempt is with readlock
      readLock();
    } else {            // second attempt is with write lock
      writeLock();      // writelock is needed to set accesstime
    }
    try {
      checkOperation(OperationCategory.READ);
      // if the namenode is in safemode, then do not update access time
      if (isInSafeMode()) {
        doAccessTime = false;
      }
      long now = now();
      INodeFile inode = dir.getFileINode(src);
      if (inode == null) {
        throw new FileNotFoundException("File does not exist: " + src);
      }
      assert !inode.isLink();
      if (doAccessTime && isAccessTimeSupported()) {
        if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
          // if we have to set access time but we only have the readlock, then
          // restart this entire operation with the writeLock.
          if (attempt == 0) {
            continue;
          }
        }
        dir.setTimes(src, inode, -1, now, false);
      }
      return blockManager.createLocatedBlocks(inode.getBlocks(),
          inode.computeFileSize(false), inode.isUnderConstruction(),
          offset, length, needBlockToken);
    } finally {
      if (attempt == 0) {
        readUnlock();
      } else {
        writeUnlock();
      }
    }
  }
  return null; // can never reach here
}
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Description: If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks like we need to change the code to {noformat} if (doAccessTime && isAccessTimeSupported()) { if (now > inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. if (attempt == 0) { continue; } dir.setTimes(src, inode, -1, now, false); } } {noformat} Also, seems we need to release readlock before trying to acquire writelock. Otherwise, we might end up with still holding readlock after the function call. The code in branch-2.0.1-alpha private LocatedBlocks getBlockLocationsUpdateTimes(String src, long offset, long length, boolean doAccessTime, boolean needBlockToken) throws FileNotFoundException, UnresolvedLinkException, IOException { for (int attempt = 0; attempt < 2; attempt++) { if (attempt == 0) { // first attempt is with readlock readLock(); } else { // second attempt is with write lock writeLock(); // writelock is needed to set accesstime } try { checkOperation(OperationCategory.READ); // if the namenode is in safemode, then do not update access time if (isInSafeMode()) { doAccessTime = false; } long now = now(); INodeFile inode = dir.getFileINode(src); if (inode == null) { throw new FileNotFoundException("File does not exist: " + src); } assert !inode.isLink(); if (doAccessTime && isAccessTimeSupported()) { if (now <= inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. 
writeLock. if (attempt == 0) { continue; } } dir.setTimes(src, inode, -1, now, false); } return blockManager.createLocatedBlocks(inode.getBlocks(), inode.computeFileSize(false), inode.isUnderConstruction(), offset, length, needBlockToken); } finally { if (attempt == 0) { readUnlock(); } else { writeUnlock(); } } } return null; // can never reach here }
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Description: If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks like we need to change the code to {noformat} if (doAccessTime && isAccessTimeSupported()) { if (now > inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. if (attempt == 0) { continue; } dir.setTimes(src, inode, -1, now, false); } } {noformat} Also, seems we need to release readlock before trying to acquire writelock. Otherwise, we might end up with still holding readlock after the function call. The following code is from branch-2.0.1-alpha {noformat} private LocatedBlocks getBlockLocationsUpdateTimes(String src, long offset, long length, boolean doAccessTime, boolean needBlockToken) throws FileNotFoundException, UnresolvedLinkException, IOException { for (int attempt = 0; attempt < 2; attempt++) { if (attempt == 0) { // first attempt is with readlock readLock(); } else { // second attempt is with write lock writeLock(); // writelock is needed to set accesstime } try { checkOperation(OperationCategory.READ); // if the namenode is in safemode, then do not update access time if (isInSafeMode()) { doAccessTime = false; } long now = now(); INodeFile inode = dir.getFileINode(src); if (inode == null) { throw new FileNotFoundException("File does not exist: " + src); } assert !inode.isLink(); if (doAccessTime && isAccessTimeSupported()) { if (now <= inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. 
writeLock. if (attempt == 0) { continue; } } dir.setTimes(src, inode, -1, now, false); } return blockManager.createLocatedBlocks(inode.getBlocks(), inode.computeFileSize(false), inode.isUnderConstruction(), offset, length, needBlockToken); } finally { if (attempt == 0) { readUnlock(); } else { writeUnlock(); } } } return null; // can never reach here } {noformat}
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Description: If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without writelock. So there will be races and an earlier access time might overwrite a later access time. Looks like we need to change the code to {noformat} if (doAccessTime && isAccessTimeSupported()) { if (now > inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. if (attempt == 0) { continue; } dir.setTimes(src, inode, -1, now, false); } } {noformat} Also, seems we need to release readlock before trying to acquire writelock. Otherwise, we might end up with still holding readlock after the function call. The following code is from branch-2.0.1-alpha {noformat} private LocatedBlocks getBlockLocationsUpdateTimes(String src, long offset, long length, boolean doAccessTime, boolean needBlockToken) throws FileNotFoundException, UnresolvedLinkException, IOException { for (int attempt = 0; attempt < 2; attempt++) { if (attempt == 0) { // first attempt is with readlock readLock(); } else { // second attempt is with write lock writeLock(); // writelock is needed to set accesstime } try { checkOperation(OperationCategory.READ); // if the namenode is in safemode, then do not update access time if (isInSafeMode()) { doAccessTime = false; } long now = now(); INodeFile inode = dir.getFileINode(src); if (inode == null) { throw new FileNotFoundException("File does not exist: " + src); } assert !inode.isLink(); if (doAccessTime && isAccessTimeSupported()) { if (now <= inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. 
if (attempt == 0) { continue; } } dir.setTimes(src, inode, -1, now, false); } return blockManager.createLocatedBlocks(inode.getBlocks(), inode.computeFileSize(false), inode.isUnderConstruction(), offset, length, needBlockToken); } finally { if (attempt == 0) { readUnlock(); } else { writeUnlock(); } } } return null; // can never reach here } {noformat}
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981: -- Description: If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without writelock. So there will be races and an earlier access time might overwrite a later access time. Looks like we need to change the code to {noformat} if (doAccessTime && isAccessTimeSupported()) { if (now > inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the writeLock. if (attempt == 0) { continue; } dir.setTimes(src, inode, -1, now, false); } } {noformat} Also, seems we need to release readlock before trying to acquire writelock. Otherwise, we might end up with still holding readlock after the function call. The following code is from branch-2.0.1-alpha {code:title=FSNamesystem.java|borderStyle=solid} private LocatedBlocks getBlockLocationsUpdateTimes(String src, long offset, long length, boolean doAccessTime, boolean needBlockToken) throws FileNotFoundException, UnresolvedLinkException, IOException { for (int attempt = 0; attempt < 2; attempt++) { if (attempt == 0) { // first attempt is with readlock readLock(); } else { // second attempt is with write lock writeLock(); // writelock is needed to set accesstime } try { checkOperation(OperationCategory.READ); // if the namenode is in safemode, then do not update access time if (isInSafeMode()) { doAccessTime = false; } long now = now(); INodeFile inode = dir.getFileINode(src); if (inode == null) { throw new FileNotFoundException("File does not exist: " + src); } assert !inode.isLink(); if (doAccessTime && isAccessTimeSupported()) { if (now <= inode.getAccessTime() + getAccessTimePrecision()) { // if we have to set access time but we only have the readlock, then // restart this entire operation with the 
writeLock. if (attempt == 0) { continue; } } dir.setTimes(src, inode, -1, now, false); } return blockManager.createLocatedBlocks(inode.getBlocks(), inode.computeFileSize(false), inode.isUnderConstruction(), offset, length, needBlockToken); } finally { if (attempt == 0) { readUnlock(); } else { writeUnlock(); } } } return null; // can never reach here } {code}
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981:
--
Description:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we will call dir.setTimes(src, inode, -1, now, false) without the write lock. So there will be races, and an earlier access time might overwrite a later access time. It looks like we need to change the code to
{code}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
    // if we have to set access time but we only have the readlock, then
    // restart this entire operation with the writeLock.
    if (attempt == 0) {
      continue;
    }
    dir.setTimes(src, inode, -1, now, false);
  }
}
{code}
Also, it seems we need to release the read lock before trying to acquire the write lock. Otherwise, we might end up still holding the read lock after the function call. Or we could simply remove the condition "if (attempt == 0)" for readUnlock(), i.e. readUnlock() should be called even if attempt is 1.
The following code is from branch-2.0.1-alpha:
{code:title=FSNamesystem.java|borderStyle=solid}
private LocatedBlocks getBlockLocationsUpdateTimes(String src, long offset,
    long length, boolean doAccessTime, boolean needBlockToken)
    throws FileNotFoundException, UnresolvedLinkException, IOException {
  for (int attempt = 0; attempt < 2; attempt++) {
    if (attempt == 0) { // first attempt is with readlock
      readLock();
    } else {            // second attempt is with write lock
      writeLock();      // writelock is needed to set accesstime
    }
    try {
      checkOperation(OperationCategory.READ);
      // if the namenode is in safemode, then do not update access time
      if (isInSafeMode()) {
        doAccessTime = false;
      }
      long now = now();
      INodeFile inode = dir.getFileINode(src);
      if (inode == null) {
        throw new FileNotFoundException("File does not exist: " + src);
      }
      assert !inode.isLink();
      if (doAccessTime && isAccessTimeSupported()) {
        if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
          // if we have to set access time but we only have the readlock, then
          // restart this entire operation with the writeLock.
          if (attempt == 0) {
            continue;
          }
        }
        dir.setTimes(src, inode, -1, now, false);
      }
      return blockManager.createLocatedBlocks(inode.getBlocks(),
          inode.computeFileSize(false), inode.isUnderConstruction(),
          offset, length, needBlockToken);
    } finally {
      if (attempt == 0) {
        readUnlock();
      } else {
        writeUnlock();
      }
    }
  }
  return null; // can never reach here
}
{code}
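The race described in this issue can be reduced to a small example: two callers each read the clock and then store their value without coordination, so a thread that read an earlier clock value can overwrite a later access time. The sketch below is illustrative, not HDFS code; it contrasts a racy last-write-wins setter with a monotonic one. HDFS-3981 itself fixes the problem by taking the write lock, not by an atomic max.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the race: setRacy lets an older timestamp clobber
// a newer one, while setMonotonic keeps the maximum so the stored access
// time never goes backwards. Hypothetical class, not HDFS code.
public class MonotonicAccessTime {
    private final AtomicLong accessTime = new AtomicLong();

    /** Racy last-write-wins update: an older 'now' can clobber a newer one. */
    public void setRacy(long now) {
        accessTime.set(now);
    }

    /** Monotonic update: keeps the maximum of stored and proposed times. */
    public void setMonotonic(long now) {
        accessTime.accumulateAndGet(now, Math::max);
    }

    public long get() {
        return accessTime.get();
    }
}
```

With setRacy, interleaving two threads that computed `now` in the opposite order of their stores reproduces the bug; with setMonotonic, store order no longer matters.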
[jira] [Updated] (HDFS-3655) Datanode recoverRbw could hang sometime
[ https://issues.apache.org/jira/browse/HDFS-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3655:
--
Assignee: Xiaobo Peng

> Datanode recoverRbw could hang sometime
> ---
>
> Key: HDFS-3655
> URL: https://issues.apache.org/jira/browse/HDFS-3655
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.22.0, 1.0.3, 2.0.0-alpha
> Reporter: Ming Ma
> Assignee: Xiaobo Peng
> Attachments: HDFS-3655-0.22.patch, HDFS-3655-0.22-use-join-instead-of-wait.patch
>
> This bug seems to apply to 0.22 and Hadoop 2.0. I will upload the initial fix done by my colleague Xiaobo Peng shortly (there is a logistics issue being worked on so that he can upload the patch himself later).
> recoverRbw tries to kill the old writer thread, but it holds the lock (the FSDataset monitor object) that the old writer thread is waiting on (for example, in the call to data.getTmpInputStreams).
> "DataXceiver for client /10.110.3.43:40193 [Receiving block blk_-3037542385914640638_57111747 client=DFSClient_attempt_201206021424_0001_m_000401_0]" daemon prio=10 tid=0x7facf8111800 nid=0x6b64 in Object.wait() [0x7facd1ddb000]
>    java.lang.Thread.State: WAITING (on object monitor)
>      at java.lang.Object.wait(Native Method)
>      at java.lang.Thread.join(Thread.java:1186)
>      - locked <0x0007856c1200> (a org.apache.hadoop.util.Daemon)
>      at java.lang.Thread.join(Thread.java:1239)
>      at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:158)
>      at org.apache.hadoop.hdfs.server.datanode.FSDataset.recoverRbw(FSDataset.java:1347)
>      - locked <0x0007838398c0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset)
>      at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:119)
>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlockInternal(DataXceiver.java:391)
>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:327)
>      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:405)
>      at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:344)
>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
>      at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
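The hang above has a simple shape: recoverRbw calls join() on the old writer thread while holding the FSDataset monitor, and that writer is itself blocked waiting to enter the same monitor. The sketch below uses hypothetical names (RecoverRbwSketch, datasetLock) to contrast the deadlock-prone shape with the usual remedy of stopping the writer before taking the lock; it is not the actual HDFS patch.

```java
// Minimal sketch of the HDFS-3655 deadlock shape. Hypothetical names,
// not HDFS code.
public class RecoverRbwSketch {
    private final Object datasetLock = new Object();

    /** Deadlock-prone shape: join() the writer while holding the monitor
     *  the writer may need, so neither thread can make progress. */
    public void stopWriterDeadlockProne(Thread writer) throws InterruptedException {
        synchronized (datasetLock) {
            writer.interrupt();
            writer.join();   // hangs if the writer is blocked on datasetLock
            // ... recover the replica under the lock ...
        }
    }

    /** Safer shape: stop the writer with no lock held, then re-take the
     *  monitor only for the state change. */
    public void stopWriterSafely(Thread writer) throws InterruptedException {
        writer.interrupt();
        writer.join();       // no lock held: the writer can finish and exit
        synchronized (datasetLock) {
            // ... recover the replica under the lock ...
        }
    }
}
```

The attached HDFS-3655-0.22-use-join-instead-of-wait.patch name suggests a related restructuring of the stop/join sequence; the split above only illustrates the general lock-ordering idea.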
[jira] [Commented] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465046#comment-13465046 ] Xiaobo Peng commented on HDFS-3981:
---
Thank you, Konstantin. I will follow your suggestions and create a patch for 0.23.4.

> access time is set without holding writelock in FSNamesystem
> ---
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.23.3
> Reporter: Xiaobo Peng
> Assignee: Xiaobo Peng
> Priority: Minor
>
> An incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to updating times without the write lock. In most cases this condition will force {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do not need to be updated.
[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaobo Peng updated HDFS-3981:
--
Attachment: HDFS-3981-branch-0.23.4.patch

I prepared the patch by following the instructions at http://wiki.apache.org/hadoop/HowToContribute. Must we have new unit tests for this patch? Thanks.

> access time is set without holding writelock in FSNamesystem
> ---
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.23.3
> Reporter: Xiaobo Peng
> Assignee: Xiaobo Peng
> Priority: Minor
> Attachments: HDFS-3981-branch-0.23.4.patch
>
> An incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to updating times without the write lock. In most cases this condition will force {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do not need to be updated.
[jira] [Commented] (HDFS-3981) access time is set without holding FSNamesystem write lock
[ https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685863#comment-13685863 ] Xiaobo Peng commented on HDFS-3981:
---
Thanks a lot for finishing it, Todd.

> access time is set without holding FSNamesystem write lock
> ---
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 0.23.3, 2.0.3-alpha, 0.23.5
> Reporter: Xiaobo Peng
> Assignee: Xiaobo Peng
> Fix For: 3.0.0, 2.1.0-beta
> Attachments: HDFS-3981-branch-0.23.4.patch, HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch, hdfs-3981.txt
>
> An incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to updating times without the write lock. In most cases this condition will force {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do not need to be updated.