[jira] [Created] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2012-11-21 Thread Xiaobo Peng (JIRA)
Xiaobo Peng created HDFS-4222:
-

 Summary: NN is unresponsive and lose heartbeats of DNs when Hadoop 
is configured to use LDAP and LDAP has issues
 Key: HDFS-4222
 URL: https://issues.apache.org/jira/browse/HDFS-4222
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.3
Reporter: Xiaobo Peng
Priority: Minor


For Hadoop clusters configured to access directory information via LDAP, 
FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
(including LDAP access failures caused by networking problems) while holding 
the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
heartbeats from DNs to be lost.

The one place where these FSNamesystem calls access LDAP is the instantiation 
of FSPermissionChecker, whose constructor resolves the calling user's group 
membership (an LDAP lookup when Hadoop's group mapping is LDAP-backed). The 
instantiation does not need the FSNamesystem lock, so it can be moved out of 
the lock scope. After the move, a hung DFS client will no longer affect other 
threads by hogging the single lock. This is especially helpful when we use 
separate RPC servers for ClientProtocol and DatanodeProtocol, since the 
DatanodeProtocol calls do not need to access LDAP. So even if DFS clients hang 
due to LDAP issues, the NN will still be able to process requests (including 
heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2012-11-21 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502504#comment-13502504
 ] 

Xiaobo Peng commented on HDFS-4222:
---

The following code snippets show a simple way to change FSNamesystem::renameTo 
in branch-0.23.4. Changes to other methods are similar.

 existing code
  /** Rename src to dst */
  void renameTo(String src, String dst, Options.Rename... options)
  throws IOException, UnresolvedLinkException {
...
writeLock();
try {
  renameToInternal(src, dst, options);
  if (auditLog.isInfoEnabled() && isExternalInvocation()) {
resultingStat = dir.getFileInfo(dst, false); 
  }
} finally {
  writeUnlock();
}
...
  }


  private void renameToInternal(String src, String dst,
  Options.Rename... options) throws IOException {
...
if (isPermissionEnabled) {
  checkParentAccess(src, FsAction.WRITE);
  checkAncestorAccess(dst, FsAction.WRITE);
}
...
  }


  private FSPermissionChecker checkParentAccess(String path, FsAction access
  ) throws AccessControlException, UnresolvedLinkException {
return checkPermission(path, false, null, access, null, null);
  }


  private FSPermissionChecker checkPermission(String path, boolean doCheckOwner,
  FsAction ancestorAccess, FsAction parentAccess, FsAction access,
  FsAction subAccess) throws AccessControlException, 
UnresolvedLinkException {
FSPermissionChecker pc = new FSPermissionChecker(
fsOwner.getShortUserName(), supergroup);
if (!pc.isSuper) {
  dir.waitForReady();
  readLock();
  try {
pc.checkPermission(path, dir.rootDir, doCheckOwner,
ancestorAccess, parentAccess, access, subAccess);
  } finally {
readUnlock();
  } 
}
return pc;
  }

 proposed changes
  /** Rename src to dst */
  void renameTo(String src, String dst, Options.Rename... options)
  throws IOException, UnresolvedLinkException {
...
FSPermissionChecker pc = new FSPermissionChecker(
fsOwner.getShortUserName(), supergroup);

writeLock();
try {
  renameToInternal(pc, src, dst, options);
  if (auditLog.isInfoEnabled() && isExternalInvocation()) {
resultingStat = dir.getFileInfo(dst, false); 
  }
} finally {
  writeUnlock();
}
...
  }


  private void renameToInternal(FSPermissionChecker pc, String src, String dst,
  Options.Rename... options) throws IOException {
...
if (isPermissionEnabled) {
  checkParentAccess(pc, src, FsAction.WRITE);
  checkAncestorAccess(pc, dst, FsAction.WRITE);
}
...
  }


  private FSPermissionChecker checkParentAccess(FSPermissionChecker pc, String 
path, FsAction access
  ) throws AccessControlException, UnresolvedLinkException {
return checkPermission(pc, path, false, null, access, null, null);
  }


  private FSPermissionChecker checkPermission(FSPermissionChecker pc, String 
path, boolean doCheckOwner,
  FsAction ancestorAccess, FsAction parentAccess, FsAction access,
  FsAction subAccess) throws AccessControlException, 
UnresolvedLinkException {
if (!pc.isSuper) {
  dir.waitForReady();
  readLock();
  try {
pc.checkPermission(path, dir.rootDir, doCheckOwner,
ancestorAccess, parentAccess, access, subAccess);
  } finally {
readUnlock();
  } 
}
return pc;
  }


> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use 
> LDAP and LDAP has issues
> --
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Priority: Minor
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2012-11-21 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4222:
--

Assignee: Xiaobo Peng

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use 
> LDAP and LDAP has issues
> --
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2012-11-21 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502521#comment-13502521
 ] 

Xiaobo Peng commented on HDFS-4222:
---

Sorry, the former comment did not format well. I'm trying to format it now.

The following code snippets show a simple way to change FSNamesystem::renameTo 
in branch-0.23.4. Changes to other methods are similar.

 existing code
{code:borderStyle=solid}
  /** Rename src to dst */
  void renameTo(String src, String dst, Options.Rename... options)
  throws IOException, UnresolvedLinkException {
...
writeLock();
try {
  renameToInternal(src, dst, options);
  if (auditLog.isInfoEnabled() && isExternalInvocation()) {
resultingStat = dir.getFileInfo(dst, false); 
  }
} finally {
  writeUnlock();
}
...
  }


  private void renameToInternal(String src, String dst,
  Options.Rename... options) throws IOException {
...
if (isPermissionEnabled) {
  checkParentAccess(src, FsAction.WRITE);
  checkAncestorAccess(dst, FsAction.WRITE);
}
...
  }


  private FSPermissionChecker checkParentAccess(String path, FsAction access
  ) throws AccessControlException, UnresolvedLinkException {
return checkPermission(path, false, null, access, null, null);
  }


  private FSPermissionChecker checkPermission(String path, boolean doCheckOwner,
  FsAction ancestorAccess, FsAction parentAccess, FsAction access,
  FsAction subAccess) throws AccessControlException, 
UnresolvedLinkException {
FSPermissionChecker pc = new FSPermissionChecker(
fsOwner.getShortUserName(), supergroup);
if (!pc.isSuper) {
  dir.waitForReady();
  readLock();
  try {
pc.checkPermission(path, dir.rootDir, doCheckOwner,
ancestorAccess, parentAccess, access, subAccess);
  } finally {
readUnlock();
  } 
}
return pc;
  }
{code}

 proposed changes
{code:borderStyle=solid}
  /** Rename src to dst */
  void renameTo(String src, String dst, Options.Rename... options)
  throws IOException, UnresolvedLinkException {
...
FSPermissionChecker pc = new FSPermissionChecker(
fsOwner.getShortUserName(), supergroup);

writeLock();
try {
  renameToInternal(pc, src, dst, options);
  if (auditLog.isInfoEnabled() && isExternalInvocation()) {
resultingStat = dir.getFileInfo(dst, false); 
  }
} finally {
  writeUnlock();
}
...
  }


  private void renameToInternal(FSPermissionChecker pc, String src, String dst,
  Options.Rename... options) throws IOException {
...
if (isPermissionEnabled) {
  checkParentAccess(pc, src, FsAction.WRITE);
  checkAncestorAccess(pc, dst, FsAction.WRITE);
}
...
  }


  private FSPermissionChecker checkParentAccess(FSPermissionChecker pc, String 
path, FsAction access
  ) throws AccessControlException, UnresolvedLinkException {
return checkPermission(pc, path, false, null, access, null, null);
  }


  private FSPermissionChecker checkPermission(FSPermissionChecker pc, String 
path, boolean doCheckOwner,
  FsAction ancestorAccess, FsAction parentAccess, FsAction access,
  FsAction subAccess) throws AccessControlException, 
UnresolvedLinkException {
if (!pc.isSuper) {
  dir.waitForReady();
  readLock();
  try {
pc.checkPermission(path, dir.rootDir, doCheckOwner,
ancestorAccess, parentAccess, access, subAccess);
  } finally {
readUnlock();
  } 
}
return pc;
  }
{code}

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use 
> LDAP and LDAP has issues
> --
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

[jira] [Updated] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-11 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4222:
--

Attachment: hdfs-4222-release-1.0.3.patch
hdfs-4222-branch-0.23.3.patch

I manually tested the patch on release-1.0.3. It looks like we also get some 
performance gain.

I will test hdfs-4222-branch-0.23.3.patch soon.

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: hdfs-4222-branch-0.23.3.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information by LDAP, the 
> FSNamesystem calls on behave of DFS clients might hang due to LDAP issues 
> (including LDAP access issues caused by networking issues) while holding the 
> single lock of FSNamesystem. That will result in the NN unresponsive and loss 
> of the heartbeats from DNs.
> The places LDAP got accessed by FSNamesystem calls are the instantiation of 
> FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a DFS 
> client hang will not affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol since the calls for DatanodeProtocol do not need to access 
> LDAP. So even if DFS clients hang due to LDAP issues, the NN will still be 
> able to process the requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-11 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576388#comment-13576388
 ] 

Xiaobo Peng commented on HDFS-4222:
---

Please review my patches. hdfs-4222-branch-0.23.3.patch should be easier to 
review; hdfs-4222-release-1.0.3.patch is messy since moving synchronized scopes 
results in reformatting.
 
I think we should enclose all FSPermissionChecker instantiations in "if 
(isPermissionEnabled)" blocks as follows,

{code:borderStyle=solid}
FSPermissionChecker pc = null;
if (isPermissionEnabled) {
  pc = new FSPermissionChecker(fsOwner.getShortUserName(), supergroup);
}
{code}

But I have not done so because I want to limit the impact of these changes. If 
you know it is totally safe to make such changes, please let me know. Thanks a 
lot.
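
For illustration, here is a minimal, self-contained sketch of that guarded 
pattern and of the null check every caller would then need. PermissionChecker 
and the surrounding names below are stand-ins for the real classes, not the 
actual Hadoop code.

{code:borderStyle=solid}
// Stand-in for FSPermissionChecker; the real constructor resolves the
// caller's group membership, which is the step that can block on LDAP.
class PermissionChecker {
  PermissionChecker(String owner, String supergroup) {
    // group lookup would happen here
  }
  void check(String path) {
    // path permission check would happen here
  }
}

class GuardedCheckSketch {
  private final boolean isPermissionEnabled;

  GuardedCheckSketch(boolean isPermissionEnabled) {
    this.isPermissionEnabled = isPermissionEnabled;
  }

  void rename(String src, String dst) {
    PermissionChecker pc = null;
    if (isPermissionEnabled) {
      // pay the (possibly LDAP-backed) group lookup only when checks are on
      pc = new PermissionChecker("hdfs", "supergroup");
    }
    // ... acquire the FSNamesystem write lock here ...
    if (pc != null) { // every use of pc must now tolerate null
      pc.check(src);
      pc.check(dst);
    }
    // ... perform the rename and release the lock ...
  }
}
{code}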

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: hdfs-4222-branch-0.23.3.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-15 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579683#comment-13579683
 ] 

Xiaobo Peng commented on HDFS-4222:
---

Which branches should my patches be prepared for: trunk, branch-1, branch-2, or 
branch-0.23.6? Thanks

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: hdfs-4222-branch-0.23.3.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-19 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581637#comment-13581637
 ] 

Xiaobo Peng commented on HDFS-4222:
---

Suresh, thanks a lot. I should finish what you suggested within 2 days.

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: hdfs-4222-branch-0.23.3.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-22 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4222:
--

Attachment: HDFS-4222-branch-1.patch

Many thanks to Suresh, Kihwal, and Daryn for the help. I am uploading the patch 
for branch-1. Please review.

I had to move the synchronized keyword from the method level to the block 
level, so many lines got reformatted. I reviewed my patch several times; it was 
time consuming. Thanks a lot.

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.0.0, 0.23.3, 2.0.0-alpha
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Fix For: 0.23.7, 2.0.4-beta
>
> Attachments: HDFS-4222.23.patch, hdfs-4222-branch-0.23.3.patch, 
> HDFS-4222-branch-1.patch, HDFS-4222.patch, HDFS-4222.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4222) NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-02-24 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585654#comment-13585654
 ] 

Xiaobo Peng commented on HDFS-4222:
---

Thank you very much for the quick action, Suresh.

> NN is unresponsive and lose heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> ---
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.0.0, 0.23.3, 2.0.0-alpha
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Fix For: 1.2.0, 0.23.7, 2.0.4-beta
>
> Attachments: HDFS-4222.23.patch, hdfs-4222-branch-0.23.3.patch, 
> HDFS-4222-branch-1.patch, HDFS-4222.patch, HDFS-4222.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-02-26 Thread Xiaobo Peng (JIRA)
Xiaobo Peng created HDFS-4535:
-

 Summary: port HDFS-1457 to branch-1.1
 Key: HDFS-4535
 URL: https://issues.apache.org/jira/browse/HDFS-4535
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Xiaobo Peng
Assignee: Xiaobo Peng
Priority: Minor


port HDFS-1457 (configuration option to enable limiting the transfer rate used 
when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-02-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

  Component/s: namenode
Affects Version/s: 1.1.2

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-02-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

Attachment: HDFS-4535-branch-1.1.patch

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-4535-branch-1.1.patch
>
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-02-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

Attachment: (was: HDFS-4535-branch-1.1.patch)

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-02-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

Attachment: HDFS-4535-branch-1.1.patch

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-4535-branch-1.1.patch
>
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-03-05 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

Attachment: (was: HDFS-4535-branch-1.1.patch)

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-03-05 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4535:
--

Attachment: HDFS-4535-branch-1.1.patch

I deleted the old patch since it was missing the removal of 
BlockTransferThrottler.java.

I performed manual tests similar to those for HDFS-1457 and HDFS-3515 on the 
patch. I verified that the throttle method was entered, that the throttling 
period was correct, and that the fsimage and edits log files were throttled at 
the rate bandwidthPerSec. In short, the patch works.
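
For reference, here is a minimal sketch of what such a throttle method does, 
assuming a simple fixed-period byte budget. The class and field names below are 
illustrative, not the actual branch-1.1 code.

{code:borderStyle=solid}
// Period-based bandwidth throttler sketch: senders call throttle(n) after
// sending n bytes, and are blocked once the per-period budget is exhausted.
public class ThrottlerSketch {
  private final long periodMillis = 500; // length of one throttling period
  private final long bytesPerPeriod;     // byte budget per period
  private long periodStart;
  private long bytesLeft;

  public ThrottlerSketch(long bandwidthPerSec) {
    this.bytesPerPeriod = bandwidthPerSec * periodMillis / 1000;
    this.periodStart = System.currentTimeMillis();
    this.bytesLeft = bytesPerPeriod;
  }

  /** Blocks the caller until sending numBytes fits under the configured rate. */
  public synchronized void throttle(long numBytes) {
    bytesLeft -= numBytes;
    while (bytesLeft <= 0) {
      long periodEnd = periodStart + periodMillis;
      long now = System.currentTimeMillis();
      if (now < periodEnd) {
        try {
          wait(periodEnd - now); // sleep out the rest of the current period
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      periodStart = System.currentTimeMillis();
      bytesLeft += bytesPerPeriod; // refill the budget for the new period
    }
  }
}
{code}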

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-4535-branch-1.1.patch
>
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2013-03-08 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Attachment: HDFS-3981-trunk.patch
HDFS-3981-branch-2.patch
HDFS-3981-branch-0.23.patch

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2013-03-08 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597834#comment-13597834
 ] 

Xiaobo Peng commented on HDFS-3981:
---

Attached the patches for trunk, branch-2 and 0.23. 

Note: my previous comment "Also, seems we need to release readlock before 
trying to acquire writelock..." is invalid.

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2013-03-08 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Affects Version/s: 0.23.5
   Status: Patch Available  (was: Open)

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.5, 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2013-03-10 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Priority: Major  (was: Minor)

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3, 0.23.5
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2013-03-10 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Affects Version/s: 2.0.3-alpha

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3, 2.0.3-alpha, 0.23.5
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-03-11 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599342#comment-13599342
 ] 

Xiaobo Peng commented on HDFS-4535:
---

Sure, Suresh. I thought you knew that HDFS-3515 had already ported it to 
branch-1 (I included that link when creating this issue) and that you still 
wanted a patch for branch-1.1.

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-4535-branch-1.1.patch
>
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4535) port HDFS-1457 to branch-1.1

2013-03-11 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599618#comment-13599618
 ] 

Xiaobo Peng commented on HDFS-4535:
---

No problem. I need to know those practices. Thanks, Suresh.

> port HDFS-1457 to branch-1.1
> 
>
> Key: HDFS-4535
> URL: https://issues.apache.org/jira/browse/HDFS-4535
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.1.2
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-4535-branch-1.1.patch
>
>
> port HDFS-1457 (configuration option to enable limiting the transfer rate 
> used when sending the image and edits for checkpointing) to branch-1.1

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4222) NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues

2013-03-24 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-4222:
--

Summary: NN is unresponsive and loses heartbeats of DNs when Hadoop is 
configured to use LDAP and LDAP has issues  (was: NN is unresponsive and lose 
heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues)

> NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to 
> use LDAP and LDAP has issues
> 
>
> Key: HDFS-4222
> URL: https://issues.apache.org/jira/browse/HDFS-4222
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 1.0.0, 0.23.3, 2.0.0-alpha
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Fix For: 1.2.0, 0.23.7, 2.0.5-beta
>
> Attachments: HDFS-4222.23.patch, hdfs-4222-branch-0.23.3.patch, 
> HDFS-4222-branch-1.patch, HDFS-4222.patch, HDFS-4222.patch, 
> hdfs-4222-release-1.0.3.patch
>
>
> For Hadoop clusters configured to access directory information via LDAP, 
> FSNamesystem calls made on behalf of DFS clients might hang due to LDAP issues 
> (including LDAP access failures caused by networking problems) while holding 
> the single FSNamesystem lock. That leaves the NN unresponsive and causes the 
> heartbeats from DNs to be lost.
> The one place where these FSNamesystem calls access LDAP is the instantiation 
> of FSPermissionChecker, which could be moved out of the lock scope since the 
> instantiation does not need the FSNamesystem lock. After the move, a hung DFS 
> client will no longer affect other threads by hogging the single lock. This is 
> especially helpful when we use separate RPC servers for ClientProtocol and 
> DatanodeProtocol, since the DatanodeProtocol calls do not need to access LDAP. 
> So even if DFS clients hang due to LDAP issues, the NN will still be able to 
> process requests (including heartbeats) from DNs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding FSNamesystem write lock

2013-03-24 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Summary: access time is set without holding FSNamesystem write lock  (was: 
access time is set without holding writelock in FSNamesystem)

> access time is set without holding FSNamesystem write lock
> --
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3, 2.0.3-alpha, 0.23.5
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to take the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3655) datanode recoverRbw could hang sometimes

2012-07-13 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3655:
--

Attachment: HDFS-3655-0.22-use-join-instead-of-wait.patch

HDFS-3655-0.22.patch keeps the code that waits for termination of the old 
writer in the stopWriter method of class ReplicaInPipeline. We pay the price of 
passing FSDataset to it in order to release the monitor while waiting.

HDFS-3655-0.22-use-join-instead-of-wait.patch moves the join/wait code from 
ReplicaInPipeline to FSDataset. It looks cleaner, but we have to refactor some 
FSDataset code to avoid duplicating it.
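
To make the monitor problem concrete, here is a small self-contained sketch of 
why joining the old writer while holding the FSDataset monitor can hang, and 
the shape of the fix. The classes and methods are made up for the example; this 
is not the patch itself.

{code:borderStyle=solid}
// Illustration of the HDFS-3655 hazard with stand-in names.
public class StopWriterSketch {
  private final Object monitor = new Object(); // stands in for the FSDataset lock
  private Thread writer;

  void startWriter() {
    writer = new Thread(new Runnable() {
      public void run() {
        synchronized (monitor) {
          // the old writer also needs the monitor, e.g. like the call to
          // data.getTmpInputStreams() in the stack trace below
        }
      }
    });
    writer.start();
  }

  /** Deadlock-prone: joins the writer while holding the monitor it waits on. */
  void stopWriterBad() throws InterruptedException {
    synchronized (monitor) {
      writer.interrupt();
      writer.join(); // the writer can never enter the monitor, so this hangs
    }
  }

  /** Fixed shape: detach state under the monitor, join outside it. */
  void stopWriterGood() throws InterruptedException {
    Thread w;
    synchronized (monitor) {
      w = writer; // read/clear shared state while locked
      writer = null;
    }
    if (w != null) {
      w.interrupt();
      w.join(); // no monitor held here, so the writer can finish and exit
    }
  }
}
{code}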

> datanode recoverRbw could hang sometimes
> ---
>
> Key: HDFS-3655
> URL: https://issues.apache.org/jira/browse/HDFS-3655
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.22.0, 1.0.3, 2.0.1-alpha
>Reporter: Ming Ma
> Fix For: 0.22.1
>
> Attachments: HDFS-3655-0.22-use-join-instead-of-wait.patch, 
> HDFS-3655-0.22.patch
>
>
> This bug seems to apply to 0.22 and Hadoop 2.0. I will upload the initial fix 
> done by my colleague Xiaobo Peng shortly (there is a logistics issue being 
> worked on so that he can upload patches himself later).
> recoverRbw tries to kill the old writer thread, but it takes the lock (the 
> FSDataset monitor object) that the old writer thread is waiting on (for 
> example, in the call to data.getTmpInputStreams).
> "DataXceiver for client /10.110.3.43:40193 [Receiving block 
> blk_-3037542385914640638_57111747 
> client=DFSClient_attempt_201206021424_0001_m_000401_0]" daemon prio=10 
> tid=0x7facf8111800 nid=0x6b64 in Object.wait() [0x7facd1ddb000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1186)
> - locked <0x0007856c1200> (a org.apache.hadoop.util.Daemon)
> at java.lang.Thread.join(Thread.java:1239)
> at 
> org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:158)
> at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset.recoverRbw(FSDataset.java:1347)
> - locked <0x0007838398c0> (a 
> org.apache.hadoop.hdfs.server.datanode.FSDataset)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:119)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlockInternal(DataXceiver.java:391)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:327)
> at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:405)
> at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:344)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
> at java.lang.Thread.run(Thread.java:662)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)
Xiaobo Peng created HDFS-3981:
-

 Summary: access time is set without holding writelock in 
FSNamesystem
 Key: HDFS-3981
 URL: https://issues.apache.org/jira/browse/HDFS-3981
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.1-alpha
Reporter: Xiaobo Peng
Priority: Minor




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks 
like we need to change the code to 

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}


Also, seems we need to release readlock before trying to acquire writelock. 
Otherwise, we might end up with still holding readlock after the function call.

The code in branch-2.0.1-alpha
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }


> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 2.0.1-alpha
>Reporter: Xiaobo Peng
>Priority: Minor
>
> If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, 
> we will call dir.setTimes(src, inode, -1, now, false) without writelock. So 
> looks like we need to change the code to 
> if (doAccessTime && isAccessTimeSupported()) {
>   if (now > inode.getAccessTime() + getAccessTimePrecision()) {
> // if we have to set access time but we only have the readlock, 
> then
> // restart this entire operation with the writeLock.
> if (attempt == 0) {
>   continue;
> }
> dir.setTimes(src, inode, -1, now, false);
>   }
> }
> Also, seems we need to release readlock before trying to acquire writelock. 
> Otherwise, we might end up with still holding readlock after the function 
> call.
> The code in branch-2.0.1-alpha
>   private LocatedBlocks getBlockLocationsUpdateTimes(String src,
>long offset, 
>long length,
>boolean doAccessTime, 
>boolean needBlockToken)
>   throws FileNotFoundException, UnresolvedLinkException, IOException {
> for (int attempt = 0; attempt < 2; attempt++) {
>   if (attempt == 0) { // first attempt is with readlock
> readLock();
>   }  else { // second attempt is with  write lock
> writeLock(); // writelock is needed to set accesstime
>   }
>   try {
> checkOperation(OperationCategory.READ);
> // if the namenode

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks 
like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, seems we need to release readlock before trying to acquire writelock. 
Otherwise, we might end up with still holding readlock after the function call.

The code in branch-2.0.1-alpha
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }


  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without writelock. So looks 
like we need to change the code to 

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}


Also, seems we need to release readlock before trying to acquire writelock. 
Otherwise, we might end up with still holding readlock after the function call.

The code in branch-2.0.1-alpha
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) 

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So it 
looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{noformat}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{noformat}
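
For concreteness, this is how the whole method might look with the proposed 
condition change folded in; a sketch against the branch-2.0.1-alpha code above, 
not a reviewed patch:
{code:title=getBlockLocationsUpdateTimes with the proposed change (sketch)|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
      long offset,
      long length,
      boolean doAccessTime,
      boolean needBlockToken)
      throws FileNotFoundException, UnresolvedLinkException, IOException {

    for (int attempt = 0; attempt < 2; attempt++) {
      if (attempt == 0) { // first attempt is with the read lock
        readLock();
      } else { // second attempt is with the write lock
        writeLock(); // the write lock is needed to set the access time
      }
      try {
        checkOperation(OperationCategory.READ);

        // if the namenode is in safe mode, do not update the access time
        if (isInSafeMode()) {
          doAccessTime = false;
        }

        long now = now();
        INodeFile inode = dir.getFileINode(src);
        if (inode == null) {
          throw new FileNotFoundException("File does not exist: " + src);
        }
        assert !inode.isLink();
        if (doAccessTime && isAccessTimeSupported()
            && now > inode.getAccessTime() + getAccessTimePrecision()) {
          // The access time is stale enough to need an update, which
          // requires the write lock; on the read-locked first attempt,
          // restart the whole operation (finally releases the read lock).
          if (attempt == 0) {
            continue;
          }
          dir.setTimes(src, inode, -1, now, false);
        }
        return blockManager.createLocatedBlocks(inode.getBlocks(),
            inode.computeFileSize(false), inode.isUnderConstruction(),
            offset, length, needBlockToken);
      } finally {
        if (attempt == 0) {
          readUnlock();
        } else {
          writeUnlock();
        }
      }
    }
    return null; // can never be reached
  }
{code}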

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So it 
looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The code in branch-2.0.1-alpha
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (n

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}
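
To see the race concretely: under the shared read lock, two readers can both 
pass the staleness check and then call dir.setTimes() in either order. A small 
illustration with hypothetical values:
{code}
// Hypothetical values, for illustration only (not HDFS code).
public class AccessTimeRaceDemo {
  public static void main(String[] args) {
    long precision = 3600000L;          // e.g. a one-hour precision
    long stored    = 1348600000000L;    // access time stored in the inode
    long nowA      = stored + 3700000L; // reader A samples the clock
    long nowB      = stored + 3800000L; // reader B samples it a bit later

    // Both readers see a stale access time, so both decide to update it.
    System.out.println(nowA > stored + precision); // true
    System.out.println(nowB > stored + precision); // true

    // Under the shared read lock the two dir.setTimes() calls can run in
    // either order; if A's update lands after B's, the inode keeps the
    // earlier value nowA instead of the later nowB.
  }
}
{code}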

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{noformat}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{noformat}

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So it 
looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{noformat}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
  

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{code}

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{noformat}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFil

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{code}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{code}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{code}

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{noformat}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{noformat}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
  

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{code}

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{code}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{code}

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{code}
if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{code}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INode

[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-26 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Description: 
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{code}

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{code}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns. Or we could simply remove the condition "if (attempt == 0)" for 
readUnlock(), i.e. readUnlock() should be called even when attempt is 1.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then do not update access time
if (isInSafeMode()) {
  doAccessTime = false;
}

long now = now();
INodeFile inode = dir.getFileINode(src);
if (inode == null) {
  throw new FileNotFoundException("File does not exist: " + src);
}
assert !inode.isLink();
if (doAccessTime && isAccessTimeSupported()) {
  if (now <= inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
  }
  dir.setTimes(src, inode, -1, now, false);
}
return blockManager.createLocatedBlocks(inode.getBlocks(),
inode.computeFileSize(false), inode.isUnderConstruction(),
offset, length, needBlockToken);
  } finally {
if (attempt == 0) {
  readUnlock();
} else {
  writeUnlock();
}
  }
}
return null; // can never reach here
  }
{code}

  was:
If now > inode.getAccessTime() + getAccessTimePrecision() when attempt == 0, we 
will call dir.setTimes(src, inode, -1, now, false) without the writelock. So there 
will be races, and an earlier access time might overwrite a later access time. 
It looks like we need to change the code to 
{code}

if (doAccessTime && isAccessTimeSupported()) {
  if (now > inode.getAccessTime() + getAccessTimePrecision()) {
// if we have to set access time but we only have the readlock, then
// restart this entire operation with the writeLock.
if (attempt == 0) {
  continue;
}
dir.setTimes(src, inode, -1, now, false);
  }
}
{code}

Also, it seems we need to release the read lock before trying to acquire the 
write lock. Otherwise, we might end up still holding the read lock after the 
method returns.

The following code is from branch-2.0.1-alpha
{code:title=FSNamesystem.java|borderStyle=solid}
  private LocatedBlocks getBlockLocationsUpdateTimes(String src,
   long offset, 
   long length,
   boolean doAccessTime, 
   boolean needBlockToken)
  throws FileNotFoundException, UnresolvedLinkException, IOException {

for (int attempt = 0; attempt < 2; attempt++) {
  if (attempt == 0) { // first attempt is with readlock
readLock();
  }  else { // second attempt is with  write lock
writeLock(); // writelock is needed to set accesstime
  }
  try {
checkOperation(OperationCategory.READ);

// if the namenode is in safemode, then 

[jira] [Updated] (HDFS-3655) Datanode recoverRbw could hang sometime

2012-09-27 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3655:
--

Assignee: Xiaobo Peng

> Datanode recoverRbw could hang sometime
> ---
>
> Key: HDFS-3655
> URL: https://issues.apache.org/jira/browse/HDFS-3655
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.22.0, 1.0.3, 2.0.0-alpha
>Reporter: Ming Ma
>Assignee: Xiaobo Peng
> Attachments: HDFS-3655-0.22.patch, 
> HDFS-3655-0.22-use-join-instead-of-wait.patch
>
>
> This bug seems to apply to 0.22 and Hadoop 2.0. I will upload the initial fix 
> done by my colleague Xiaobo Peng shortly (there is a logistics issue being 
> worked out so that he can upload patches himself later).
> recoverRbw tries to kill the old writer thread, but it takes the lock (the 
> FSDataset monitor object) that the old writer thread is waiting on (for 
> example, in the call to data.getTmpInputStreams); see the sketch after the 
> stack trace below.
> "DataXceiver for client /10.110.3.43:40193 [Receiving block 
> blk_-3037542385914640638_57111747 
> client=DFSClient_attempt_201206021424_0001_m_000401_0]" daemon prio=10 
> tid=0x7facf8111800 nid=0x6b64 in Object.wait() [0x7facd1ddb000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1186)
> - locked <0x0007856c1200> (a org.apache.hadoop.util.Daemon)
> at java.lang.Thread.join(Thread.java:1239)
> at 
> org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:158)
> at 
> org.apache.hadoop.hdfs.server.datanode.FSDataset.recoverRbw(FSDataset.java:1347)
> - locked <0x0007838398c0> (a 
> org.apache.hadoop.hdfs.server.datanode.FSDataset)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:119)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlockInternal(DataXceiver.java:391)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:327)
> at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:405)
> at 
> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:344)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
> at java.lang.Thread.run(Thread.java:662)
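
The shape above is the classic join-while-holding-a-lock deadlock: the 
recovering thread joins the old writer while holding the very monitor the 
writer needs in order to finish. A minimal, self-contained sketch of the 
pattern (illustrative names, not DataNode code):
{code}
// Minimal sketch of the deadlock pattern described above (illustrative
// names, not DataNode code). The main thread holds 'monitor' and joins
// the writer, while the writer is blocked entering 'monitor'; running
// this program hangs forever.
public class JoinUnderLockDemo {
  private static final Object monitor = new Object();

  public static void main(String[] args) throws InterruptedException {
    Thread writer = new Thread(() -> {
      synchronized (monitor) { // stands in for data.getTmpInputStreams
        System.out.println("writer entered the monitor");
      }
    });

    synchronized (monitor) {   // stands in for FSDataset.recoverRbw
      writer.start();
      Thread.sleep(100);       // let the writer block on the monitor
      writer.join();           // waits for a thread that waits for us
    }
  }
}
{code}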

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-09-27 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465046#comment-13465046
 ] 

Xiaobo Peng commented on HDFS-3981:
---

Thank you, Konstantin. I will follow your suggestions and create a patch for 
0.23.4.

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3981) access time is set without holding writelock in FSNamesystem

2012-10-16 Thread Xiaobo Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaobo Peng updated HDFS-3981:
--

Attachment: HDFS-3981-branch-0.23.4.patch

I prepared the patch by following the instructions at 
http://wiki.apache.org/hadoop/HowToContribute.

Do we need new unit tests for this patch? Thanks.

> access time is set without holding writelock in FSNamesystem
> 
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.3
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
>Priority: Minor
> Attachments: HDFS-3981-branch-0.23.4.patch
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3981) access time is set without holding FSNamesystem write lock

2013-06-17 Thread Xiaobo Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685863#comment-13685863
 ] 

Xiaobo Peng commented on HDFS-3981:
---

Thanks a lot for finishing it, Todd.

> access time is set without holding FSNamesystem write lock
> --
>
> Key: HDFS-3981
> URL: https://issues.apache.org/jira/browse/HDFS-3981
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.23.3, 2.0.3-alpha, 0.23.5
>Reporter: Xiaobo Peng
>Assignee: Xiaobo Peng
> Fix For: 3.0.0, 2.1.0-beta
>
> Attachments: HDFS-3981-branch-0.23.4.patch, 
> HDFS-3981-branch-0.23.patch, HDFS-3981-branch-2.patch, HDFS-3981-trunk.patch, 
> hdfs-3981.txt
>
>
> Incorrect condition in {{FSNamesystem.getBlockLocations()}} can lead to 
> updating times without the write lock. In most cases this condition will force 
> {{FSNamesystem.getBlockLocations()}} to hold the write lock, even if times do 
> not need to be updated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira