[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-11-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212757#comment-14212757
 ] 

Yongjun Zhang commented on HDFS-4239:
-

Hi [~qwertymaniac],

My bad that I did not notice your earlier comment 
{quote}
I just noticed Steve's comment referring the same - should've gone through 
properly before spending google cycles. I feel HDFS-1362 implemented would 
solve half of this - and the other half would be to make the removals 
automatic. Right now the checkDiskError does not eject if it's slow - as long as 
it succeeds, which would have to be done via this JIRA I think. The re-add 
would be possible via HDFS-1362.
{quote}
until now. So we need to use the functionality provided by HDFS-1362 to 
automatically remove a sick disk. It seems the original goal of HDFS-4239 is 
the same as HDFS-1362 (right?), so should we create a new JIRA for 
automatically removing a sick disk?

Thanks.


 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Yongjun Zhang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it to stop 
 using a disk an operator has designated 'bad'?  This would be like option #2 
 above minus the need to stop and restart the datanode.  Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to start using a disk 
 again after it's been replaced.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123701#comment-14123701
 ] 

Yongjun Zhang commented on HDFS-4239:
-

Hi [~jxiang], thanks for your earlier work on this issue. I wonder if you will 
have time to continue working on it? If not, do you mind if I take it over? 
Thanks.




[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123715#comment-14123715
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Sure. Assigned it to you.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-09-05 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123718#comment-14123718
 ] 

Yongjun Zhang commented on HDFS-4239:
-

Thanks Jimmy.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-04-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972160#comment-13972160
 ] 

Colin Patrick McCabe commented on HDFS-4239:


Thanks for picking this up again, [~jxiang].  {{hdfs-4239_v5.patch}} did not 
apply cleanly to trunk for me; can you re-generate this patch?

{{DataNode#checkSuperuserPrivilege}}: As Yongjun commented, it's unfortunate 
that you are skipping this check when Kerberos is disabled.  This will make 
unit testing the permission-denied case harder than it has to be.  I suggest 
using {{FSPermissionChecker}}.  Its constructor takes three arguments: the 
current UGI, the superuser, and the supergroup, and by calling 
{{FSPermissionChecker#checkSuperuserPrivilege}} you can figure out whether the 
caller has superuser privileges.  I realize you were following the pattern in 
{{checkBlockLocalPathAccess}}, but that code is legacy now (only used in the 
legacy local block reader).
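For illustration, here is a minimal sketch of the kind of check being suggested. 
It is only a sketch: it does not call {{FSPermissionChecker}} directly (its 
visibility from the datanode code is not verified here), but hand-codes the 
usual superuser rule (the caller is the configured superuser or belongs to the 
supergroup), which is what {{FSPermissionChecker#checkSuperuserPrivilege}} 
enforces. The class and field names are hypothetical, not the patch's code.

{code}
import java.io.IOException;

import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;

/**
 * Illustrative sketch only: enforce the superuser rule regardless of whether
 * Kerberos (or any other auth method) is enabled.
 */
class SuperuserCheckSketch {
  private final String superUser;   // e.g. the user the DataNode runs as
  private final String superGroup;  // e.g. dfs.permissions.superusergroup

  SuperuserCheckSketch(String superUser, String superGroup) {
    this.superUser = superUser;
    this.superGroup = superGroup;
  }

  /** Throws AccessControlException if the caller is not a superuser. */
  void checkSuperuserPrivilege(String method) throws IOException {
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    if (caller.getShortUserName().equals(superUser)) {
      return;
    }
    for (String group : caller.getGroupNames()) {
      if (group.equals(superGroup)) {
        return;
      }
    }
    throw new AccessControlException(
        "Superuser privilege is required to call " + method);
  }
}
{code}

A check written this way can be unit tested by running the call under a 
non-superuser UGI, whether or not Kerberos is enabled.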

{{VERIFICATION_PREFIX}}: as Yongjun commented, I think you meant to keep the 
{{private}} modifier here.

{{DataStorage.java}}: the only change here is a whitespace change.  Seems like 
you meant to cut this out of the diff.

{{BlockPoolSliceScanner}}: I don't think the file reopening belongs in the 
slice scanner.  {{LogFileHandler}} is actually a pretty abstract interface 
which is unconcerned with the details of where things are in files.  We get it 
out of {{FsDatasetSpi#createRollingLogs}}.  Since the normal file rolling stuff 
is handled in {{RollingLogsImpl}}, that's where the re-opening of verification 
logs should be handled as well.  It's the same kind of thing.

Another way of thinking about it is: while it's true that currently the 
verification logs reside in files somewhere in directories on disk, this is an 
implementation detail.  You can easily imagine a different implementation of 
{{FsDatasetSpi}} where there are no on-disk directories, or where the 
verification log is kept in memory.

Also, I noticed you were trying to copy the old verification log from the 
failed disk in your {{relocateVerificationLogs}} function.  This is not a good 
idea, since reading from a failed disk may hang or misbehave, causing issues 
with the DataNode's threads.  We want to treat a failed disk as radioactive and 
not do any more reads or writes from there if we can help it.

{code}
@@ -930,12 +940,15 @@ synchronized Packet waitForAckHead(long seqno) throws 
InterruptedException {
  */
 @Override
 public synchronized void close() {
-  while (isRunning() && ackQueue.size() != 0) {
-try {
-  wait();
-} catch (InterruptedException e) {
-  running = false;
-  Thread.currentThread().interrupt();
+  try {
+while (isRunning() && ackQueue.size() != 0) {
+  wait(runningCheckInterval);
+}
+  } catch (InterruptedException e) {
+Thread.currentThread().interrupt();
+  } catch (IOException ioe) {
+if(LOG.isDebugEnabled()) {
+  LOG.debug(myString, ioe);
{code}

Why are we catching and swallowing {{IOException}} here?

{code}
+  @Override //FsDatasetSpi
+  public FsVolumeImpl markDownVolume(File location
{code}

I don't like the assumption that our volumes are on files here.  This may not 
be true for all {{FsDatasetSpi}} implementations.  Instead, how about changing 
this to take a URI?

Similarly, let's change the user API (in DistributedFileSystem, etc.) to take a 
URI as well, all the way up to the user level.  So the user can ask us to mark 
down {{file:///data/1}} or something like that.  That way, when we later 
implement volumes which aren't on files, we can easily refer to them.

Also, how do you feel about {{disableVolume}} instead of {{markDownVolume}}?  
markdown just makes me think of the markup language (maybe that's just me?)

I think we need a way of listing all the volume URIs on a particular DN, and a 
way of listing all the currently disabled volume URIs on a DN.  Otherwise it 
makes life hard for sysadmins.
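As a sketch of what that URI-based, admin-facing surface could look like (the 
interface and method names here are hypothetical, not the patch's actual API):

{code}
import java.io.IOException;
import java.net.URI;
import java.util.List;

/**
 * Hypothetical sketch of the volume-administration API discussed above.
 * Volumes are identified by URI (e.g. file:///data/1) rather than File so
 * that non-file FsDatasetSpi implementations can participate.
 */
interface VolumeAdminSketch {
  /** Stop using the given volume; no new blocks are placed on it. */
  void disableVolume(URI volume) throws IOException;

  /** Resume using a previously disabled (e.g. replaced) volume. */
  void enableVolume(URI volume) throws IOException;

  /** All volume URIs configured on this DataNode. */
  List<URI> listVolumes() throws IOException;

  /** The subset of volumes currently disabled by an operator. */
  List<URI> listDisabledVolumes() throws IOException;
}
{code}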

Another comment: we need to notify outstanding {{BlockReaderLocal}} instances 
that the volume is disabled.  We can do this by using the shared memory 
segment.  This should avoid the problem that JD pointed out here:

bq. We tried it today, it worked fine. We did encounter an interesting problem 
tho, the region server on the same node continued to use that disk directly 
since it's configured with local reads.

If you want to do this in a follow-up change, could you file a JIRA for it?


[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901848#comment-13901848
 ] 

Yongjun Zhang commented on HDFS-4239:
-

Hi Jimmy,

Thanks for the good work. I went through patch v4 and it looks good to me.  I 
only have a few comments, mostly cosmetic things, and I may be wrong myself.

1. In DataNode.java:

  private void checkSuperuserPrivilege(String method) throws IOException {
    if (checkKerberosAuthMethod(method)) {
      ...
    }
  }

The above function checks superuser privilege only when Kerberos authentication
is enabled. This seems not restrictive enough to me.  However, I saw that 
existing code in the same file does the same thing, such as:

  private void checkBlockLocalPathAccess() throws IOException {
    checkKerberosAuthMethod("getBlockLocalPathInfo()");
    ...
  }

So I'm actually not sure; please correct me if I'm wrong. For example, I found 
some other existing code that checks superuser privilege, such as

./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
  public void checkSuperuserPrivilege()

which seems to do things differently.

2. In DataNode.java:

 /** Ensure that authentication method is kerberos */
 boolean checkKerberosAuthMethod(String msg) throws IOException {

I suggest changing it (both the comment and the method name) to something like:
 /** Check whether the authentication method is Kerberos; return true 
   * if so and false otherwise 
   */
 boolean isKerberosAuthMethodEnabled(...)...


3. In BlockPoolSliceScanner.java:

  private static final String VERIFICATION_PREFIX = 
      "dncp_block_verification.log";

  You removed private from this field; I wonder if that's what you intended. 
  It seems it should stay private.

4. In DataBlockScanner.java:

  void volumeMarkedDown(FsVolumeSpi vol) throws IOException {

I wonder whether we can change it to 

  /**
   * Relocate verification logs for a volume that's marked down.
   * ...
   */ 
  void relocateVerificationLogs(FsVolumeSpi volMarkedDown) ...

to make it clearer?

5. In BlockPoolSliceScanner.java:

  void relocateVerificationLogs(FsVolumeSpi vol) throws IOException {
    if (verificationLog != null) {
      // block of code
    }
    // no code here
  }

If the block of code is large, it would be helpful to change it to

  void relocateVerificationLogs(FsVolumeSpi vol) throws IOException {
    if (verificationLog == null) {
      return;
    }
    // block of code
  }

This removes one level of indentation and makes the code easier to read.

Thanks.





[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-10 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897069#comment-13897069
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Ping. Can anyone take a look at patch v4? Thanks.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891534#comment-13891534
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12626985/hdfs-4239_v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6027//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6027//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891541#comment-13891541
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12626985/hdfs-4239_v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6028//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6028//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889769#comment-13889769
 ] 

Todd Lipcon commented on HDFS-4239:
---

Couple of quick high-level comments:

- What's the authorization requirement here? The patch doesn't seem to do any 
access control, but I wouldn't want a non-admin to make these changes.
- It seems odd that the "mark this volume dead" operation is non-persistent 
across restarts. If a disk is dying, I'm nervous that someone would mark it bad, 
and then a later rolling restart of the service would revive it. Something like 
a config file of blacklisted volume IDs and a 'refresh' RPC might be more 
resistant to this type of issue -- or a marker file like disallow_this_volume 
in the storage directory?
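A minimal sketch of the marker-file variant mentioned above (the file name is 
taken from the comment; the class and method names are hypothetical, not part 
of any patch):

{code}
import java.io.File;
import java.io.IOException;

/**
 * Sketch: when an operator disables a volume, drop a "disallow_this_volume"
 * marker in its storage directory; at startup, skip any storage directory
 * carrying the marker so a rolling restart does not silently revive a disk
 * that was marked bad.
 */
class VolumeBlacklistSketch {
  static final String MARKER = "disallow_this_volume";

  /** Persist the operator's decision for this storage directory. */
  static void markDisallowed(File storageDir) throws IOException {
    File marker = new File(storageDir, MARKER);
    if (!marker.createNewFile() && !marker.exists()) {
      throw new IOException("Could not create " + marker);
    }
  }

  /** Called while scanning the configured data directories at startup. */
  static boolean isDisallowed(File storageDir) {
    return new File(storageDir, MARKER).exists();
  }

  /** Clear the marker after the disk has been replaced. */
  static void allowAgain(File storageDir) throws IOException {
    File marker = new File(storageDir, MARKER);
    if (marker.exists() && !marker.delete()) {
      throw new IOException("Could not delete " + marker);
    }
  }
}
{code}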



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889841#comment-13889841
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Good point. Let me handle the access control in the next patch. As for the 
blacklisted volume IDs, can we handle that in a separate issue?



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-02-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890401#comment-13890401
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12626799/hdfs-4239_v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure
  org.apache.hadoop.hdfs.server.namenode.TestAuditLogs

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6020//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6020//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888051#comment-13888051
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12626054/hdfs-4239_v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestPersistBlocks

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6000//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6000//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-30 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887291#comment-13887291
 ] 

Jimmy Xiang commented on HDFS-4239:
---

This test failure is not related.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886014#comment-13886014
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12625766/hdfs-4239_v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestInjectionForSimulatedStorage
  org.apache.hadoop.hdfs.TestPread
  org.apache.hadoop.hdfs.TestReplication
  org.apache.hadoop.hdfs.TestSmallBlock
  org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
  org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics
  org.apache.hadoop.hdfs.server.balancer.TestBalancerWithEncryptedTransfer
  org.apache.hadoop.hdfs.TestFileCreation
  org.apache.hadoop.hdfs.TestSetrepIncreasing
  org.apache.hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes
  org.apache.hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes
  org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS
  org.apache.hadoop.hdfs.server.namenode.TestFileLimit
  org.apache.hadoop.hdfs.server.balancer.TestBalancer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5983//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5983//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886294#comment-13886294
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12626054/hdfs-4239_v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestAuditLogs

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5989//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5989//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-28 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884897#comment-13884897
 ] 

Jimmy Xiang commented on HDFS-4239:
---

Cool, I agree. Attached v2, which releases all references to the volume that is 
marked down. In my test, I don't see any open file descriptors pointing to the 
marked-down volume.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-27 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883728#comment-13883728
 ] 

Jimmy Xiang commented on HDFS-4239:
---

We can release the lock after the volume is marked down. No new blocks will be 
allocated to this volume. What about the blocks on this volume that are still 
being written? The writes could take forever, for example for a rarely updated 
HLog file. I was thinking of failing the write pipeline so that the client can 
set up another pipeline. Any problem with that?
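For concreteness, a toy sketch of that approach (all names hypothetical, not 
the patch's code): writes that touch a marked-down volume fail fast with an 
IOException, and the existing client-side pipeline recovery is then expected 
to rebuild the pipeline around other datanodes or volumes.

{code}
import java.io.IOException;

/** Toy sketch of failing in-flight writes on a marked-down volume. */
class MarkedDownVolumeSketch {
  private volatile boolean markedDown = false;

  void markDown() {
    markedDown = true;
  }

  /** Called on the block-receive path before writing packet data. */
  void checkWritable() throws IOException {
    if (markedDown) {
      // Surfacing an error here aborts this replica's write; the client is
      // expected to set up a new pipeline that excludes this volume.
      throw new IOException("Volume has been marked down by the operator");
    }
  }
}
{code}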



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-27 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883784#comment-13883784
 ] 

stack commented on HDFS-4239:
-

I think throwing an exception is the right thing to do.  The volume is going 
away at the operator's volition.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875263#comment-13875263
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12623271/hdfs-4239.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5913//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5913//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-17 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875479#comment-13875479
 ] 

stack commented on HDFS-4239:
-

This is lovely, [~jxiang].  The addition of the System.err.println usage line 
for -markDownVolume datanode-host:port <location> is particularly so.  In your 
testing, did you see whether the volume's 'in_use.lock' was cleaned up after 
the volume was removed?  (See the comment by Andy above -- 
https://issues.apache.org/jira/browse/HDFS-4239?focusedCommentId=13506791&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13506791)
  Thanks.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-17 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875538#comment-13875538
 ] 

Jimmy Xiang commented on HDFS-4239:
---

File 'in_use.lock' is still there after the volume is marked down.  Let me take 
another look.
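For reference, the in_use.lock file is held through a java.nio file lock. A 
minimal sketch of what releasing it when the volume is dropped might look like 
(hypothetical names, not the actual Storage code):

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

/**
 * Sketch: release the lock on a storage directory's in_use.lock, close the
 * underlying channel, and delete the lock file so the disk can be unmounted
 * and the directory reused later.
 */
class StorageLockSketch {
  private final File lockFile;
  private RandomAccessFile lockRaf;
  private FileLock lock;

  StorageLockSketch(File storageDir) {
    this.lockFile = new File(storageDir, "in_use.lock");
  }

  void acquire() throws IOException {
    lockRaf = new RandomAccessFile(lockFile, "rws");
    lock = lockRaf.getChannel().tryLock();
    if (lock == null) {
      throw new IOException("Directory already locked: " + lockFile);
    }
  }

  /** Called when the volume is marked down and removed from service. */
  void release() throws IOException {
    if (lock != null) {
      lock.release();
    }
    if (lockRaf != null) {
      lockRaf.close();
    }
    if (lockFile.exists() && !lockFile.delete()) {
      throw new IOException("Could not delete " + lockFile);
    }
  }
}
{code}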



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-01-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872990#comment-13872990
 ] 

Hadoop QA commented on HDFS-4239:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12623271/hdfs-4239.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5891//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5891//console

This message is automatically generated.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2013-03-22 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610080#comment-13610080
 ] 

Harsh J commented on HDFS-4239:
---

HDFS-1362 matches this JIRA's needs I think.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2013-03-22 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610086#comment-13610086
 ] 

Harsh J commented on HDFS-4239:
---

I just noticed Steve's comment referring to the same - I should've gone through 
the thread properly before spending Google cycles. I feel HDFS-1362, once 
implemented, would solve half of this - and the other half would be to make the 
removals automatic. Right now checkDiskError does not eject a disk just for 
being slow, as long as operations on it succeed; that part would have to be 
done via this JIRA, I think. The re-add would be possible via HDFS-1362.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-12-06 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525888#comment-13525888
 ] 

Andy Isaacson commented on HDFS-4239:
-

I created HDFS-4284 to track the BlockReaderLocal issue.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-12-05 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510939#comment-13510939
 ] 

Jean-Daniel Cryans commented on HDFS-4239:
--

bq. That should work just fine, if the HDFS config is compatible with the new 
set of available directories. 

We tried it today and it worked fine. We did encounter an interesting problem, 
though: the region server on the same node continued to use that disk directly, 
since it's configured with local reads.

To rephrase that, a long-running BlockReaderLocal will ride over local DN 
restarts and disk ejections. We had to drain the RS of all its regions in 
order to stop it from using the bad disk.
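
For anyone hitting the same thing, here is a rough sketch of the drain step, 
assuming a stock HBase layout where {{bin/graceful_stop.sh}} is available; the 
host name and install path are illustrative, not from this thread:

{code}
# Hedged sketch: move regions off the RegionServer on the affected host so it
# drops any long-lived BlockReaderLocal file descriptors on the bad disk,
# then restart it and load the regions back.
cd /usr/lib/hbase                                   # illustrative install path
./bin/graceful_stop.sh --restart --reload dn05.example.com
{code}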



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-12-04 Thread Tibor Vass (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510155#comment-13510155
 ] 

Tibor Vass commented on HDFS-4239:
--

What about stopping the datanode, chmod 0-ing, and restarting the datanode?



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-12-04 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510181#comment-13510181
 ] 

Andy Isaacson commented on HDFS-4239:
-

bq. What about stopping the datanode, chmod 0-ing, and restarting the datanode?

That should work just fine, if the HDFS config is compatible with the new set 
of available directories.  That means either ensuring that the number of 
inaccessible datadirs does not exceed the 
{{dfs.datanode.failed.volumes.tolerated}} value, or removing the inaccessible 
datadir from {{dfs.data.dir}}.
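
In other words, something like the following should be safe; the paths and the 
daemon script location are illustrative, and it assumes either a tolerated 
failed-volume count of at least 1 or that the datadir was dropped from the 
config first:

{code}
# Hedged sketch of the stop / chmod / restart approach discussed above.
# /data/5/datadir is illustrative; hadoop-daemon.sh lives under sbin/ or bin/
# depending on the release.
hadoop-daemon.sh stop datanode
chmod 000 /data/5/datadir      # make the sick datadir inaccessible to the DN
# Either keep dfs.datanode.failed.volumes.tolerated >= 1 in hdfs-site.xml, or
# remove /data/5/datadir from dfs.data.dir before restarting.
hadoop-daemon.sh start datanode
{code}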



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506719#comment-13506719
 ] 

Todd Lipcon commented on HDFS-4239:
---

Would {{chmod 000 /path/to/data/dir}} do the trick? That should cause it to 
start getting IOExceptions, which would then get it to eject that disk from its 
list.
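
Roughly, on a running DN (the data directory path is illustrative, and it 
assumes {{dfs.datanode.failed.volumes.tolerated}} is at least 1, otherwise the 
whole DataNode shuts down instead of ejecting the volume):

{code}
# Hedged sketch: force IOExceptions on one data directory while the DN is up.
chmod 000 /data/5/datadir
# Then watch the DataNode log (location varies by install) for the
# volume-failure message before touching the disk further.
tail -f /var/log/hadoop/*datanode*.log
{code}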



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506724#comment-13506724
 ] 

stack commented on HDFS-4239:
-

Let me try.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506731#comment-13506731
 ] 

Harsh J commented on HDFS-4239:
---

Todd's proposal would still require the DN to be started up with a > 1 
toleration value via {{dfs.datanode.failed.volumes.tolerated}}.
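
For example, something along these lines (the getconf subcommand is available 
on 2.x-era releases; with the default of 0, a single failed volume takes the 
whole DataNode down):

{code}
# Hedged sketch: confirm the toleration setting before relying on the chmod trick.
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated
{code}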



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506734#comment-13506734
 ] 

Harsh J commented on HDFS-4239:
---

Typo, I meant >= 1.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506747#comment-13506747
 ] 

Steve Loughran commented on HDFS-4239:
--

Would a {{umount -f}} let you force the unmount?

The big volume management JIRA is HDFS-1362 - that's the one that really needs 
finishing off.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506758#comment-13506758
 ] 

Andy Isaacson commented on HDFS-4239:
-

bq. Would a umount -f let you force the unmount?

Unfortunately not while a running process still has a file descriptor open on 
the volume. It would require {{revoke()}} support in the kernel, and that 
effort foundered many years ago.  http://lwn.net/Articles/262528/
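
You can see who is holding the mount busy with something like the following 
(the mount point is illustrative):

{code}
# Hedged sketch: list the processes that still hold open descriptors under the
# mount point; the DataNode will show up holding in_use.lock (and possibly
# block files), which is what keeps even umount -f from succeeding on Linux.
lsof +f -- /data/5
# or, equivalently:
fuser -vm /data/5
{code}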



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506769#comment-13506769
 ] 

Andy Isaacson commented on HDFS-4239:
-

bq. Would chmod 000 /path/to/data/dir do the trick? That should cause it to 
start getting IOExceptions, which would then get it to eject that disk from its 
list.

Unfortunately the DN keeps the {{in_use.lock}} open even after the volume is 
marked failed.



[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2012-11-29 Thread Andy Isaacson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506791#comment-13506791
 ] 

Andy Isaacson commented on HDFS-4239:
-

To expand on my previous comment:

I tested on trunk, on a DN with {{dfs.datanode.failed.volumes.tolerated=1}}.  
Running {{chmod 0 /data/5/datadir/current}} caused the DN to eject the volume 
and continue operating.  I then used {{lsof -p}} to verify which file 
descriptors remained open and observed that {{/data/5/datadir/in_use.lock}} was 
still open.
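
For reference, the check looked roughly like this (the paths and the way the 
pid is found are illustrative):

{code}
# Hedged sketch of the verification above: after the volume is ejected, list
# the descriptors the DataNode still holds on that datadir.
DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -1)
lsof -p "$DN_PID" | grep /data/5/datadir
# Expect /data/5/datadir/in_use.lock to still be listed, which is why a clean
# unmount of the disk does not succeed.
{code}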
