[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected

2012-08-03 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428200#comment-13428200 ]

Raju commented on HDFS-3744:


Oh, yesterday was my bad day @work :(

Let me correct my second opinion first:

{quote}2. I would like to go with your first opinion, with a STANDBY check in 
replication (or move replication to the Active service).
{quote}

By "first opinion" here I meant Uma Maheshwara Rao's opinion.
I accept that the DFSAdmin command needs to be sent to both NNs. This can solve 
the problem here.
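For example, until dfsadmin is HA-aware, the refresh would have to be run 
against each NameNode explicitly, along the lines of the command in the 
description (SNNIP is a placeholder I am adding for the standby, analogous to 
ANNIP):
{code}
./hdfs dfsadmin -fs hdfs://ANNIP:8020 -refreshNodes
./hdfs dfsadmin -fs hdfs://SNNIP:8020 -refreshNodes
{code}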

That's what you also meant, Aaron.
And I would like to add a Standby check in the replication monitor to avoid 
extra load on the cluster.
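A minimal sketch of that check, with hypothetical names (the real NameNode 
internals differ; this only illustrates idling the monitor while in standby):
{code}
// Hypothetical sketch only -- names do not match the real NameNode code.
enum HAServiceState { ACTIVE, STANDBY }

class ReplicationMonitorSketch implements Runnable {
  volatile HAServiceState state = HAServiceState.ACTIVE;
  volatile boolean running = true;
  final long recheckIntervalMs = 3000; // replication recheck interval

  public void run() {
    while (running) {
      try {
        if (state != HAServiceState.STANDBY) {
          computeReplicationWork(); // schedule replications on the active only
        }
        Thread.sleep(recheckIntervalMs); // a standby just idles here
      } catch (InterruptedException ie) {
        return; // monitor shut down
      }
    }
  }

  void computeReplicationWork() { /* placeholder for the real work */ }
}
{code}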

My first opinion

To support my opinion of persisting the decommissioned node, consider the 
scenarios where
{quote}
The decommission command is issued by the admin but the Standby NN is down, 
etc. -- scenarios where the Standby NN is not available when the command is 
issued.
{quote}

DFSAdmin just creates a DFS client and calls refreshNodes() (the client retries 
the NN 10 times by default, not forever, if I am not wrong). In this case the 
Standby NN never gets to know about the decommission request, so what happens 
when the Standby comes back up and switches to active?

I guess it will treat the decommissioned node as a normal one.

{quote}
I'm hesitant to go with this suggestion. How would differences be rectified 
between what's persisted in the edit log and what's present in the excluded 
hosts file?
{quote}

Initially I also thought the same thing, since we would be persisting the same 
information in two different forms. But consider:

{quote}
The admin could have just configured the exclude nodes in the file without 
issuing the refreshNodes command.
{quote}

By persisting into the edit log we can be sure of which DNs are decommissioned, 
not only on the Standby NN but also when a standalone NN restarts.

Or is there any way the NN can identify a decommissioned node without 
persisting?

Here I mean to persist all the stages of recommission and decommission too.
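Purely as an illustration of "persisting all the stages", the record might look 
something like this (the class and fields are invented for this sketch; no such 
edit-log op exists today):
{code}
// Hypothetical edit-log record for a DN admin-state change.
// Invented for illustration only -- not an existing HDFS opcode.
class DatanodeAdminOp {
  String datanodeId;   // which DN the state change applies to
  AdminState state;    // the stage being persisted
  long mtime;          // when the admin action was logged
}

enum AdminState {
  IN_SERVICE,                // recommissioned
  DECOMMISSION_INPROGRESS,   // refreshNodes accepted, replication under way
  DECOMMISSIONED             // all blocks safely replicated elsewhere
}
{code}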

Please correct me if I am not correct.
I would be very glad to hear more thoughts on my opinion of persisting.

 Decommissioned nodes are included in cluster after switch which is not 
 expected
 ---

 Key: HDFS-3744
 URL: https://issues.apache.org/jira/browse/HDFS-3744
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.0.0-alpha, 2.1.0-alpha, 2.0.1-alpha
Reporter: Brahma Reddy Battula

 Scenario:
 =
 Start ANN and SNN with three DNs.
 Exclude DN1 from the cluster by using the decommission feature
 (./hdfs dfsadmin -fs hdfs://ANNIP:8020 -refreshNodes).
 After the decommission is successful, do a switch such that the SNN becomes 
 Active.
 Here the excluded node (DN1) is included in the cluster, and files can be 
 written to the excluded node since it is no longer excluded.
 Checked the SNN (which was Active before the switch) UI: decommissioned=1; the 
 ANN UI shows decommissioned=0.
 One more observation:
 
 All dfsadmin commands will create a proxy only on nn1, irrespective of Active 
 or standby. I think this also needs a re-look.
 I don't get why HA is not provided for dfsadmin commands.
 Please correct me if I am wrong.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected

2012-08-03 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428228#comment-13428228 ]

Raju commented on HDFS-3744:


I would like to add some scenarios which can be handled by persisting:

1. The network is unreachable between the DFSClient issuing refreshNodes and 
the Standby NN.
2. The network is unreachable between the DFSClient issuing refreshNodes and 
the Active NN; the refreshNodes command reaches the Standby NN, which marks 
DECOMMISSION in progress, and the NN's network later becomes reachable again.

And other network-down scenarios as well.





[jira] [Commented] (HDFS-3735) [ NNUI -- NNJspHelper.java ] Last three fields not considered for display data in sorting

2012-08-02 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427400#comment-13427400 ]

Raju commented on HDFS-3735:


Hey Brahma,
{code}
case FIELD_FAILED_VOL:
  ret = d1.getVolumeFailures() - d2.getVolumeFailures();
  break;
{code}

Since we are using this code in a comparator, it is good to return -1, 0 or 1 
as the return value. The current fix will work even without that, but in 
general those are the values a comparator returns.
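For illustration, a minimal sketch of the normalized comparison, reusing the 
getter names from the snippet above (the surrounding comparator code is assumed 
unchanged):
{code}
case FIELD_FAILED_VOL: {
  // Compare explicitly instead of subtracting: this always yields
  // -1, 0 or 1 and cannot overflow the way a subtraction could.
  int f1 = d1.getVolumeFailures();
  int f2 = d2.getVolumeFailures();
  ret = (f1 < f2) ? -1 : ((f1 == f2) ? 0 : 1);
  break;
}
{code}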

 [ NNUI -- NNJspHelper.java ] Last three fields not considered for display 
 data in sorting
 --

 Key: HDFS-3735
 URL: https://issues.apache.org/jira/browse/HDFS-3735
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 2.0.0-alpha, 2.0.1-alpha
Reporter: Brahma Reddy Battula
Priority: Minor

 Live datanode list is not correctly sorted for columns Block Pool Used (GB),
 Block Pool Used (%)  and Failed Volumes. Read comments for more details.





[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected

2012-08-02 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427404#comment-13427404 ]

Raju commented on HDFS-3744:


I have 2 suggestions to handle this issue:

1. Persist NODE_DECOMMISSIONED on the active NN, so the SNN will also get the 
node as DECOMMISSIONED.

2. I would like to go with your first opinion, with a SAFEMODE check in 
replication (or move replication to the Active service).





[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected

2012-08-01 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426775#comment-13426775 ]

Raju commented on HDFS-3744:


{quote}We have to execute this particular command on both the nodes.
Otherwise, even though we make the failover work here, the SNN will still not 
know about the excluded nodes, as it did not get any refreshNodes command.
{quote}
Until the SNN gets the refreshNodes command it will not be aware of the DN's 
decommission. But if we send the command to the SNN as well, that will trigger 
REPLICATION on the SNN too. Even though the DNs will not accept any requests 
from the SNN, the SNN will keep trying to replicate the blocks, which increases 
the load on the SNN, the network, and the DNs.

This is my initial opinion; please let me know if it is wrong.






[jira] [Commented] (HDFS-3747) In some TestCase used in maven, the balancer test failed because of TimeOut

2012-08-01 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426777#comment-13426777 ]

Raju commented on HDFS-3747:


I guess for this problem we need to look at why the source node does not have 
the blocks.
Can you post the logs, which would be more helpful, and mention which test you 
executed?

 In some TestCase used in maven, the balancer test failed because of TimeOut
 ---

 Key: HDFS-3747
 URL: https://issues.apache.org/jira/browse/HDFS-3747
 Project: Hadoop HDFS
  Issue Type: Test
  Components: balancer
Affects Versions: 2.1.0-alpha, 3.0.0
Reporter: meng gong
  Labels: test
 Fix For: 2.1.0-alpha, 3.0.0


 When running a given test case for the Balancer test, the balancer thread 
 tries to move some blocks across the rack but cannot find any available 
 blocks in the source rack. The thread then will not stop until the isTimeUp 
 flag reaches 20 min, but Maven judges the test failed because the thread has 
 already run for 15 min.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3734) TestFSEditLogLoader.testReplicationAdjusted() will hang if number of blocks are more than one

2012-08-01 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426783#comment-13426783 ]

Raju commented on HDFS-3734:


This is because of an improper calculation of the blocksThreshold, if I am not 
wrong.
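For context, a rough sketch of the safemode threshold idea (the names and 
values below are illustrative, not the actual FSNamesystem members):
{code}
// Illustrative only -- names do not match the real FSNamesystem fields.
float threshold = 0.999f; // cf. dfs.namenode.safemode.threshold-pct
long blockTotal = 100;    // total blocks known to the NN
long blockSafe = 50;      // blocks that have reached minimal replication

long blockThreshold = (long) (blockTotal * threshold);
// With minReplication raised to 2 while every block still has replication 1,
// blockSafe never reaches blockThreshold, so the NN never leaves safemode.
boolean canLeaveSafeMode = blockSafe >= blockThreshold;
{code}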

 TestFSEditLogLoader.testReplicationAdjusted() will hang if number of blocks 
 are more than one
 -

 Key: HDFS-3734
 URL: https://issues.apache.org/jira/browse/HDFS-3734
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.1.0-alpha, 3.0.0
Reporter: Vinay

 TestFSEditLogLoader.testReplicationAdjusted(), which was added in HDFS-2003, 
 will fail if the number of blocks before the cluster restart is more than one.
 Test Scenario:
 --
 1. Write a file with min replication 1 and replication factor 1.
 2. Change the min replication to 2 and restart the cluster.
 Expected: Min replication should be automatically satisfied on cluster 
 restart by replicating more blocks.
 Currently, if the number of blocks before restart is only one, then on 
 restart the NN will not enter safemode; hence replication happens and 
 satisfies the min replication factor.
 If the initial block count is more than 1 with replication factor 1, then on 
 restart the NN will enter safemode and never come out.





[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-07-31 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425849#comment-13425849 ]

Raju commented on HDFS-3731:


{quote}We should probably hook the finalize code to also rm -rf the 
blocksbeingwritten directory, or else the storage will be leaked forever, 
right?{quote}

Yes Todd, I forgot to mention finalize; we need to delete the BBW dir in 
finalize.

{quote}If you agree, maybe we can file a separate JIRA for that, as this JIRA 
is mainly talking about the 2.0 upgrade from 1.0.
{quote}

Uma, I accept your opinion, but we would need to modify code that is already 
released, and I am not very clear on how to fix that.


{quote}I thought that hardlinks to directories are not typically supported 
...{quote}
Robert, here I am not referring directly to a hardlink like
{code}
ln sourceDir destDir
{code}

I am talking about using 
{code}
void org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(File from, 
File to, int oldLV, HardLink hl)
{code}

which will do the individual file linking for all the blocks.




 2.0 release upgrade must handle blocks being written from 1.0
 -

 Key: HDFS-3731
 URL: https://issues.apache.org/jira/browse/HDFS-3731
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Todd Lipcon
Priority: Blocker

 Release 2.0 upgrades must handle blocks being written to (bbw) files from 1.0 
 release. Problem reported by Brahma Reddy.





[jira] [Commented] (HDFS-2956) calling fetchdt without a --renewer argument throws NPE

2012-07-31 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425861#comment-13425861 ]

Raju commented on HDFS-2956:


Here we are defining the protocol message as
{code}
message GetDelegationTokenRequestProto {
  required string renewer = 1;
}
{code}

Based on some of the above comments, I feel the renewer should be optional 
(since null can be passed, meaning we are not providing a renewer).
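The change to the message definition would then be something like this (a 
sketch, assuming the field can simply be relaxed without other compatibility 
concerns):
{code}
message GetDelegationTokenRequestProto {
  optional string renewer = 1;  // relaxed from 'required' so it can be omitted
}
{code}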

Even with optional, the generated class still null-checks the argument passed 
to setRenewer(), so I guess we need a null check for renewer on our side, like

{code}
GetDelegationTokenRequestProto.Builder builder =
    GetDelegationTokenRequestProto.newBuilder();
if (renewer != null) {
  // Only set the field when a renewer was actually supplied.
  builder.setRenewer(renewer.toString());
}
GetDelegationTokenRequestProto req = builder.build();
{code}
This should be possible since we declare renewer optional; similarly, we can 
check hasRenewer() when parsing the message back in the server-side translator.

Please correct me if I am wrong

 calling fetchdt without a --renewer argument throws NPE
 ---

 Key: HDFS-2956
 URL: https://issues.apache.org/jira/browse/HDFS-2956
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Affects Versions: 0.24.0
Reporter: Todd Lipcon
Assignee: Daryn Sharp

 If I call bin/hdfs fetchdt /tmp/mytoken without a --renewer foo argument, 
 then it will throw a NullPointerException:
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:830)
 This is because getDelegationToken is being called with a null renewer.





[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0

2012-07-30 Thread Raju (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425015#comment-13425015 ]

Raju commented on HDFS-3731:


In the upgrade process we usually rename the old folder to the previous folder, 
but here we have 2 folders to handle (one containing finalized blocks, the 
other BBW). I would like to propose:

sd.root/previous -> sd.root/current/BPID/current/finalized
sd.root/blocksbeingwritten -> sd.root/current/BPID/current/rbw

NOTE: -> denotes a hardlink.
NOTE: the blocksbeingwritten folder is not renamed, since renaming it can cause 
rollback problems with old versions.

The above fix has some advantages:
1. Simple change in the existing (2.X+) code,
2. No change required in the old version (1.X-) code,
3. Rollback will work without any new effort.

Please correct me if I am wrong.
