[jira] [Resolved] (HDFS-3307) when a file

2012-04-19 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3307.
---

Resolution: Invalid

> when a file
> ---
>
> Key: HDFS-3307
> URL: https://issues.apache.org/jira/browse/HDFS-3307
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: yixiaohua
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (HDFS-3261) TestHASafeMode fails on HDFS-3042 branch

2012-04-13 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3261.
---

   Resolution: Fixed
Fix Version/s: Auto failover (HDFS-3042)
 Hadoop Flags: Reviewed

Committed this yesterday, just forgot to resolve

> TestHASafeMode fails on HDFS-3042 branch
> 
>
> Key: HDFS-3261
> URL: https://issues.apache.org/jira/browse/HDFS-3261
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: Auto failover (HDFS-3042)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: hdfs-3261.txt
>
>
> TestHASafeMode started failing on the HDFS-3042 branch after the commit of 
> HADOOP-8247. The reason is that testEnterSafeModeInANNShouldNotThrowNPE 
> restarts the active node, and then tries to make an RPC to it right after 
> restarting. The RPC picks up a cached connection to the old (restarted) NN, 
> which causes an EOFException. This was just due to a test change that was 
> made in HADOOP-8247, not due to any change made by the actual patch.





[jira] [Resolved] (HDFS-3159) Document NN auto-failover setup and configuration

2012-04-12 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3159.
---

   Resolution: Fixed
Fix Version/s: Auto failover (HDFS-3042)
 Hadoop Flags: Reviewed

Committed to branch, thanks for reviews

> Document NN auto-failover setup and configuration
> -
>
> Key: HDFS-3159
> URL: https://issues.apache.org/jira/browse/HDFS-3159
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover, documentation, ha
>Affects Versions: Auto failover (HDFS-3042)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: HDFSHighAvailability.html, delta.txt, hdfs-3159.txt, 
> hdfs-3159.txt
>
>
> We should document how to configure, set up, and monitor an automatic 
> failover setup. This will require adding the new configs to the *-default.xml 
> and adding prose to the "apt" docs as well.





[jira] [Resolved] (HDFS-3055) Implement recovery mode for branch-1

2012-04-11 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3055.
---

   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   1.1.0
 Hadoop Flags: Reviewed

Committed to branch-1. Thanks, Colin. Can you please fill in the "Release Note" 
flag here and in HDFS-3004 pointing out the new feature and giving a reference 
to where it is documented?

> Implement recovery mode for branch-1
> 
>
> Key: HDFS-3055
> URL: https://issues.apache.org/jira/browse/HDFS-3055
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
>Priority: Minor
> Fix For: 1.1.0
>
> Attachments: HDFS-3055-b1.001.patch, HDFS-3055-b1.002.patch, 
> HDFS-3055-b1.003.patch, HDFS-3055-b1.004.patch, HDFS-3055-b1.005.patch, 
> HDFS-3055-b1.006.patch, HDFS-3055-b1.007.patch
>
>
> Implement recovery mode for branch-1





[jira] [Resolved] (HDFS-3079) Document -bootstrapStandby flag for deploying HA

2012-04-11 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3079.
---

  Resolution: Duplicate
Target Version/s: 0.24.0, 0.23.3  (was: 0.23.3, 0.24.0)

Dup by MAPREDUCE-4132 (the docs are in the MR project)

> Document -bootstrapStandby flag for deploying HA
> 
>
> Key: HDFS-3079
> URL: https://issues.apache.org/jira/browse/HDFS-3079
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, ha
>Affects Versions: 0.24.0, 0.23.3
>Reporter: Todd Lipcon
>Priority: Minor
>
> HDFS-2731 added a new flag to make it easier to bootstrap a new standby node 
> in an HA cluster. But, I forgot to update the docs. We should improve the 
> docs to mention this new, easier method of setting up HA.





[jira] [Resolved] (HDFS-2995) start-dfs.sh should only start the 2NN for namenodes with dfs.namenode.secondary.http-address configured

2012-04-09 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2995.
---

   Resolution: Fixed
Fix Version/s: 2.0.0
 Hadoop Flags: Reviewed

Looks like this got committed to trunk and branch-2 last week.

> start-dfs.sh should only start the 2NN for namenodes with 
> dfs.namenode.secondary.http-address configured
> 
>
> Key: HDFS-2995
> URL: https://issues.apache.org/jira/browse/HDFS-2995
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: scripts
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Eli Collins
> Fix For: 2.0.0
>
> Attachments: hdfs-2995.txt
>
>
> When I run "start-dfs.sh" it tries to start a 2NN on every node in the 
> cluster. This despite:
> [todd@c1120 hadoop-active]$ ./bin/hdfs getconf -secondaryNameNodes
> Incorrect configuration: secondary namenode address 
> dfs.namenode.secondary.http-address is not configured.
> Thankfully they do not start :)





[jira] [Resolved] (HDFS-3223) Auto HA: add zkfc to hadoop-daemon.sh script

2012-04-06 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3223.
---

   Resolution: Fixed
Fix Version/s: Auto failover (HDFS-3042)
 Hadoop Flags: Reviewed

Committed to branch, thanks.

> Auto HA: add zkfc to hadoop-daemon.sh script
> 
>
> Key: HDFS-3223
> URL: https://issues.apache.org/jira/browse/HDFS-3223
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: auto-failover, scripts
>Affects Versions: Auto failover (HDFS-3042)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: hdfs-3223.txt
>
>
> In order to start the ZKFC, we need to add it to the list of daemons in this 
> script.





[jira] [Resolved] (HDFS-3200) Auto-HA: Scope all ZKFC configs by nameservice

2012-04-05 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3200.
---

   Resolution: Fixed
Fix Version/s: Auto failover (HDFS-3042)
 Hadoop Flags: Reviewed

> Auto-HA: Scope all ZKFC configs by nameservice
> --
>
> Key: HDFS-3200
> URL: https://issues.apache.org/jira/browse/HDFS-3200
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover
>Affects Versions: Auto failover (HDFS-3042)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: hdfs-3200.txt, hdfs-3200.txt
>
>
> Currently the ZKFC is configured with a global set of configs. But different 
> nameservices may want to use different ZooKeeper quorums, ACLs, znodes, etc. 
> So, we should optionally allow the confs to be specified with a nameservice 
> suffix, overriding the same config with no suffix.
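
The override rule described above can be sketched as a two-step lookup: try the key with the nameservice suffix first, then fall back to the unsuffixed key. This is only an illustration, not the code from the patch; the class name, the helper name, and the key `example.zkfc.quorum` are hypothetical, and a plain `Map` stands in for the Hadoop configuration object.

```java
import java.util.HashMap;
import java.util.Map;

class ScopedConfSketch {
    // Return the value of "key.<nameserviceId>" if it is set,
    // otherwise fall back to the global (unsuffixed) "key".
    static String getScoped(Map<String, String> conf, String key, String nameserviceId) {
        if (nameserviceId != null) {
            String scoped = conf.get(key + "." + nameserviceId);
            if (scoped != null) {
                return scoped;
            }
        }
        return conf.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Hypothetical key names, used for illustration only.
        conf.put("example.zkfc.quorum", "zk-global:2181");
        conf.put("example.zkfc.quorum.ns1", "zk-ns1:2181");

        System.out.println(getScoped(conf, "example.zkfc.quorum", "ns1"));
        System.out.println(getScoped(conf, "example.zkfc.quorum", "ns2"));
    }
}
```

With this shape, a nameservice that has no suffixed entry keeps using the global value, so existing single-nameservice configurations would not need to change.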





[jira] [Resolved] (HDFS-1378) Edit log replay should track and report file offsets in case of errors

2012-04-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1378.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

Committed backport to branch-1. Thanks, Colin!

> Edit log replay should track and report file offsets in case of errors
> --
>
> Key: HDFS-1378
> URL: https://issues.apache.org/jira/browse/HDFS-1378
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.22.0
>Reporter: Todd Lipcon
>Assignee: Colin Patrick McCabe
> Fix For: 1.1.0, 0.23.0
>
> Attachments: HDFS-1378-b1.002.patch, HDFS-1378-b1.003.patch, 
> HDFS-1378-b1.004.patch, hdfs-1378-branch20.txt, hdfs-1378.0.patch, 
> hdfs-1378.1.patch, hdfs-1378.2.txt
>
>
> Occasionally there are bugs or operational mistakes that result in corrupt 
> edit logs which I end up having to repair by hand. In these cases it would be 
> very handy to have the error message also print out the file offsets of the 
> last several edit log opcodes so it's easier to find the right place to edit 
> in the OP_INVALID marker. We could also use this facility to provide a rough 
> estimate of how far along edit log replay the NN is during startup (handy 
> when a 2NN has died and replay takes a while)
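
One way to picture the request above is a small bounded buffer that remembers the starting offsets of the last few opcodes, plus a rough progress figure derived from the current offset. This is a sketch under assumptions, not the committed implementation; the class and method names are hypothetical and the offsets are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RecentOffsetsSketch {
    private final int capacity;
    private final Deque<Long> offsets = new ArrayDeque<>();

    RecentOffsetsSketch(int capacity) {
        this.capacity = capacity;
    }

    // Record the file offset at which an opcode starts,
    // dropping the oldest entry once the buffer is full.
    void record(long offset) {
        if (offsets.size() == capacity) {
            offsets.removeFirst();
        }
        offsets.addLast(offset);
    }

    // Offsets of the most recent opcodes, oldest first.
    List<Long> recent() {
        return new ArrayList<>(offsets);
    }

    // Rough replay progress as a percentage of the file length.
    static int percentDone(long currentOffset, long fileLength) {
        return fileLength == 0 ? 100 : (int) (100 * currentOffset / fileLength);
    }

    public static void main(String[] args) {
        RecentOffsetsSketch tracker = new RecentOffsetsSketch(3);
        for (long off : new long[] {0L, 17L, 42L, 90L, 131L}) {
            tracker.record(off);
        }
        // On a replay error, these offsets could be included in the message.
        System.out.println("Recent opcode offsets: " + tracker.recent());
        System.out.println("Progress: " + percentDone(131L, 524L) + "%");
    }
}
```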





[jira] [Resolved] (HDFS-2185) HA: HDFS portion of ZK-based FailoverController

2012-04-02 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2185.
---

   Resolution: Fixed
Fix Version/s: Auto failover (HDFS-3042)
 Hadoop Flags: Reviewed

Committed to auto-failover branch (HDFS-3042). Please feel free to continue 
commenting on the design -- either here, or if you prefer, we can move the 
discussion to the umbrella task HDFS-3042. Apologies for having uploaded the 
document to this subtask instead of the supertask.

> HA: HDFS portion of ZK-based FailoverController
> ---
>
> Key: HDFS-2185
> URL: https://issues.apache.org/jira/browse/HDFS-2185
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover, ha
>Affects Versions: 0.24.0, 0.23.3
>Reporter: Eli Collins
>Assignee: Todd Lipcon
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: Failover_Controller.jpg, hdfs-2185.txt, hdfs-2185.txt, 
> hdfs-2185.txt, hdfs-2185.txt, hdfs-2185.txt, zkfc-design.pdf, 
> zkfc-design.pdf, zkfc-design.pdf, zkfc-design.tex
>
>
> This jira is for a ZK-based FailoverController daemon. The FailoverController 
> is a separate daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs fail-over (standby to active and active to standby transitions)
> * Heartbeats to ensure the liveness
> It should have the same/similar interface as the Linux HA RM to aid 
> pluggability.





[jira] [Resolved] (HDFS-2609) DataNode.getDNRegistrationByMachineName can probably be removed or simplified

2012-03-31 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2609.
---

Resolution: Later

Re-resolving with "Later" instead of "Fixed" since it's not fixed yet.

> DataNode.getDNRegistrationByMachineName can probably be removed or simplified
> -
>
> Key: HDFS-2609
> URL: https://issues.apache.org/jira/browse/HDFS-2609
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: data-node
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Eli Collins
>
> I noticed this while working on HDFS-1971: The 
> {{getDNRegistrationByMachineName}} iterates over block pools to return a 
> given block pool's registration object based on its {{machineName}} field. 
> But, the machine name for every BPOfferService is identical - they're always 
> constructed by just calling {{DataNode.getName}}. All of the call sites for 
> this function are from tests, as well. So, maybe it's not necessary, or at 
> least it might be able to be simplified or moved to a test method.





[jira] [Resolved] (HDFS-3071) haadmin failover command does not provide enough detail for when target NN is not ready to be active

2012-03-22 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3071.
---

   Resolution: Fixed
Fix Version/s: 0.23.2
   0.24.0
 Hadoop Flags: Reviewed

Committed to trunk and 23. Thanks for reviewing, ATM, and for reporting, Philip.

> haadmin failover command does not provide enough detail for when target NN is 
> not ready to be active
> 
>
> Key: HDFS-3071
> URL: https://issues.apache.org/jira/browse/HDFS-3071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha
>Affects Versions: 0.24.0
>Reporter: Philip Zeyliger
>Assignee: Todd Lipcon
> Fix For: 0.24.0, 0.23.2
>
> Attachments: hdfs-3071.txt, hdfs-3071.txt, hdfs-3071.txt, 
> hdfs-3071.txt, hdfs-3071.txt, hdfs-3071.txt
>
>
> When running the failover command, you can get an error message like the 
> following:
> {quote}
> $ hdfs --config $(pwd) haadmin -failover namenode2 namenode1
> Failover failed: xxx.yyy/1.2.3.4:8020 is not ready to become active
> {quote}
> Unfortunately, the error message doesn't describe why that node isn't ready 
> to be active.  In my case, the target namenode's logs don't indicate anything 
> either. It turned out that the issue was "Safe mode is ON.Resources are low 
> on NN. Safe mode must be turned off manually.", but ideally the user would be 
> told that at the time of the failover.





[jira] [Resolved] (HDFS-1197) Blocks are considered "complete" prematurely after commitBlockSynchronization or DN restart

2012-03-16 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1197.
---

   Resolution: Fixed
Fix Version/s: 0.20.205.0
 Assignee: Todd Lipcon

This was committed to branch-20 a long while back

> Blocks are considered "complete" prematurely after commitBlockSynchronization 
> or DN restart
> ---
>
> Key: HDFS-1197
> URL: https://issues.apache.org/jira/browse/HDFS-1197
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node, hdfs client, name-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 0.20-append, 0.20.205.0
>
> Attachments: HDFS-1197-20s.2.patch, HDFS-1197-20s.3.patch, 
> HDFS-1197-without-addStoredBlock-change.1.patch, hdfs-1197-test-changes.txt, 
> hdfs-1197.txt, testTC2-failure.txt
>
>
> I saw this failure once on my internal Hudson job that runs the append tests 
> 48 times a day:
> junit.framework.AssertionFailedError: expected:<114688> but was:<98304>
>   at org.apache.hadoop.hdfs.AppendTestUtil.check(AppendTestUtil.java:112)
>   at 
> org.apache.hadoop.hdfs.TestFileAppend3.testTC2(TestFileAppend3.java:116)





[jira] [Resolved] (HDFS-3074) HDFS ignores group of a user when creating a file or a directory, and instead inherits

2012-03-11 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3074.
---

Resolution: Won't Fix

For whatever strange historic reason, Hadoop uses BSD semantics instead of 
Linux semantics for group ownership at creation. I filed this "bug" once, too :)

> HDFS ignores group of a user when creating a file or a directory, and instead 
> inherits
> --
>
> Key: HDFS-3074
> URL: https://issues.apache.org/jira/browse/HDFS-3074
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.1
>Reporter: Harsh J
>Priority: Minor
>
> When creating a file or making a directory on HDFS, the namesystem calls pass 
> {{null}} for the group name, thereby having the parent directory permissions 
> inherited onto the file.
> This is not how the Linux FS works at least.
> For instance, if I have today a user 'foo' with default group 'foo', and I 
> have my HDFS home dir created as "foo:foo" by the HDFS admin, all files I 
> create under my directory too will have "foo" as group unless I chgrp them 
> myself. This makes sense.
> Now, if my admin were to change my local account's default/primary group to 
> 'bar' (but did not change it on my homedir on HDFS), and I were to continue 
> writing files to my home directory or any subdirectory that has 'foo' as 
> group, all files still get created with group 'foo' - as if the NN has not 
> realized the primary group of the mapped shell account has already changed.
> On Linux this is the opposite. My login session's current primary group is 
> what determines the default group on my created files and directories, not 
> the parent dir owner.
> If the create and mkdirs call passed UGI's group info 
> (UserGroupInformation.getCurrentUser().getGroupNames()[0] should give primary 
> group?) along into their calls instead of a null in the PermissionsStatus 
> object, perhaps this can be avoided.
> Or should we leave this as-is, and instead state that if admins wish their 
> default groups of users to change, they'd have to chgrp all the directories 
> themselves?
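
The difference between the two behaviors discussed in this issue can be written down as two small rules. This is a simplified sketch, not HDFS code: the class and method names are hypothetical, and the Linux rule shown assumes the usual default where a new file takes the creator's current group unless the parent directory has the setgid bit set.

```java
class GroupSemanticsSketch {
    // BSD-style rule: a new file always takes the group of its parent directory.
    static String bsdStyleGroup(String parentDirGroup, String creatorPrimaryGroup) {
        return parentDirGroup;
    }

    // Linux-style rule: a new file takes the creator's current primary group,
    // unless the parent directory has the setgid bit set.
    static String linuxStyleGroup(String parentDirGroup, String creatorPrimaryGroup,
                                  boolean parentSetgid) {
        return parentSetgid ? parentDirGroup : creatorPrimaryGroup;
    }

    public static void main(String[] args) {
        // Home directory owned by group 'foo'; the user's primary group is now 'bar'.
        System.out.println("BSD-style:   " + bsdStyleGroup("foo", "bar"));
        System.out.println("Linux-style: " + linuxStyleGroup("foo", "bar", false));
    }
}
```

Under the BSD-style rule the scenario in the report yields group 'foo', which matches the behavior the reporter observed.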





[jira] [Resolved] (HDFS-3053) Support for true zero-copy (mmap-based) reads

2012-03-06 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3053.
---

Resolution: Duplicate

Looks like Dhruba filed this at the same time I did. Resolving as dup of 
HDFS-3051.

> Support for true zero-copy (mmap-based) reads
> -
>
> Key: HDFS-3053
> URL: https://issues.apache.org/jira/browse/HDFS-3053
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs client
>Reporter: Todd Lipcon
>
> Continuing a discussion started in HDFS-2834, it would be slick to provide 
> support for a true zero-copy read API. In this style API, rather than having 
> the user pass buffers to read into, the DFS client would provide buffers to 
> the user. This allows the client to pass slices of mmaped blocks, for 
> example, which can cut one or two copies out of many read scenarios.
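
The idea of the reader handing out buffers, rather than filling a caller-supplied one, can be sketched with the standard `java.nio` memory-mapping calls on a local file. This is not the proposed HDFS API; the class name and `readSlice` method are hypothetical, and a local temporary file stands in for a block file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapReadSketch {
    // Map the file read-only and return a slice of the mapping.
    // The caller receives a view of the mapped pages; no bytes are
    // copied into a caller-supplied buffer.
    static ByteBuffer readSlice(Path file, int offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            mapped.position(offset);
            mapped.limit(offset + length);
            // The mapping remains valid after the channel is closed.
            return mapped.slice();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mmap-sketch", ".bin");
        Files.write(tmp, "hello zero-copy".getBytes(StandardCharsets.US_ASCII));

        ByteBuffer slice = readSlice(tmp, 6, 4);
        byte[] out = new byte[slice.remaining()];
        slice.get(out); // copied here only so the bytes can be printed
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```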





[jira] [Resolved] (HDFS-3035) HA: fix failure of TestFileAppendRestart due to OP_UPDATE_BLOCKS

2012-03-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3035.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thx.

> HA: fix failure of TestFileAppendRestart due to OP_UPDATE_BLOCKS
> 
>
> Key: HDFS-3035
> URL: https://issues.apache.org/jira/browse/HDFS-3035
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-3035.txt
>
>
> HDFS-3023 added OP_UPDATE_BLOCKS, so the assertions in TestFileAppendRestart 
> which check for a given number of OP_ADDs are no longer correct. Simply need 
> to fix the test to look for OP_UPDATE_BLOCKS.





[jira] [Resolved] (HDFS-3010) Get performance on HA branch to match trunk

2012-02-29 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3010.
---

Resolution: Fixed

With the recent patches committed to the HA branch, performance is now 
comparable.

> Get performance on HA branch to match trunk
> ---
>
> Key: HDFS-3010
> URL: https://issues.apache.org/jira/browse/HDFS-3010
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> As described in [this 
> comment|https://issues.apache.org/jira/browse/HDFS-1623?focusedCommentId=13215309&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13215309]
>  the performance of the HA branch for writes is significantly reduced 
> compared to trunk. We need to dig a bit and optimize whatever it is that's 
> hurting us in order to get back to the same performance numbers.





[jira] [Resolved] (HDFS-3023) Optimize entries in edits log for persistBlocks calls

2012-02-29 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3023.
---

  Resolution: Fixed
   Fix Version/s: HA branch (HDFS-1623)
Target Version/s: HA branch (HDFS-1623), 0.24.0  (was: 0.24.0, HA branch 
(HDFS-1623))
Hadoop Flags: Reviewed

Committed to HA branch.

> Optimize entries in edits log for persistBlocks calls
> -
>
> Key: HDFS-3023
> URL: https://issues.apache.org/jira/browse/HDFS-3023
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node, performance
>Affects Versions: HA branch (HDFS-1623), 0.23.2
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-3023-HDFS-1623.txt, hdfs-3023.txt
>
>
> One of the performance issues noticed in the HA branch is due to the much 
> larger edit logs, now that we are writing OP_ADD transactions to the edit log 
> on every block allocation. We can condense these calls down in several ways:
> 1) use variable-length integers for the block list length, size, and genstamp 
> (most of these end up fitting in far less than 8 bytes)
> 2) use delta-coding for the genstamp and block size for any blocks after the 
> first block (most blocks will be the same size and only slightly higher 
> genstamps)
> 3) introduce a new OP_UPDATE_BLOCKS transaction that doesn't re-serialize 
> metadata information like lease owner, permissions, etc
> 4) allow OP_UPDATE_BLOCKS to only re-serialize the blocks that have changed 
> for a given transaction
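
Items 1 and 2 above can be illustrated with a short sketch: a list of longs is written as a variable-length count, the first value in full, and then each later value as a delta from its predecessor. This is not the on-disk format used by the patch; the class name, the zigzag step for signed deltas, and the sample block sizes are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class BlockListEncodingSketch {
    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static long readVarLong(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Zigzag maps small negative deltas to small unsigned values.
    static long zigZag(long v) { return (v << 1) ^ (v >> 63); }
    static long unZigZag(long v) { return (v >>> 1) ^ -(v & 1); }

    // Count, then the first value in full, then deltas from the previous value.
    static byte[] encodeDelta(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLong(out, values.length);
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            writeVarLong(out, i == 0 ? values[i] : zigZag(values[i] - prev));
            prev = values[i];
        }
        return out.toByteArray();
    }

    static long[] decodeDelta(byte[] data) {
        int[] pos = {0};
        long[] values = new long[(int) readVarLong(data, pos)];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            long raw = readVarLong(data, pos);
            values[i] = (i == 0) ? raw : prev + unZigZag(raw);
            prev = values[i];
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical block sizes: three full 128 MB blocks and a 1 MB tail.
        long[] sizes = {134217728L, 134217728L, 134217728L, 1048576L};
        byte[] encoded = encodeDelta(sizes);
        System.out.println("fixed-width bytes: " + (sizes.length * 8));
        System.out.println("encoded bytes:     " + encoded.length);
        System.out.println("round trip ok:     " + Arrays.equals(sizes, decodeDelta(encoded)));
    }
}
```

Repeated block sizes collapse to a one-byte zero delta, which is where most of the saving in this sketch comes from.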





[jira] [Resolved] (HDFS-3019) TestEditLogJournalFailures silently failing

2012-02-28 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3019.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> TestEditLogJournalFailures silently failing
> ---
>
> Key: HDFS-3019
> URL: https://issues.apache.org/jira/browse/HDFS-3019
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-3019.txt
>
>
> Because we use a crappy version of surefire, currently, we didn't notice 
> this. But, TestEditLogJournalFailures seems to be calling System.exit() -- 
> Surefire reports no tests run, errored, or skipped, but it's lying. I'll file 
> a separate JIRA to upgrade Surefire on trunk, but we also need to investigate 
> this test failure.





[jira] [Resolved] (HDFS-3013) HA: NameNode format doesn't pick up dfs.namenode.name.dir.NameServiceId configure

2012-02-27 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-3013.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Ran the tests, no new failures introduced. I also double checked that the new 
test case fails if the fix is removed. Committed to HA branch. Thanks, Mingjie!

> HA: NameNode format doesn't pick up dfs.namenode.name.dir.NameServiceId 
> configure
> -
>
> Key: HDFS-3013
> URL: https://issues.apache.org/jira/browse/HDFS-3013
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Mingjie Lai
>Assignee: Mingjie Lai
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-3013.patch
>
>
> One problem I observed on the HA branch is that namenode -format doesn't pick up 
> the right configured dfs.namenode.name.dir.NameServiceId. In this case, 
> ``namenode -format'' will format the default directory instead of the 
> configured one, while namenode will use the correct one. 
> The root cause is that NameNode.initializeGenericKeys() is only invoked at 
> the NameNode constructor. So it doesn't get called by static methods like 
> format and finalize. 





[jira] [Resolved] (HDFS-2904) HA: Client support for getting delegation tokens to an HA cluster

2012-02-24 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2904.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks to all of you who reviewed.

> HA: Client support for getting delegation tokens to an HA cluster
> -
>
> Key: HDFS-2904
> URL: https://issues.apache.org/jira/browse/HDFS-2904
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client, name-node, security
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2904.txt, hdfs-2904.txt, hdfs-2904.txt, test-dt.sh
>
>
> Currently we have server-side support for delegation tokens in HA, and some 
> tests to verify it, but the client throws NPEs when trying to fetch a DT. 
> This is because the cluster doesn't have a single hostname, but instead a 
> logical nameservice name.





[jira] [Resolved] (HDFS-2973) HA: re-enable NO_ACK optimization for block deletion

2012-02-22 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2973.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch (ran unit tests on this the other day)

> HA: re-enable NO_ACK optimization for block deletion
> 
>
> Key: HDFS-2973
> URL: https://issues.apache.org/jira/browse/HDFS-2973
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2973.txt
>
>
> Currently in trunk, when a file is removed, the deletion request is sent to 
> the DNs with a special NO_ACK flag to indicate that they don't need to ACK 
> the deletion. The NN itself takes care of removing the blocks from the block 
> map, so the deletion report would be redundant.
> In the HA branch, we disabled this to fix a failure in TestSafeMode -- when 
> the active NN issues a block deletion, and the standby hasn't read the edits 
> yet, this test case expects that it would see the block deletions.
> I don't see any actual compelling reasons for this. I think we can restore 
> the optimization and modify the test to pass.





[jira] [Resolved] (HDFS-2972) HA: small optimization building incremental block report

2012-02-22 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2972.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch. I also tested this on a 100-node cluster.

> HA: small optimization building incremental block report
> 
>
> Key: HDFS-2972
> URL: https://issues.apache.org/jira/browse/HDFS-2972
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2972.txt
>
>
> Currently the incremental block report is a List. However, we only really need 
> to send the most recent information for each block. So, if a block becomes 
> RBW and then becomes FINALIZED before the next incremental block report, 
> we should just report the FINALIZED replica. It's an easy change to just 
> switch to a map by blockid.
> This is on the HA branch since HA changes the code to send RBWs in the 
> incremental reports, hence it's more important for maintaining performance 
> parity with trunk.
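
A minimal sketch of the list-to-map idea described above, assuming a pending set keyed by block ID. The class, enum, and method names here are hypothetical and are not taken from the DataNode code or from the attached patch; the point is only that a later state for the same block overwrites the earlier one, so at most one entry per block is sent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the DataNode's actual classes.
class PendingIncrementalReport {
    enum ReplicaState { RBW, FINALIZED, DELETED }

    // One pending entry per block ID. put() replaces any earlier state
    // recorded for the same block since the last report was sent.
    private final Map<Long, ReplicaState> pending = new LinkedHashMap<>();

    void noteUpdate(long blockId, ReplicaState state) {
        pending.put(blockId, state);
    }

    // Returns a copy of the entries to send and clears the pending set.
    Map<Long, ReplicaState> drain() {
        Map<Long, ReplicaState> toSend = new LinkedHashMap<>(pending);
        pending.clear();
        return toSend;
    }
}
```

With a plain list, a block that goes RBW and then FINALIZED between two reports would produce two entries; with the map it produces one, carrying the FINALIZED state.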





[jira] [Resolved] (HDFS-2929) HA: stress test and fixes for block synchronization

2012-02-22 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2929.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch, thanks for reviews.

> HA: stress test and fixes for block synchronization
> ---
>
> Key: HDFS-2929
> URL: https://issues.apache.org/jira/browse/HDFS-2929
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2929.txt, hdfs-2929.txt, hdfs-2929.txt, 
> hdfs-2929.txt, hdfs-2929.txt
>
>
> We have a couple of TODOs in {{commitBlockSynchronization}} and {{syncBlock}} 
> around HA. I think the current behavior may in fact be correct, but I plan to 
> write a stress test / functional test for better coverage of this area.





[jira] [Resolved] (HDFS-2948) HA: NN throws NPE during shutdown if it fails to startup

2012-02-15 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2948.
---

Resolution: Fixed

Applied the delta, thanks

> HA: NN throws NPE during shutdown if it fails to startup
> 
>
> Key: HDFS-2948
> URL: https://issues.apache.org/jira/browse/HDFS-2948
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2948.txt
>
>
> Last night's nightly build had a bunch of NPEs thrown in NameNode.stop. Not 
> sure which patch introduced the issue, but the problem is that 
> NameNode.stop() is called if an exception is thrown during startup. If the 
> exception is thrown before the namesystem is created, then 
> NameNode.namesystem is null, and {{namesystem.stop}} throws NPE.
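
The failure mode above can be sketched with a null guard in the cleanup path. This is only an illustration under assumed names (MiniNameNode and its nested Namesystem are hypothetical stand-ins, not Hadoop classes), and it is not claimed to match the committed patch.

```java
// Hypothetical stand-ins for NameNode and its namesystem; not Hadoop's classes.
class MiniNameNode {
    static class Namesystem {
        boolean stopped = false;
        void stop() { stopped = true; }
    }

    // Stays null if startup fails before the namesystem is created.
    Namesystem namesystem;

    MiniNameNode(boolean failBeforeNamesystem) {
        try {
            if (failBeforeNamesystem) {
                throw new IllegalStateException("simulated startup failure");
            }
            namesystem = new Namesystem();
        } catch (IllegalStateException e) {
            // Cleanup runs even though namesystem may still be null.
            stop();
            throw e;
        }
    }

    void stop() {
        // Guard: without this check, a failed startup would surface as an
        // NPE from stop() instead of the original startup exception.
        if (namesystem != null) {
            namesystem.stop();
        }
    }
}
```

With the guard in place, the caller sees the original startup exception rather than a NullPointerException that hides it.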





[jira] [Resolved] (HDFS-2957) transactionsSinceLastLogRoll metric can throw IllegalStateException

2012-02-15 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2957.
---

Resolution: Duplicate

Oops, this was a duplicate of HDFS-2955, filed at almost the same time :)

> transactionsSinceLastLogRoll metric can throw IllegalStateException
> ---
>
> Key: HDFS-2957
> URL: https://issues.apache.org/jira/browse/HDFS-2957
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.24.0
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>
> 12/02/15 15:04:36 ERROR lib.MethodMetric: Error invoking method getTransactionsSinceLastLogRoll
> ...
>   at org.apache.hadoop.metrics2.MetricsSystem.register(MetricsSystem.java:54)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:505)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:435)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:399)
> ...
> Caused by: java.lang.IllegalStateException: Bad state: OPEN_FOR_READING
>   at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getCurSegmentTxId(FSEditLog.java:417)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getTransactionsSinceLastLogRoll(FSNamesystem.java:3170)





[jira] [Resolved] (HDFS-2935) HA: Shared edits dir property should be suffixed with nameservice and namenodeID

2012-02-15 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2935.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch.

> HA: Shared edits dir property should be suffixed with nameservice and 
> namenodeID
> 
>
> Key: HDFS-2935
> URL: https://issues.apache.org/jira/browse/HDFS-2935
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Vinithra Varadharajan
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2935.txt
>
>
> Similar to the NameNode's name dirs, we should also be able to specify the 
> shared edits dir as dfs.namenode.shared.edits.dir.nameserviceId.nnId.





[jira] [Resolved] (HDFS-2934) HA: Allow configs to be scoped to all NNs in the nameservice

2012-02-15 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2934.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Thanks for the review. I addressed the nits on commit.

> HA: Allow configs to be scoped to all NNs in the nameservice
> 
>
> Key: HDFS-2934
> URL: https://issues.apache.org/jira/browse/HDFS-2934
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2934.txt
>
>
> Currently, for namenode-specific keys in HA, one must configure them as 
> keyfoo.nameserviceid.namenodeid. However, in many cases all of the NNs in a 
> nameservice would share the same value. So we should allow the configuration 
> of "keyfoo.nameserviceid" to apply to all NNs. The resolution path for these 
> keys would then be:
> keyfoo.nameserviceid.nnid
> keyfoo.nameserviceid (if above not set)
> keyfoo (if neither of above set)
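
The resolution path listed above can be sketched as a most-specific-first lookup. This is an illustration only: ScopedConfLookup and resolve are hypothetical names, and a plain Map stands in for Hadoop's Configuration object.

```java
import java.util.Map;

// Hypothetical helper; a Map stands in for Hadoop's Configuration.
class ScopedConfLookup {
    // Tries key.nsId.nnId, then key.nsId, then the bare key.
    // Returns null if none of the three is set.
    static String resolve(Map<String, String> conf, String key,
                          String nsId, String nnId) {
        String[] candidates = {
            key + "." + nsId + "." + nnId,
            key + "." + nsId,
            key
        };
        for (String candidate : candidates) {
            String value = conf.get(candidate);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

For example, with only "keyfoo.ns1" set, a lookup for any NN in nameservice ns1 returns that value, while setting "keyfoo.ns1.nn1" as well overrides it for nn1 alone.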





[jira] [Resolved] (HDFS-238) imeplement DFSClient on top of thriftfs - this may require DFSClient or DN protocol changes

2012-02-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-238.
--

Resolution: Won't Fix

Marking wontfix, seems like protobuf-based IPC and/or HTTPFS is probably the 
better way forward here.

> imeplement DFSClient on top of thriftfs - this may require DFSClient or DN 
> protocol changes
> ---
>
> Key: HDFS-238
> URL: https://issues.apache.org/jira/browse/HDFS-238
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Pete Wyckoff
>Priority: Minor
>
> Open up DFS Protocol to allow non-Hadoop DFS clients to implement 
> reads/writes.  Obviously, the NN need not be changed because the thriftfs 
> server will serve up the same metadata - ie it's a bridge to the NN.
> This is useful because if we can do this in Java using more open APIs, we 
> could do it in C++ or Python or Perl :)
> Doing it in Java first makes sense because we already have the DFSClient - 
> kind of a proof of concept.





[jira] [Resolved] (HDFS-2948) HA: NN throws NPE during shutdown if it fails to startup

2012-02-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2948.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to ha branch, thanks ATM

> HA: NN throws NPE during shutdown if it fails to startup
> 
>
> Key: HDFS-2948
> URL: https://issues.apache.org/jira/browse/HDFS-2948
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2948.txt
>
>
> Last night's nightly build had a bunch of NPEs thrown in NameNode.stop. Not 
> sure which patch introduced the issue, but the problem is that 
> NameNode.stop() is called if an exception is thrown during startup. If the 
> exception is thrown before the namesystem is created, then 
> NameNode.namesystem is null, and {{namesystem.stop}} throws NPE.





[jira] [Resolved] (HDFS-2945) Allow NNThroughputBenchmark to stress IPC layer

2012-02-13 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2945.
---

Resolution: Won't Fix

> Allow NNThroughputBenchmark to stress IPC layer
> ---
>
> Key: HDFS-2945
> URL: https://issues.apache.org/jira/browse/HDFS-2945
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-2945.txt
>
>
> Currently, the NNThroughputBenchmark acts only as a benchmark of the NN 
> itself. It accesses the NN directly rather than going via IPC. It would be 
> nice to allow it to run in another mode where all access goes through the IPC 
> layer, so that changes in IPC serialization performance can be 
> repeatably/easily measured.





[jira] [Resolved] (HDFS-2060) DFS client RPCs using protobufs

2012-02-10 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2060.
---

Resolution: Duplicate

This was fixed by other work to move RPC to protobuf

> DFS client RPCs using protobufs
> ---
>
> Key: HDFS-2060
> URL: https://issues.apache.org/jira/browse/HDFS-2060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: hdfs-2060-getblocklocations.txt
>
>
> The most important place for wire-compatibility in DFS is between clients and 
> the cluster, since lockstep upgrade is very difficult and a single client may 
> want to talk to multiple server versions. So, I'd like to focus this JIRA on 
> making the RPCs between the DFS client and the NN/DNs wire-compatible using 
> protocol buffer based serialization.





[jira] [Resolved] (HDFS-2478) HDFS Protocols in Protocol Buffers

2012-02-10 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2478.
---

Resolution: Duplicate

This was fixed by the other work to move IPC to protobufs

> HDFS Protocols in Protocol Buffers
> --
>
> Key: HDFS-2478
> URL: https://issues.apache.org/jira/browse/HDFS-2478
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Sanjay Radia
>Assignee: Sanjay Radia
>






[jira] [Resolved] (HDFS-2924) Standby checkpointing fails to authenticate in secure cluster

2012-02-09 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2924.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch, thanks for the review.

> Standby checkpointing fails to authenticate in secure cluster
> -
>
> Key: HDFS-2924
> URL: https://issues.apache.org/jira/browse/HDFS-2924
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node, security
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2924.txt
>
>
> When running HA on a secure cluster, the SBN checkpointing process doesn't 
> seem to pick up the keytab-based credentials for its RPC connection to the 
> active. I think we're just missing a doAs() in the right spot.





[jira] [Resolved] (HDFS-2579) Starting delegation token manager during safemode fails

2012-02-08 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2579.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch, thanks for the reviews.

> Starting delegation token manager during safemode fails
> ---
>
> Key: HDFS-2579
> URL: https://issues.apache.org/jira/browse/HDFS-2579
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node, security
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2579.txt, hdfs-2579.txt, hdfs-2579.txt
>
>
> I noticed this on the HA branch, but it seems to actually affect non-HA 
> branch 0.23 if security is enabled. When the NN starts up, if security is 
> enabled, we start the delegation token secret manager, which then tries to 
> call {{logUpdateMasterKey}}. This fails because the edit logs may not be 
> written while in safe-mode.
> It seems to me that there's no real need to generate a new master key at 
> startup, since the old key is already loaded along with the FSImage. You'd 
> only be lacking a DT master key on a fresh cluster, in which case we could 
> generate one at format time.





[jira] [Resolved] (HDFS-2923) Namenode IPC handler count uses the wrong configuration key

2012-02-08 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2923.
---

   Resolution: Fixed
Fix Version/s: 0.23.2
   0.24.0
 Hadoop Flags: Reviewed

Committed to branch, thanks for reviewing, Eli.

> Namenode IPC handler count uses the wrong configuration key
> ---
>
> Key: HDFS-2923
> URL: https://issues.apache.org/jira/browse/HDFS-2923
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.23.0, 0.23.1
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.24.0, 0.23.2
>
> Attachments: hdfs-2923.txt
>
>
> In HDFS-1763, there was a typo introduced which causes the namenode to use 
> dfs.datanode.handler.count to set the number of IPC threads instead of the 
> correct dfs.namenode.handler.count. This results in bad performance under 
> high load, since there are not nearly enough handlers.





[jira] [Resolved] (HDFS-2632) existing in_use.lock file is removed after failing to lock this file

2012-02-07 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2632.
---

Resolution: Duplicate

> existing in_use.lock file is removed after failing to lock this file
> 
>
> Key: HDFS-2632
> URL: https://issues.apache.org/jira/browse/HDFS-2632
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.21.0
> Environment: Scientific Linux 5.3
>Reporter: Dan Bradley
>
> If an attempt is made to start the namenode when it is already running, an 
> exception is generated on failure to lock in_use.lock.  However, there is a 
> bug: in_use.lock is deleted!  After that, if another attempt is made to start 
> the namenode, there is no in_use.lock file, so the new instance goes ahead 
> and starts messing with the namenode state files.  It eventually fails to 
> bind to the TCP port, but it has already done damage by that time.  
> Specifically, the 'edits' file being written to by the running instance is 
> moved to 'previous.checkpoint' so all further transactions are lost when the 
> HDFS service is next restarted.  We observed a case of data loss because of 
> this.
> This issue relates to HDFS-1690, but the problem in HDFS-1690 was stated in a 
> way that is specific to -format.





[jira] [Resolved] (HDFS-2794) HA: Active NN may purge edit log files before standby NN has a chance to read them

2012-02-06 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2794.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Thanks for the reviews, committed to the HA branch

> HA: Active NN may purge edit log files before standby NN has a chance to read 
> them
> --
>
> Key: HDFS-2794
> URL: https://issues.apache.org/jira/browse/HDFS-2794
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2794.txt, hdfs-2794.txt
>
>
> Given that the active NN is solely responsible for purging finalized edit log 
> segments, and given that the active NN has no way of knowing when the standby 
> reads edit logs, it's possible that the standby NN could fail to read all 
> edits it needs before the active purges the files.





[jira] [Resolved] (HDFS-2874) HA: edit log should log to shared dirs before local dirs

2012-02-03 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2874.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch, thanks for the reviews, all.

> HA: edit log should log to shared dirs before local dirs
> 
>
> Key: HDFS-2874
> URL: https://issues.apache.org/jira/browse/HDFS-2874
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2874.txt, hdfs-2874.txt, hdfs-2874.txt
>
>
> Currently, the NN logs its edits to each of its edits directories in 
> sequence. This can produce the following bad sequence:
> - NN accumulates 100 edits (tx 1-100) in the buffer. Writes and syncs to 
> local drive, then crashes
> - Failover occurs. SBN takes over at txid=1, since txid 1 never got written 
> to the shared edits dir.
> - First NN restarts. It reads up to txid 100 from its local directories. It 
> is now "ahead" of the active NN with inconsistent state.
> The solution is to write to the shared edits dir, and sync that, before 
> writing to any local drives.
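
One way to picture the proposed ordering is to sort the configured edits directories so the shared ones come first before any write-and-sync pass. The sketch below assumes a hypothetical EditsDir type with a "shared" flag and made-up paths; it shows only the ordering step and does not model the journal writes themselves or the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; not the NameNode's journal classes.
class EditsDirOrdering {
    static final class EditsDir {
        final String path;
        final boolean shared; // true for the HA shared edits dir
        EditsDir(String path, boolean shared) {
            this.path = path;
            this.shared = shared;
        }
    }

    // Returns the directories with shared dirs first, keeping the
    // configured order within the shared and local groups.
    static List<EditsDir> sharedFirst(List<EditsDir> configured) {
        List<EditsDir> ordered = new ArrayList<>();
        for (EditsDir d : configured) {
            if (d.shared) ordered.add(d);
        }
        for (EditsDir d : configured) {
            if (!d.shared) ordered.add(d);
        }
        return ordered;
    }
}
```

If the writer syncs directories in this order and crashes partway through, the local dirs can only be behind the shared dir, never ahead of it, which avoids the restart scenario in the description.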





[jira] [Resolved] (HDFS-2779) HA: Add lease recovery handling to HA

2012-02-03 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2779.
---

Resolution: Duplicate

Resolving as duplicate of HDFS-2691 - I've been testing lease recovery on a 
cluster using HBase - I can kill an hbase RS, then do a failover, and the 
master is able to properly recover the in-progress file without losing edits.

> HA: Add lease recovery handling to HA
> -
>
> Key: HDFS-2779
> URL: https://issues.apache.org/jira/browse/HDFS-2779
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Suresh Srinivas
>Assignee: Suresh Srinivas
> Fix For: HA branch (HDFS-1623)
>
>
> Lease recovery needs to be handled in HA setup





[jira] [Resolved] (HDFS-2861) HA: checkpointing should verify that the dfs.http.address has been configured to a non-loopback for peer NN

2012-02-02 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2861.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Contributed to HA branch, thanks for reviews.

> HA: checkpointing should verify that the dfs.http.address has been configured 
> to a non-loopback for peer NN
> ---
>
> Key: HDFS-2861
> URL: https://issues.apache.org/jira/browse/HDFS-2861
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2861.txt, hdfs-2861.txt, hdfs-2861.txt, 
> hdfs-2861.txt
>
>
> In an HA setup I was running for the past week, I just noticed that 
> checkpoints weren't getting properly uploaded, since the SBN was connecting 
> to http://0.0.0.0:50070/ rather than the correct dfs.http.address. So, it was 
> uploading checkpoints to itself instead of the peer. We should add sanity 
> checks during startup to ensure that the configuration is correct.
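
A startup sanity check of the kind suggested above might look like the following sketch. PeerAddressCheck and checkUsable are hypothetical names, and rejecting both wildcard (0.0.0.0) and loopback addresses is an assumption based on the issue title and description, not a statement of what the patch does. Only standard java.net calls are used.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical validation helper; not part of Hadoop.
class PeerAddressCheck {
    // Throws IllegalArgumentException if the peer NN's HTTP address cannot
    // identify a remote host: unresolved, wildcard (0.0.0.0), or loopback.
    static void checkUsable(InetSocketAddress peerHttpAddr) {
        if (peerHttpAddr.isUnresolved()) {
            throw new IllegalArgumentException(
                "Peer HTTP address is unresolved: " + peerHttpAddr);
        }
        InetAddress ip = peerHttpAddr.getAddress();
        if (ip.isAnyLocalAddress() || ip.isLoopbackAddress()) {
            throw new IllegalArgumentException(
                "Peer HTTP address must not be a wildcard or loopback address: "
                + peerHttpAddr);
        }
    }
}
```

Failing fast at startup with a message like this would surface the misconfiguration immediately, rather than leaving the SBN silently uploading checkpoints to itself.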





[jira] [Resolved] (HDFS-2876) The unit tests (src/test/unit) are not being compiled and are not runnable

2012-02-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2876.
---

Resolution: Duplicate

> The unit tests (src/test/unit) are not being compiled and are not runnable
> --
>
> Key: HDFS-2876
> URL: https://issues.apache.org/jira/browse/HDFS-2876
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.23.0
>Reporter: Eli Collins
>
> The unit tests (src/test/unit not src/test/java) are not being compiled and 
> are not runnable. {{mvn -Dtest=TestBlockRecovery test}} executed from 
> hadoop-hdfs-project does not compile or execute the test.
> TestBlockRecovery does not compile yet this test target completes w/o error. 





[jira] [Resolved] (HDFS-2859) LOCAL_ADDRESS_MATCHER.match has NPE when called from DFSUtil.getSuffixIDs when the host is incorrect

2012-02-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2859.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

I committed this. I also fixed an extra unneeded import of NNStorage which was 
in the patch before commit. In the future, can you also please attach "p0" 
patches rather than p1? ie use git diff --no-prefix? Thanks!

> LOCAL_ADDRESS_MATCHER.match has NPE when called from DFSUtil.getSuffixIDs 
> when the host is incorrect
> 
>
> Key: HDFS-2859
> URL: https://issues.apache.org/jira/browse/HDFS-2859
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Bikas Saha
>Assignee: Bikas Saha
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2859.HDFS-1623.patch
>
>






[jira] [Resolved] (HDFS-2870) HA: Remove some INFO level logging accidentally left around

2012-02-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2870.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks.

> HA: Remove some INFO level logging accidentally left around
> ---
>
> Key: HDFS-2870
> URL: https://issues.apache.org/jira/browse/HDFS-2870
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Priority: Trivial
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2870.txt
>
>
> Currently the NN is logging a line per block at INFO level in 
> processMisReplicatedBlocks. This was just for debugging, and should be at 
> trace level.
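
A minimal sketch of the change described above, assuming java.util.logging in place of whatever logging library the NameNode actually uses. The class and method names are hypothetical stand-ins, not the real processMisReplicatedBlocks code. The per-block line is emitted only at the finest level, and a single summary line stays at INFO.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the NN code path; not the real implementation.
final class MisReplicatedBlockLogging {
    static final Logger LOG = Logger.getLogger("MisReplicatedBlockLogging");

    // Returns how many per-block lines were emitted at the current log level.
    static int process(List<String> blockIds) {
        int logged = 0;
        for (String id : blockIds) {
            // FINEST is java.util.logging's closest equivalent to "trace".
            if (LOG.isLoggable(Level.FINEST)) {
                LOG.finest("Processing block " + id);
                logged++;
            }
        }
        // One summary line at INFO instead of one line per block.
        LOG.info("Processed " + blockIds.size() + " blocks");
        return logged;
    }
}
```

With the logger at INFO, no per-block lines are produced; they only appear when the level is lowered for debugging.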





[jira] [Resolved] (HDFS-2824) HA: failover does not succeed if prior NN died just after creating an edit log segment

2012-01-30 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2824.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks Aaron!

> HA: failover does not succeed if prior NN died just after creating an edit 
> log segment
> --
>
> Key: HDFS-2824
> URL: https://issues.apache.org/jira/browse/HDFS-2824
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2824-HDFS-1623.patch, HDFS-2824-HDFS-1623.patch
>
>
> In stress testing failover, I had the following failure:
> - NN1 rolls edit logs and starts writing edits_inprogress_1000
> - NN1 crashes before writing the START_LOG_SEGMENT transaction
> - NN2 tries to become active, and calls {{recoverUnfinalizedSegment}}. Since 
> the log file contains no valid transactions, it is marked as corrupt and 
> renamed with the {{.corrupt}} suffix
> - The sanity check in {{openLogsForWrite}} will refuse to open a new 
> in-progress log at the same txid. Failover does not proceed.
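
The failure sequence above can be reproduced with a toy model of a storage directory. Everything here is an illustrative assumption: the class, the in-memory file set, and the two check methods are hypothetical, and the lenient check is only one possible remedy, not necessarily what the committed patch does.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of a storage directory; file names mirror the ones in the report.
final class EmptySegmentFailover {
    final Set<String> files = new TreeSet<>();

    // Standby-side recovery: a segment with no valid transactions is set aside.
    void recoverUnfinalizedSegment(long txid, int validTxns) {
        String name = "edits_inprogress_" + txid;
        if (files.contains(name) && validTxns == 0) {
            files.remove(name);
            files.add(name + ".corrupt");
        }
    }

    // Strict check: any leftover file for this txid blocks the new segment.
    boolean canOpenStrict(long txid) {
        String name = "edits_inprogress_" + txid;
        return !files.contains(name) && !files.contains(name + ".corrupt");
    }

    // Lenient check (assumption): files already set aside as corrupt are ignored.
    boolean canOpenIgnoringCorrupt(long txid) {
        return !files.contains("edits_inprogress_" + txid);
    }
}
```

In this model the strict check reproduces the stuck failover, while the lenient check lets the new active NN open a fresh segment at the same txid.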





[jira] [Resolved] (HDFS-2691) HA: Tests and fixes for pipeline targets and replica recovery

2012-01-30 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2691.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks for reviews.

> HA: Tests and fixes for pipeline targets and replica recovery
> -
>
> Key: HDFS-2691
> URL: https://issues.apache.org/jira/browse/HDFS-2691
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2691.txt, hdfs-2691.txt, hdfs-2691.txt, 
> hdfs-2691.txt
>
>
> Currently there are some TODOs around pipeline/recovery code in the HA 
> branch. For example, commitBlockSynchronization only gets sent to the active 
> NN which may have failed over by that point. So, we need to write some tests 
> here and figure out what the correct behavior is.
> Another related area is the treatment of targets in the pipeline. When a 
> pipeline is created, the active NN adds the "expected locations" to the 
> BlockInfoUnderConstruction, but the DN identifiers aren't logged with the 
> OP_ADD. So after a failover, the BlockInfoUnderConstruction will have no 
> targets and I imagine replica recovery would probably trigger some issues.





[jira] [Resolved] (HDFS-2804) SBN should not mark blocks under-replicated when exiting safemode

2012-01-23 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2804.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

I ran the HA tests with this patch applied and they all passed. Committed to 
branch, thanks Eli.

> SBN should not mark blocks under-replicated when exiting safemode
> -
>
> Key: HDFS-2804
> URL: https://issues.apache.org/jira/browse/HDFS-2804
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2804.txt, hdfs-2804.txt
>
>
> In testing on a cluster, I restarted with one fewer datanode than 
> previously. This caused a few thousand blocks to be under-replicated. Similar 
> to HDFS-2795, I saw the under-replicated blocks on the SBN, slowly decreasing 
> as the replication thread ran. This seems to be because we process the 
> replication queue when exiting safemode, even if in standby mode. It also 
> reports many "missing blocks" in the NN UI which are slowly decreasing.
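
A rough sketch of the intended behavior, assuming a simplified two-state HA model. The class, enum, and field names are hypothetical and do not correspond to the real NameNode classes. The point is only that leaving safemode should populate replication queues on the active node and not on the standby.

```java
// Hypothetical model; the real logic lives in the NameNode's safemode handling.
final class SafeModeExit {
    enum HAState { ACTIVE, STANDBY }

    private final HAState state;
    private boolean replicationQueuesInitialized = false;

    SafeModeExit(HAState state) {
        this.state = state;
    }

    // Leaving safemode only processes replication queues on the active node.
    void leaveSafeMode() {
        if (state == HAState.ACTIVE) {
            replicationQueuesInitialized = true;
        }
    }

    boolean replicationQueuesInitialized() {
        return replicationQueuesInitialized;
    }
}
```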





[jira] [Resolved] (HDFS-2688) HA: write tests for quota tracking and HA

2012-01-23 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2688.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: write tests for quota tracking and HA
> -
>
> Key: HDFS-2688
> URL: https://issues.apache.org/jira/browse/HDFS-2688
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2688.txt, hdfs-2688.txt
>
>
> Quota is one of the areas where I suspect we might have bugs in the current 
> state of the HA implementation. We need to track quotas correctly on the 
> standby so that when a failover occurs, all of the stats are up to date. 
> Let's add some functional tests that exercise quotas during failover/failback.





[jira] [Resolved] (HDFS-2820) Add a simple sanity check for HA config

2012-01-23 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2820.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks Eli.

> Add a simple sanity check for HA config
> ---
>
> Key: HDFS-2820
> URL: https://issues.apache.org/jira/browse/HDFS-2820
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2820.txt
>
>
> If the user configures a shared edits dir, but doesn't configure the namenode 
> addresses correctly, the NN will fail to start up in a very strange way.
> I had this misconfiguration in one of my test clusters, and it was difficult to 
> debug even though I'm very familiar with the code. This patch adds a simple 
> sanity check so that if a user has the same misconfiguration, the NN will 
> fail to start and give a more informative dump.
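
One way such a sanity check could look, as a sketch only. The configuration is modeled as a plain map, the address key is a made-up placeholder, and neither key name has been checked against the HA branch. The actual patch may validate different settings.

```java
import java.util.Map;

// Hypothetical startup check; not the code from the attached patch.
final class HaConfigSanityCheck {
    // Key names are illustrative assumptions, not confirmed against the branch.
    static final String SHARED_EDITS_KEY = "dfs.namenode.shared.edits.dir";
    static final String NN_ADDRESSES_KEY = "example.ha.namenode.addresses";

    static void check(Map<String, String> conf) {
        String sharedEdits = conf.get(SHARED_EDITS_KEY);
        if (sharedEdits == null || sharedEdits.trim().isEmpty()) {
            return; // No shared edits dir, so nothing to validate here.
        }
        int count = 0;
        String addrs = conf.get(NN_ADDRESSES_KEY);
        if (addrs != null) {
            for (String a : addrs.split(",")) {
                if (!a.trim().isEmpty()) {
                    count++;
                }
            }
        }
        if (count < 2) {
            // Fail fast with a message that names both settings involved.
            throw new IllegalArgumentException(
                "Shared edits dir is set (" + SHARED_EDITS_KEY + "=" + sharedEdits
                + ") but fewer than two namenode addresses are configured in "
                + NN_ADDRESSES_KEY);
        }
    }
}
```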





[jira] [Resolved] (HDFS-2812) When becoming active, NN should treat all leases as freshly renewed

2012-01-19 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2812.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks for review.

> When becoming active, NN should treat all leases as freshly renewed
> ---
>
> Key: HDFS-2812
> URL: https://issues.apache.org/jira/browse/HDFS-2812
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2812.txt, hdfs-2812.txt
>
>
> I haven't seen this yet in practice, but I think the following is a bug:
> - Clients currently only renew leases to the active NN
> - Leases are written into the log when created, but no record is logged when 
> they are renewed
> - In Standby state, we don't check leases for expiration. But, when we fail 
> over and start the lease monitor, since the leases haven't been renewed, they 
> might end up getting incorrectly released.
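
The scenario above can be sketched with a toy lease table. The class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic. The idea from the issue title is that the transition to active resets every lease's renewal time before the lease monitor starts.

```java
import java.util.HashMap;
import java.util.Map;

// Toy lease table; not the real lease manager.
final class LeaseRenewalOnFailover {
    private final Map<String, Long> lastRenewedMs = new HashMap<>();
    private final long expiryMs;

    LeaseRenewalOnFailover(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Lease creation is replayed from the edit log; renewals are not.
    void addLease(String holder, long nowMs) {
        lastRenewedMs.put(holder, nowMs);
    }

    // Called when becoming active: treat every lease as freshly renewed.
    void renewAllLeases(long nowMs) {
        lastRenewedMs.replaceAll((holder, previous) -> nowMs);
    }

    boolean isExpired(String holder, long nowMs) {
        Long renewed = lastRenewedMs.get(holder);
        return renewed == null || nowMs - renewed > expiryMs;
    }
}
```

Without the renewAllLeases call, a lease created long before the failover looks expired as soon as the monitor runs, even though the client was renewing it against the old active NN.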





[jira] [Resolved] (HDFS-2592) HA: Balancer support for HA namenodes

2012-01-17 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2592.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: Balancer support for HA namenodes
> -
>
> Key: HDFS-2592
> URL: https://issues.apache.org/jira/browse/HDFS-2592
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: balancer, ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Uma Maheswara Rao G
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2592.patch, HDFS-2592.patch, HDFS-2592.patch, 
> HDFS-2592.patch
>
>
> The balancer currently interacts directly with namenode InetSocketAddresses 
> and makes its own IPC proxies. We need to integrate it with HA so that it 
> uses the same client failover infrastructure.





[jira] [Resolved] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up

2012-01-16 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2795.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch. I fixed the "20" to "5", good catch.

> HA: Standby NN takes a long time to recover from a dead DN starting up
> --
>
> Key: HDFS-2795
> URL: https://issues.apache.org/jira/browse/HDFS-2795
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2795.txt
>
>
> To reproduce:
> # Start an HA cluster with a DN.
> # Write several blocks to the FS with replication 1.
> # Shutdown the DN
> # Wait for the NNs to declare the DN dead. All blocks will be 
> under-replicated.
> # Restart the DN.
> Note that upon restarting the DN, the active NN will immediately get all 
> block locations from the initial BR. The standby NN will not, and instead 
> will slowly add block locations for a subset of the previously-missing blocks 
> on every DN heartbeat.





[jira] [Resolved] (HDFS-2767) HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol

2012-01-16 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2767.
---

  Resolution: Fixed
   Fix Version/s: HA branch (HDFS-1623)
Target Version/s: HA branch (HDFS-1623)
Hadoop Flags: Reviewed

Committed to branch, thanks Uma.

> HA: ConfiguredFailoverProxyProvider should support NameNodeProtocol
> ---
>
> Key: HDFS-2767
> URL: https://issues.apache.org/jira/browse/HDFS-2767
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Blocker
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2767.patch, HDFS-2767.patch, HDFS-2767.patch, 
> hdfs-2767-what-todd-had.txt
>
>
> Presently ConfiguredFailoverProxyProvider supports only ClientProtocol.
> It should also support NameNodeProtocol, because the Balancer uses 
> NameNodeProtocol for getting blocks.





[jira] [Resolved] (HDFS-2747) HA: entering safe mode after starting SBN can NPE

2012-01-16 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2747.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: entering safe mode after starting SBN can NPE
> -
>
> Key: HDFS-2747
> URL: https://issues.apache.org/jira/browse/HDFS-2747
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Uma Maheswara Rao G
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2747.patch, HDFS-2747.patch
>
>
> Entering safemode on the primary while it is already in safemode, after the 
> SBN has been started, results in an NPE: 
> {noformat}
> hadoop-0.24.0-SNAPSHOT $ ./bin/hdfs dfsadmin -safemode get
> Safe mode is ON
> hadoop-0.24.0-SNAPSHOT $ ./bin/hdfs dfsadmin -safemode enter
> safemode: java.lang.NullPointerException
> {noformat}





[jira] [Resolved] (HDFS-2775) HA: TestStandbyCheckpoints.testBothNodesInStandbyState fails intermittently

2012-01-10 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2775.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thx.

> HA: TestStandbyCheckpoints.testBothNodesInStandbyState fails intermittently
> ---
>
> Key: HDFS-2775
> URL: https://issues.apache.org/jira/browse/HDFS-2775
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2775.txt
>
>
> This test is failing periodically on this assertion:
> {code}
> assertEquals(12, nn0.getNamesystem().getFSImage().getStorage()
> .getMostRecentCheckpointTxId());
> {code}
> My guess is it's a test race. Investigating...
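
If the failure is indeed a race, a common remedy in tests is to poll for the expected value with a timeout instead of asserting immediately. The helper below is a generic sketch with a hypothetical name. It is not the fix from the attached patch and not an existing Hadoop test utility.

```java
import java.util.function.BooleanSupplier;

// Generic polling helper for tests; hypothetical, not part of Hadoop.
final class WaitForCondition {
    // Polls until the condition holds or the timeout elapses.
    // Returns true if the condition was observed, false on timeout or interrupt.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

A test would wrap the checkpoint txid comparison in the condition and assert on the returned boolean.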





[jira] [Resolved] (HDFS-2773) HA: reading edit logs from an earlier version leaves blocks in under-construction state

2012-01-10 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2773.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thx for review.

> HA: reading edit logs from an earlier version leaves blocks in 
> under-construction state
> ---
>
> Key: HDFS-2773
> URL: https://issues.apache.org/jira/browse/HDFS-2773
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hadoop-1.0-multiblock-file.tgz, hdfs-2773.txt
>
>
> In HDFS-2602, the code for applying OP_ADD and OP_CLOSE was changed a bit, 
> and the new code has the following problem: if an OP_CLOSE includes new 
> blocks (ie not previously seen in an OP_ADD) then those blocks will remain in 
> the "under construction" state rather than being marked "complete". This is 
> because {{updateBlocks}} always creates {{BlockInfoUnderConstruction}} 
> regardless of the opcode. This bug only affects the upgrade path, since in 
> trunk we always persist blocks with OP_ADDs before we call OP_CLOSE.
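
The problem can be illustrated with a deliberately simplified model in which the state of newly seen blocks is chosen while replaying an op. The enums and both methods are hypothetical, and the second method is only one plausible correction, not necessarily the committed one. It also ignores details such as which block of an open file is the last one.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified replay model; not the real updateBlocks implementation.
final class BlockStateOnReplay {
    enum OpCode { OP_ADD, OP_CLOSE }
    enum BlockState { UNDER_CONSTRUCTION, COMPLETE }

    // Behavior described in the report: the opcode is ignored.
    static List<BlockState> updateBlocksIgnoringOpcode(OpCode op, int newBlocks) {
        List<BlockState> states = new ArrayList<>();
        for (int i = 0; i < newBlocks; i++) {
            states.add(BlockState.UNDER_CONSTRUCTION);
        }
        return states;
    }

    // Assumed correction: blocks first seen in an OP_CLOSE are complete.
    static List<BlockState> updateBlocks(OpCode op, int newBlocks) {
        BlockState state = (op == OpCode.OP_CLOSE)
            ? BlockState.COMPLETE
            : BlockState.UNDER_CONSTRUCTION;
        List<BlockState> states = new ArrayList<>();
        for (int i = 0; i < newBlocks; i++) {
            states.add(state);
        }
        return states;
    }
}
```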





[jira] [Resolved] (HDFS-2753) Standby namenode stuck in safemode during a failover

2012-01-10 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2753.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> Standby namenode stuck in safemode during a failover
> 
>
> Key: HDFS-2753
> URL: https://issues.apache.org/jira/browse/HDFS-2753
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Hari Mankude
>Assignee: Hari Mankude
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2753.patch, hdfs-2753.txt, hdfs-2753.txt
>
>
> Write traffic is initiated from the client. Manual failover is done by killing 
> the NN and converting a different standby to active. The NN is restarted as 
> standby. The restarted standby stays in safemode forever. More information in 
> the description.





[jira] [Resolved] (HDFS-2724) NN web UI can throw NPE after startup, before standby state is entered

2012-01-09 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2724.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks Eli.

> NN web UI can throw NPE after startup, before standby state is entered
> --
>
> Key: HDFS-2724
> URL: https://issues.apache.org/jira/browse/HDFS-2724
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2724.txt
>
>
> There's a brief period of time (a few seconds) after the NN web server has 
> been initialized, but before the NN's HA state is initialized. If 
> {{dfshealth.jsp}} is hit during this time, a {{NullPointerException}} will be 
> thrown.





[jira] [Resolved] (HDFS-2762) TestCheckpoint is timing out

2012-01-09 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2762.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Indeed, looks like you're right. Committed to the branch, thanks Uma

> TestCheckpoint is timing out
> 
>
> Key: HDFS-2762
> URL: https://issues.apache.org/jira/browse/HDFS-2762
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Uma Maheswara Rao G
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2762.patch
>
>
> TestCheckpoint is timing out on the HA branch, and has been for a few days.





[jira] [Resolved] (HDFS-2730) HA: Refactor shared HA-related test code into HATestUtils class

2012-01-08 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2730.
---

  Resolution: Fixed
Hadoop Flags: Reviewed

Committed to branch, thx for the review.

> HA: Refactor shared HA-related test code into HATestUtils class
> ---
>
> Key: HDFS-2730
> URL: https://issues.apache.org/jira/browse/HDFS-2730
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2730.txt
>
>
> A fair number of the HA tests are sharing code like 
> {{waitForStandbyToCatchUp}}, etc. We should refactor this code into an 
> HATestUtils class with static methods.





[jira] [Resolved] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

2012-01-06 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2709.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

All the HA tests passed. Committed to branch, thanks atm.

> HA: Appropriately handle error conditions in EditLogTailer
> --
>
> Key: HDFS-2709
> URL: https://issues.apache.org/jira/browse/HDFS-2709
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Aaron T. Myers
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If (a) isn't simple, let's do (b) for now and 
> come back to (a) later, since this is a rare circumstance and it is better to 
> abort than be incorrect.
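
Option (a) can be sketched as follows, assuming each edit carries a monotonically increasing transaction ID. The class is a toy: the "edit" is a counter increment chosen because it is not idempotent, and the failure is simulated. None of these names come from EditLogTailer.

```java
// Toy replayer that remembers the last applied txid; hypothetical names.
final class ResumableEditReplay {
    private long lastAppliedTxId = 0;
    private long counter = 0; // mutated by a non-idempotent "increment" edit

    // Replays txids in order, skipping any that were already applied.
    // failAtTxId simulates an error partway through the file (use -1 for none).
    void replay(long[] txIds, long failAtTxId) {
        for (long txId : txIds) {
            if (txId <= lastAppliedTxId) {
                continue; // already applied in an earlier iteration
            }
            if (txId == failAtTxId) {
                throw new IllegalStateException("Simulated error at txid " + txId);
            }
            counter++;
            lastAppliedTxId = txId;
        }
    }

    long counter() {
        return counter;
    }

    long lastAppliedTxId() {
        return lastAppliedTxId;
    }
}
```

Retrying the same file after a failure then applies each edit exactly once, which a restart from the beginning of the file would not guarantee.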





[jira] [Resolved] (HDFS-2291) HA: Checkpointing in an HA setup

2012-01-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2291.
---

  Resolution: Fixed
Target Version/s: HA branch (HDFS-1623)
Hadoop Flags: Reviewed

Committed to HA branch

> HA: Checkpointing in an HA setup
> 
>
> Key: HDFS-2291
> URL: https://issues.apache.org/jira/browse/HDFS-2291
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2291.txt, hdfs-2291.txt, hdfs-2291.txt, 
> hdfs-2291.txt
>
>
> We obviously need to create checkpoints when HA is enabled. One thought is to 
> use a third, dedicated checkpointing node in addition to the active and 
> standby nodes. Another option would be to make the standby capable of also 
> performing the function of checkpointing.





[jira] [Resolved] (HDFS-2720) HA : TestStandbyIsHot is failing while copying in_use.lock file from NN1 nameSpaceDirs to NN2 nameSpaceDirs

2012-01-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2720.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks Uma.

> HA : TestStandbyIsHot is failing while copying in_use.lock file from NN1 
> nameSpaceDirs to NN2 nameSpaceDirs 
> 
>
> Key: HDFS-2720
> URL: https://issues.apache.org/jira/browse/HDFS-2720
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2720.patch, HDFS-2720.patch
>
>
> To keep the clusterID the same, we copy the namespaceDirs from the 1st NN 
> to the other NNs.
> While copying these files, the in_use.lock file may fail to copy on some 
> OSs, since the running NN has acquired a lock on it. 
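
One way to avoid the problem is to skip the lock file while copying the directory tree. The sketch below uses only java.nio.file from the JDK. The class name is hypothetical, and whether the attached patch takes this approach is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical test helper; copies a directory tree but leaves out in_use.lock.
final class CopySkippingLockFile {
    static void copyDir(Path src, Path dst) {
        // Files.walk visits a directory before its children, so parent
        // directories are created before the files inside them are copied.
        try (Stream<Path> paths = Files.walk(src)) {
            paths.forEach(p -> {
                if (p.getFileName().toString().equals("in_use.lock")) {
                    return; // held by the running NN; do not copy
                }
                Path target = dst.resolve(src.relativize(p).toString());
                try {
                    if (Files.isDirectory(p)) {
                        Files.createDirectories(target);
                    } else {
                        Files.copy(p, target);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```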





[jira] [Resolved] (HDFS-2716) HA: Configuration needs to allow different dfs.http.addresses for each HA NN

2011-12-30 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2716.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: Configuration needs to allow different dfs.http.addresses for each HA NN
> 
>
> Key: HDFS-2716
> URL: https://issues.apache.org/jira/browse/HDFS-2716
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2716.txt, hdfs-2716.txt
>
>
> Earlier on the HA branch we expanded the configuration so that different IPC 
> addresses can be specified for each of the HA NNs in a cluster. But we didn't 
> do this for the HTTP address. This has proved problematic while working on 
> HDFS-2291 (checkpointing in HA).





[jira] [Resolved] (HDFS-2692) HA: Bugs related to failover from/into safe-mode

2011-12-29 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2692.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks for the reviews, Aaron and Eli. I filed HDFS-2730 
for the test util refactor

> HA: Bugs related to failover from/into safe-mode
> 
>
> Key: HDFS-2692
> URL: https://issues.apache.org/jira/browse/HDFS-2692
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2692.txt, hdfs-2692.txt, hdfs-2692.txt
>
>
> In testing I saw an AssertionError come up several times when I was trying to 
> do failover between two NNs where one or the other was in safe-mode. Need to 
> write some unit tests to try to trigger this -- hunch is it has something to 
> do with the treatment of "safe block count" while tailing edits in safemode.





[jira] [Resolved] (HDFS-2714) HA: Fix test cases which use standalone FSNamesystems

2011-12-29 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2714.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: Fix test cases which use standalone FSNamesystems
> -
>
> Key: HDFS-2714
> URL: https://issues.apache.org/jira/browse/HDFS-2714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2714.txt
>
>
> Several tests (eg TestEditLog, TestSaveNamespace) failed in the most recent 
> build with an NPE inside of FSNamesystem.checkOperation. These tests set up a 
> standalone FSN that isn't fully initialized. We just need to add a null check 
> to deal with this case in checkOperation.
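
A minimal sketch of the null check described above. The class, the HaContext interface, and the boolean return value are hypothetical stand-ins for illustration; this is not the real FSNamesystem code.

```java
// Hypothetical stand-ins; not the real FSNamesystem or HA context classes.
class StandaloneNamesystemSketch {
    enum OperationCategory { READ, WRITE }

    /** Minimal stand-in for the HA context that a full NameNode would supply. */
    interface HaContext {
        void checkOperation(OperationCategory op);
    }

    // Null when the namesystem is constructed standalone, as some tests do.
    private final HaContext haContext;

    StandaloneNamesystemSketch(HaContext haContext) {
        this.haContext = haContext;
    }

    /** Returns true if the check was delegated, false if it was skipped. */
    boolean checkOperation(OperationCategory op) {
        if (haContext == null) {
            // Standalone namesystem: there is no HA state to consult.
            return false;
        }
        haContext.checkOperation(op);
        return true;
    }
}
```

The return value exists only so the two paths can be told apart in a test; a real method would most likely stay void.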





[jira] [Resolved] (HDFS-2603) HA: don't initialize replication queues until entering Active mode

2011-12-20 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2603.
---

Resolution: Duplicate

This work got incorporated into HDFS-1972

> HA: don't initialize replication queues until entering Active mode
> --
>
> Key: HDFS-2603
> URL: https://issues.apache.org/jira/browse/HDFS-2603
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> As described in the comments of HDFS-1975:
> 1) Active NN receives setReplication to drop some file's replication from 3 
> to 1
> 2) It writes OP_SET_REPLICATION to its log, invalidates two replicas, and 
> returns
> 3) The DNs report BLOCK_INVALIDATED back to both the ActiveNN and SBNN.
> 4) The SBNN hasn't received the OP_SET_REPLICATION yet, so it marks the block 
> as under-replicated.
> In the case of raising replication (eg from 1 to 3) we get the opposite 
> problem: the SBNN marks the block as over-replicated and adds two of the 
> replicas to its invalidation list.
> Generation stamps don't help here, because changing replication level of a 
> block doesn't change its gen-stamp (and it shouldn't). One possible answer is 
> that we need to modify FSNamesystem.isPopulatingReplQueues to return false on 
> the standby, and then when it switches from standby to active, initialize the 
> replication queues only after reading the latest edits... I think that will 
> solve the SET_REPLICATION issue, but not certain if it will solve all the 
> issues in this general class.
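
The proposal above can be modelled roughly as follows. This is a sketch under stated assumptions: the class, its maps, and the method names other than isPopulatingReplQueues are hypothetical, and blocks are reduced to string IDs with integer replica counts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model; names do not correspond to real HDFS classes.
class ReplQueueSketch {
    private final Map<String, Integer> expected = new HashMap<>(); // target replication, from edits
    private final Map<String, Integer> live = new HashMap<>();     // replicas reported by DNs
    private final Set<String> underReplicated = new TreeSet<>();
    private boolean active = false;

    /** False on the standby, so reports never touch the queues there. */
    boolean isPopulatingReplQueues() { return active; }

    /** Applied from the edit log, e.g. for OP_SET_REPLICATION. */
    void applySetReplication(String block, int replication) {
        expected.put(block, replication);
        if (isPopulatingReplQueues()) recheck(block);
    }

    /** Applied from a datanode report. */
    void reportReplicas(String block, int count) {
        live.put(block, count);
        if (isPopulatingReplQueues()) recheck(block);
    }

    /** To be called only after the standby has read the latest edits. */
    void transitionToActive() {
        active = true;
        for (String block : expected.keySet()) recheck(block);
    }

    Set<String> underReplicated() { return underReplicated; }

    private void recheck(String block) {
        int want = expected.getOrDefault(block, 0);
        int have = live.getOrDefault(block, 0);
        if (have < want) underReplicated.add(block); else underReplicated.remove(block);
    }
}
```

In this model a standby that sees the replica count drop from 3 to 1 before it reads OP_SET_REPLICATION records nothing, and the queues built at transition time reflect the new target of 1.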





[jira] [Resolved] (HDFS-1972) HA: Datanode fencing mechanism

2011-12-20 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1972.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch. Thanks for the reviews

> HA: Datanode fencing mechanism
> --
>
> Key: HDFS-1972
> URL: https://issues.apache.org/jira/browse/HDFS-1972
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha, name-node
>Reporter: Suresh Srinivas
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-1972-v1.txt, hdfs-1972.txt, hdfs-1972.txt, 
> hdfs-1972.txt, hdfs-1972.txt, hdfs-1972.txt
>
>
> In a high availability setup with an active and a standby namenode, both 
> namenodes may send commands to the datanode. The datanode must honor commands 
> only from the active namenode and reject commands from the standby, to 
> prevent corruption. This invariant must hold during failover and in other 
> states such as split brain. This jira covers the related issues, the design 
> of the solution, and its implementation.





[jira] [Resolved] (HDFS-2693) Synchronization issues around state transition

2011-12-20 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2693.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thx for the reviews Eli.

> Synchronization issues around state transition
> --
>
> Key: HDFS-2693
> URL: https://issues.apache.org/jira/browse/HDFS-2693
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2693.txt, hdfs-2693.txt, hdfs-2693.txt, 
> hdfs-2693.txt, hdfs-2693.txt
>
>
> Currently when the NN changes state, it does so without synchronization. In 
> particular, the state transition function does:
> (1) leave old state
> (2) change state variable
> (3) enter new state
> This means that the NN is marked as "active" before it has actually 
> transitioned to active mode and opened its edit logs. This gives a window 
> where write transactions can come in and the {{checkOperation}} allows them, 
> but then they fail because the edit log is not yet opened.
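
One way to close the window described above, sketched with a read-write lock. This is an illustration only: the class and its fields are hypothetical, and the committed patch may synchronize differently.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the transition; not the real NameNode code.
class StateTransitionSketch {
    enum State { STANDBY, ACTIVE }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.STANDBY;
    private boolean editLogOpen = false;

    void transitionToActive() {
        // Writers are excluded for the whole transition, so none can observe
        // the "active" state before the edit log is open.
        lock.writeLock().lock();
        try {
            editLogOpen = true;     // enter the new state (open the edit log)
            state = State.ACTIVE;   // only then publish the state variable
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns true if the edit log was open when the write was admitted. */
    boolean write() {
        lock.readLock().lock();
        try {
            if (state != State.ACTIVE) {
                throw new IllegalStateException("not active");
            }
            return editLogOpen;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```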





[jira] [Resolved] (HDFS-2682) HA: When a FailoverProxyProvider is used, Client should not retry for 45 times(hard coded value) if it is timing out to connect to server.

2011-12-19 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2682.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA:  When a FailoverProxyProvider is used, Client should not retry for 45 
> times(hard coded value)  if it is timing out to connect to server.
> 
>
> Key: HDFS-2682
> URL: https://issues.apache.org/jira/browse/HDFS-2682
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2682.patch
>
>
> If a client gets a SocketTimeoutException while trying to connect to the 
> server, it retries 45 times before rethrowing the exception to the 
> RetryPolicy.
> I think we can make this retry count of 45 configurable and set it to a lower 
> value when a FailoverProxyProvider is used.





[jira] [Resolved] (HDFS-2678) HA: When a FailoverProxyProvider is used, DFSClient should not retry connection ten times before failing over

2011-12-19 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2678.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch w/ the abovementioned change. Thanks atm.

> HA: When a FailoverProxyProvider is used, DFSClient should not retry 
> connection ten times before failing over
> -
>
> Key: HDFS-2678
> URL: https://issues.apache.org/jira/browse/HDFS-2678
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2678.patch
>
>
> Even though the {{FailoverOnNetworkExceptionRetry}} retry policy tries to 
> fail over immediately, in the event the active is down, o.a.h.ipc.Client 
> (below the RetryPolicy) will retry the initial connection 10 times before 
> bubbling up this exception to the RetryPolicy. This should be much quicker.
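
A rough sketch of making the connection retry count depend on whether a failover proxy provider is in use. The configuration key names and the failover default of 0 are assumptions made for illustration, and a plain Map stands in for the Hadoop Configuration object.

```java
import java.util.Map;

// Hypothetical helper; the key names below are illustrative only.
class ConnectRetrySketch {
    static final String DEFAULT_RETRIES_KEY = "example.ipc.connect.max.retries";
    static final String FAILOVER_RETRIES_KEY = "example.failover.connect.max.retries";
    static final int DEFAULT_RETRIES = 10;  // the retry count described in the report
    static final int FAILOVER_RETRIES = 0;  // assumed default: surface the failure at once

    /** Picks the retry limit for the initial connection attempt. */
    static int maxConnectRetries(Map<String, String> conf, boolean usesFailoverProxyProvider) {
        String key = usesFailoverProxyProvider ? FAILOVER_RETRIES_KEY : DEFAULT_RETRIES_KEY;
        int fallback = usesFailoverProxyProvider ? FAILOVER_RETRIES : DEFAULT_RETRIES;
        String value = conf.get(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}
```

With a low limit the connection error reaches the RetryPolicy quickly, which can then fail over to the other NN instead of waiting on the dead one.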





[jira] [Resolved] (HDFS-1108) Log newly allocated blocks

2011-12-18 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1108.
---

Resolution: Duplicate

Resolving this one as duplicate since it got incorporated into HDFS-2602

> Log newly allocated blocks
> --
>
> Key: HDFS-1108
> URL: https://issues.apache.org/jira/browse/HDFS-1108
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Reporter: dhruba borthakur
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-1108.patch, hdfs-1108-habranch.txt, 
> hdfs-1108-habranch.txt, hdfs-1108-habranch.txt, hdfs-1108-habranch.txt, 
> hdfs-1108-habranch.txt, hdfs-1108.txt
>
>
> The current HDFS design says that newly allocated blocks for a file are not 
> persisted in the NN transaction log when the block is allocated. Instead, a 
> hflush() or a close() on the file persists the blocks into the transaction 
> log. It would be nice if we can immediately persist newly allocated blocks 
> (as soon as they are allocated) for specific files.





[jira] [Resolved] (HDFS-2677) HA: Web UI should indicate the NN state

2011-12-18 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2677.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: Web UI should indicate the NN state
> ---
>
> Key: HDFS-2677
> URL: https://issues.apache.org/jira/browse/HDFS-2677
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Eli Collins
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2677.txt, hdfs-2677.txt, hdfs-2677.txt, 
> hdfs-2677.txt
>
>
> The DFS web UI should indicate whether it's an active or standby.





[jira] [Resolved] (HDFS-2679) Add interface to query current state to HAServiceProtocol

2011-12-18 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2679.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> Add interface to query current state to HAServiceProtocol 
> --
>
> Key: HDFS-2679
> URL: https://issues.apache.org/jira/browse/HDFS-2679
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Eli Collins
>Assignee: Eli Collins
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2679.txt, hdfs-2679.txt, hdfs-2679.txt, 
> hdfs-2679.txt, hdfs-2679.txt
>
>
> Let's add an interface to HAServiceProtocol to query the current state of a 
> NameNode for use by the CLI (HAAdmin) and Web UI (HDFS-2677). This 
> essentially makes the names "active" and "standby" from ACTIVE_STATE and 
> STANDBY_STATE public interfaces, which IMO seems reasonable. Unlike the other 
> APIs we should be able to use the interface even when HA is not enabled (as 
> by default a non-HA NN is active).
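
A sketch of what such a query could look like. The enum, the class, and the method names are hypothetical and may not match the interface that was committed; the assumption that an HA NameNode starts in standby is also mine.

```java
// Hypothetical sketch; the real HAServiceProtocol may differ in names and signatures.
class ServiceStateSketch {
    enum HaState {
        ACTIVE("active"), STANDBY("standby");

        private final String publicName;

        HaState(String publicName) { this.publicName = publicName; }

        /** The public, user-visible name of the state. */
        @Override
        public String toString() { return publicName; }
    }

    private final boolean haEnabled;
    private HaState state;

    ServiceStateSketch(boolean haEnabled) {
        this.haEnabled = haEnabled;
        // A non-HA NameNode is always active; an HA one is assumed to start in standby.
        this.state = haEnabled ? HaState.STANDBY : HaState.ACTIVE;
    }

    void transitionToActive() {
        if (!haEnabled) {
            throw new UnsupportedOperationException("HA is not enabled");
        }
        state = HaState.ACTIVE;
    }

    /** Usable whether or not HA is enabled. */
    HaState getServiceState() { return state; }
}
```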





[jira] [Resolved] (HDFS-2553) BlockPoolSliceScanner spinning in loop

2011-12-17 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2553.
---

      Resolution: Fixed
   Fix Version/s: 0.23.1, 0.24.0
Target Version/s: 0.24.0, 0.23.1  (was: 0.23.1, 0.24.0)
    Hadoop Flags: Reviewed

Committed to trunk and 23. Thanks, Uma!

> BlockPoolSliceScanner spinning in loop
> --
>
> Key: HDFS-2553
> URL: https://issues.apache.org/jira/browse/HDFS-2553
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.23.0, 0.24.0
>Reporter: Todd Lipcon
>Assignee: Uma Maheswara Rao G
>Priority: Critical
> Fix For: 0.24.0, 0.23.1
>
> Attachments: CPUUsage.jpg, HDFS-2553.patch, HDFS-2553.patch
>
>
> Playing with trunk, I managed to get a DataNode in a situation where the 
> BlockPoolSliceScanner is spinning in the following loop, using 100% CPU:
>     at org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.isAlive(DataNode.java:820)
>     at org.apache.hadoop.hdfs.server.datanode.DataNode.isBPServiceAlive(DataNode.java:2962)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:625)
>     at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:614)
>     at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:95)





[jira] [Resolved] (HDFS-2684) Fix up some failing unit tests on HA branch

2011-12-16 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2684.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Thanks. I ran all the modified tests again to double-check and committed to 
branch.

> Fix up some failing unit tests on HA branch
> ---
>
> Key: HDFS-2684
> URL: https://issues.apache.org/jira/browse/HDFS-2684
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2684.txt, hdfs-2684.txt, hdfs-2684.txt
>
>
> To keep moving quickly on the HA branch, we've committed some stuff even 
> though some unit tests are failing. This JIRA is to take a pass through the 
> failing unit tests and get back to green (or close to it). If anything turns 
> out to be a major amount of work I'll file separate JIRAs.





[jira] [Resolved] (HDFS-2667) HA: Fix NN Active->Standby transition

2011-12-15 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2667.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to the branch, thanks Aaron.

> HA: Fix NN Active->Standby transition
> -
>
> Key: HDFS-2667
> URL: https://issues.apache.org/jira/browse/HDFS-2667
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2667.txt, hdfs-2667.txt, hdfs-2667.txt, 
> hdfs-2667.txt
>
>
> Currently when the active transitions to standby (eg for a manual failover) 
> the EditLogTailer starts tailing edits at the wrong transaction ID. This 
> causes it to double-apply a bunch of edits, which causes various problems. We 
> need to add a test which checks this state transition and fix any related 
> bugs.





[jira] [Resolved] (HDFS-2683) Authority-based lookup of proxy provider fails if path becomes canonicalized

2011-12-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2683.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Thanks, committed to branch

> Authority-based lookup of proxy provider fails if path becomes canonicalized
> 
>
> Key: HDFS-2683
> URL: https://issues.apache.org/jira/browse/HDFS-2683
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2683.txt, hdfs-2683.txt
>
>
> When testing MapReduce on top of an HA cluster we ran into the following bug: 
> some uses of HDFS paths go through a canonicalization step which ensures that 
> the authority component in the URI includes a port number. So our 
> hdfs://logical-nn-uri/foo path turned into hdfs://logical-nn-uri:8020/foo. 
> The code which looks up the failover proxy provider then failed to find the 
> associated config. We should only compare the hostname portion of the URI 
> when looking up proxy providers.
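
The host-only comparison can be sketched with java.net.URI, whose getHost() drops the port that canonicalization adds. The registry class and the provider name used with it are hypothetical; the real lookup reads Hadoop configuration instead of a map.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry; the real lookup reads Hadoop configuration keys instead.
class ProxyProviderLookupSketch {
    private final Map<String, String> providerByHost = new HashMap<>();

    void register(String logicalHost, String providerClassName) {
        providerByHost.put(logicalHost, providerClassName);
    }

    /** Looks up by host only, so a canonicalized port does not affect the result. */
    String lookup(URI nameNodeUri) {
        // getHost() yields "logical-nn-uri" with or without ":8020";
        // getAuthority() would include the port and miss the entry.
        String host = nameNodeUri.getHost();
        return host == null ? null : providerByHost.get(host);
    }
}
```

Registering "logical-nn-uri" once then resolves both hdfs://logical-nn-uri/foo and hdfs://logical-nn-uri:8020/foo to the same provider.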





[jira] [Resolved] (HDFS-2680) DFSClient should construct failover proxy with exponential backoff

2011-12-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2680.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Thanks, committed to HA branch.

> DFSClient should construct failover proxy with exponential backoff
> --
>
> Key: HDFS-2680
> URL: https://issues.apache.org/jira/browse/HDFS-2680
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, hdfs client
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2680.txt
>
>
> HADOOP-7896 adds facilities in common for exponential backoff when failing 
> back and forth between NNs. We need to use the new capability from DFSClient 
> when we construct the proxy.
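
For reference, a simplified sketch of an exponential backoff calculation. It is not the HADOOP-7896 implementation, which may add random jitter and differ in detail; the class, method, and parameter names are hypothetical.

```java
// Hypothetical helper; simplified, deterministic, and without jitter.
class FailoverBackoffSketch {
    /**
     * Delay before the next failover attempt:
     * baseMillis * 2^failovers, capped at maxMillis.
     */
    static long delayMillis(long baseMillis, long maxMillis, int failovers) {
        if (failovers < 0) {
            throw new IllegalArgumentException("failovers must be >= 0");
        }
        // Bound the shift so the product stays within a long for base delays
        // of practical size (anything below a few billion milliseconds).
        int shift = Math.min(failovers, 30);
        long delay = baseMillis * (1L << shift);
        return Math.min(delay, maxMillis);
    }
}
```

With a base of 500 ms and a cap of 15000 ms, successive failovers wait 500, 1000, 2000, 4000, 8000 ms and then stay at the cap.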





[jira] [Resolved] (HDFS-2671) HA: NN should throw StandbyException in response to RPCs in STANDBY state

2011-12-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2671.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

thanks, committed to branch

> HA: NN should throw StandbyException in response to RPCs in STANDBY state
> -
>
> Key: HDFS-2671
> URL: https://issues.apache.org/jira/browse/HDFS-2671
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2671.txt
>
>
> Currently the NN is throwing UnsupportedActionException when it is hit with 
> RPCs while in Standby. This is what the StandbyException class is meant for. 
> The wrong type is preventing client failover from working as designed.
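
Why the exception type matters can be shown with a toy client-side policy. Only the names StandbyException and UnsupportedActionException come from the report; the classes below are local stand-ins, and the policy is a hypothetical simplification of a real retry policy.

```java
// Hypothetical types; local stand-ins that share names with the Hadoop classes.
class FailoverDecisionSketch {
    enum Action { FAIL, FAILOVER_AND_RETRY }

    static class StandbyException extends Exception {
        StandbyException(String msg) { super(msg); }
    }

    static class UnsupportedActionException extends Exception {
        UnsupportedActionException(String msg) { super(msg); }
    }

    /** A client-side policy that fails over only on the exception type meant for it. */
    static Action shouldRetry(Exception e) {
        if (e instanceof StandbyException) {
            return Action.FAILOVER_AND_RETRY;
        }
        // Any other type looks like an ordinary failure to the client.
        return Action.FAIL;
    }
}
```

A standby that throws the second type is treated as a plain error, so the client never tries the other NN.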





[jira] [Resolved] (HDFS-2675) Reduce verbosity when double-closing edit logs

2011-12-14 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2675.
---

      Resolution: Fixed
   Fix Version/s: 0.23.1, 0.24.0
Target Version/s: 0.24.0, 0.23.1  (was: 0.23.1, 0.24.0)
    Hadoop Flags: Reviewed

Committed to 23 and trunk, thx

> Reduce verbosity when double-closing edit logs
> --
>
> Key: HDFS-2675
> URL: https://issues.apache.org/jira/browse/HDFS-2675
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: 0.24.0, 0.23.1
>
> Attachments: hdfs-2675.txt
>
>
> Currently the edit logs log at WARN level when they're double-closed. But 
> this happens in the normal flow of things, so we may as well reduce it to 
> DEBUG to reduce log spam in unit tests, etc.





[jira] [Resolved] (HDFS-2634) Standby needs to ingest latest edit logs before transitioning to active

2011-12-08 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2634.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

committed latest patch, thx for the reviews.

> Standby needs to ingest latest edit logs before transitioning to active
> 
>
> Key: HDFS-2634
> URL: https://issues.apache.org/jira/browse/HDFS-2634
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2634.txt, hdfs-2634.txt
>
>
> When the standby transitions to active state, it needs to _read_ the latest 
> edit logs before it reopens them for write access. Currently, the transition 
> calls {{stopStandbyServices}}, which stops the tailer, but doesn't read ahead 
> to the very end. This ends up leaving the shared edits dir in an inconsistent 
> state where we have overlapping transaction IDs.
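
The required ordering (read to the end, then open for write) can be sketched as below. The shared edits dir is modelled as a list of transaction IDs, and every name in the class is hypothetical.

```java
import java.util.List;

// Hypothetical model of a shared edits directory as a list of transaction IDs.
class CatchUpBeforeActiveSketch {
    private final List<Long> sharedEdits;
    private long lastAppliedTxId = 0;

    CatchUpBeforeActiveSketch(List<Long> sharedEdits) {
        this.sharedEdits = sharedEdits;
    }

    /** What the tailer does periodically: apply every edit not seen yet. */
    void tailEdits() {
        for (long txId : sharedEdits) {
            if (txId > lastAppliedTxId) {
                lastAppliedTxId = txId;
            }
        }
    }

    /** Returns the first transaction ID this node will write once active. */
    long transitionToActive() {
        tailEdits();                 // read to the very end before writing anything
        return lastAppliedTxId + 1;  // new transactions cannot overlap existing ones
    }
}
```

Without the final tailEdits() call, a standby that had applied transactions 1 to 3 while the log already held 4 and 5 would start writing at 4 and produce the overlap described above.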





[jira] [Resolved] (HDFS-2627) HA: determine DN's view of which NN is active based on heartbeat responses

2011-12-07 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2627.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to the branch, thanks for the review.

> HA: determine DN's view of which NN is active based on heartbeat responses
> --
>
> Key: HDFS-2627
> URL: https://issues.apache.org/jira/browse/HDFS-2627
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2627-v1.txt, hdfs-2627-v2.txt
>
>
> This is the first part of the design described in [this 
> comment|https://issues.apache.org/jira/browse/HDFS-1972?focusedCommentId=13160601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13160601]
>  in HDFS-1972. When the DNs start, they should not consider either of the NNs 
> in a block pool to be the active one. Rather, the NNs should include their HA 
> state as part of the heartbeat response to the DN, and the DN will believe 
> whichever NN claims to be active at a higher transaction ID.
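
The rule in the last sentence can be sketched on the datanode side as follows. The class is a hypothetical model, and clearing the active NN when it later reports standby is my assumption, not something the description states.

```java
// Hypothetical model of the datanode side; not the real BPOfferService.
class ActiveNameNodeTrackerSketch {
    private String activeNameNode = null; // no NN is trusted as active at startup
    private long activeClaimTxId = -1;

    /** Processes the HA state carried in one heartbeat response. */
    void onHeartbeatResponse(String nameNode, boolean claimsActive, long txId) {
        if (claimsActive) {
            // Believe the claim only if it is made at a higher transaction ID
            // than any claim accepted so far.
            if (txId > activeClaimTxId) {
                activeNameNode = nameNode;
                activeClaimTxId = txId;
            }
        } else if (nameNode.equals(activeNameNode)) {
            // Assumption: the NN we believed active now reports standby.
            activeNameNode = null;
        }
    }

    String getActiveNameNode() { return activeNameNode; }
}
```

A stale NN that still claims to be active at an older transaction ID is ignored, which is the property the fencing design relies on.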





[jira] [Resolved] (HDFS-2625) HA: TestDfsOverAvroRpc failing after introduction of HeartbeatResponse type

2011-12-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2625.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thx for the review.

> HA: TestDfsOverAvroRpc failing after introduction of HeartbeatResponse type
> ---
>
> Key: HDFS-2625
> URL: https://issues.apache.org/jira/browse/HDFS-2625
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2625.txt
>
>
> TestDfsOverAvroRpc is failing after HDFS-2616. The issue seems to be that we 
> sometimes fill in the "commands" list in HeartbeatResponse with null to 
> indicate no commands. This makes avro barf with a cryptic message about 
> unions of nulls of unions of arrays of nulls or something.
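
One common way to avoid handing a null list to a serializer is to normalize it to an empty array, sketched below. This is not necessarily how the issue was fixed; the class is a hypothetical stand-in and commands are modelled as strings.

```java
// Hypothetical stand-in for the response type; commands are modelled as strings.
class HeartbeatResponseSketch {
    private static final String[] NO_COMMANDS = new String[0];

    private final String[] commands;

    HeartbeatResponseSketch(String[] commands) {
        // Normalize "no commands" to an empty array so serializers never see null.
        this.commands = (commands == null) ? NO_COMMANDS : commands.clone();
    }

    /** Never returns null; an empty array means there are no commands. */
    String[] getCommands() { return commands.clone(); }
}
```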





[jira] [Resolved] (HDFS-2624) HA: ConfiguredFailoverProxyProvider doesn't correctly stop ProtocolTranslators

2011-12-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2624.
---

  Resolution: Fixed
Hadoop Flags: Reviewed

Committed to branch, thx for reviews.

> HA: ConfiguredFailoverProxyProvider doesn't correctly stop ProtocolTranslators
> --
>
> Key: HDFS-2624
> URL: https://issues.apache.org/jira/browse/HDFS-2624
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2624.txt
>
>
> In the tip of branch, the direct proxies to the NN have been replaced by 
> ProtocolTranslator implementations. But, the proxy provider still calls 
> RPC.stopProxy, which generates a warning and doesn't actually stop the proxy. 
> We need to check if the protocol implementation is a Closeable and call 
> close() instead in that case.
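
The check described above, as a small sketch. The RpcStopper interface is a hypothetical stand-in for the existing RPC.stopProxy call, and the returned string exists only to show which path was taken.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; stopRpcProxy stands in for the existing RPC.stopProxy call.
class ProxyCloserSketch {
    interface RpcStopper {
        void stopRpcProxy(Object proxy);
    }

    /** Returns "closed" or "stopped" to show which path was taken. */
    static String closeProxy(Object proxy, RpcStopper fallback) {
        if (proxy instanceof Closeable) {
            try {
                // Translator wrappers release their own underlying proxy.
                ((Closeable) proxy).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return "closed";
        }
        // Plain RPC proxies keep the old behaviour.
        fallback.stopRpcProxy(proxy);
        return "stopped";
    }
}
```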





[jira] [Resolved] (HDFS-2626) HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized

2011-12-04 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2626.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks for the review.

> HA: BPOfferService.verifyAndSetNamespaceInfo needs to be synchronized
> -
>
> Key: HDFS-2626
> URL: https://issues.apache.org/jira/browse/HDFS-2626
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2626.txt
>
>
> When starting an HA blockpool with both namenodes up, I often see an NPE, 
> referenced by one of the TODOs. The issue is the following interleaving:
> - first BPActor registers, and sets bpNSInfo in BPOfferService. It then 
> proceeds to initFsDataset which takes a little bit of time
> - second BPActor registers, and sees bpNSInfo is non-null, then proceeds to 
> heartbeat loop. Meanwhile BPActor 1 is still initting FSDataset
> - second BPActor gets an NPE on first heartbeat since fsdataset is still null.
> We just need to synchronize that function to fix the NPE.
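
A sketch of the synchronized version, with hypothetical field and helper names that only loosely follow the description. Holding the monitor across both steps means the second actor cannot see bpNSInfo set while the dataset is still null.

```java
// Hypothetical model; names only loosely follow the report above.
class NamespaceInfoInitSketch {
    private String bpNSInfo;    // set by the first actor to register
    private Object fsDataset;   // initialized by the first actor; slow in practice
    private int initCount = 0;

    /**
     * Synchronized so a second actor blocks until the first has finished
     * initializing the dataset, instead of racing ahead to heartbeat.
     */
    synchronized void verifyAndSetNamespaceInfo(String nsInfo) {
        if (bpNSInfo == null) {
            bpNSInfo = nsInfo;
            initFsDataset();
        } else if (!bpNSInfo.equals(nsInfo)) {
            throw new IllegalStateException("namespace mismatch: " + nsInfo);
        }
    }

    private void initFsDataset() {
        fsDataset = new Object();
        initCount++;
    }

    synchronized boolean isDatasetReady() { return fsDataset != null; }

    synchronized int getInitCount() { return initCount; }
}
```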





[jira] [Resolved] (HDFS-2616) Change DatanodeProtocol#sendHeartbeat to return HeartbeatResponse

2011-12-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2616.
---

Resolution: Fixed

Committed hdfs-2616-addendum.txt as r1209315.

> Change DatanodeProtocol#sendHeartbeat to return HeartbeatResponse
> -
>
> Key: HDFS-2616
> URL: https://issues.apache.org/jira/browse/HDFS-2616
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Suresh Srinivas
>Assignee: Suresh Srinivas
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-2616.txt, HDFS-2616.txt, HDFS-2616.txt, 
> HDFS-2616.txt, HDFS-2616.txt, HDFS-2616.txt, hdfs-2616-addendum.txt
>
>
> DatanodeProtocol#sendHeartbeat() returns DatanodeCommand[]. This jira 
> proposes changing it to to return HeartbeatResponse that has 
> DatanodeCommand[]. This allows adding other information that can be returned 
> by the namenode to the datanode, instead of having to only return 
> DatanodeCommand[]. For relevant discussion see HDFS-1972. 
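
A minimal Java sketch of the shape of this change follows. The types below (DatanodeCommandSketch, HeartbeatResponseSketch, DatanodeProtocolSketch) are simplified, hypothetical stand-ins rather than the real Hadoop classes, and the single String parameter is an assumption made to keep the example short.

```java
// Hypothetical, simplified stand-in for a datanode command.
class DatanodeCommandSketch {
    final String action;

    DatanodeCommandSketch(String action) {
        this.action = action;
    }
}

// Wrapper returned by the heartbeat call. New fields can be added here
// later without changing the signature of sendHeartbeat.
class HeartbeatResponseSketch {
    private final DatanodeCommandSketch[] commands;

    HeartbeatResponseSketch(DatanodeCommandSketch[] commands) {
        this.commands = commands.clone();
    }

    DatanodeCommandSketch[] getCommands() {
        return commands.clone();
    }
}

// Before the change the method would return DatanodeCommandSketch[] directly.
interface DatanodeProtocolSketch {
    HeartbeatResponseSketch sendHeartbeat(String datanodeId);
}
```

Callers that previously consumed the array would call getCommands() on the response instead.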





[jira] [Resolved] (HDFS-2623) HA: Add test case for hot standby capability

2011-12-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2623.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed the failing test case so we can all share it while working towards 
getting the hot-standby in working shape.

> HA: Add test case for hot standby capability
> 
>
> Key: HDFS-2623
> URL: https://issues.apache.org/jira/browse/HDFS-2623
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2623.txt
>
>
> Putting up a fairly simple test case I wrote that verifies that the standby 
> is kept "hot"





[jira] [Resolved] (HDFS-2612) HA: handle refreshNameNodes in federated HA clusters

2011-12-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2612.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to branch, thanks for the review.

> HA: handle refreshNameNodes in federated HA clusters
> 
>
> Key: HDFS-2612
> URL: https://issues.apache.org/jira/browse/HDFS-2612
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2612.txt, hdfs-2612.txt
>
>
> For expediency in HDFS-1971 we've commented out the {{refreshNameNodes}} 
> function temporarily on branch. Need to fix that code to handle refresh with 
> HA.





[jira] [Resolved] (HDFS-2622) HA: fix TestDFSUpgrade on HA branch

2011-12-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2622.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed, thanks Eli.

> HA: fix TestDFSUpgrade on HA branch
> ---
>
> Key: HDFS-2622
> URL: https://issues.apache.org/jira/browse/HDFS-2622
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2622.txt
>
>
> Since HDFS-1975, we now increment the generation stamp for each block 
> allocation. So the EXPECTED_TXID constant is wrong now in TestDFSUpgrade, 
> since we do twice as many txns while creating files in the first phase of the 
> test.





[jira] [Resolved] (HDFS-1975) HA: Support for sharing the namenode state from active to standby.

2011-11-30 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1975.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed to HA branch. Thanks Jitendra and Aaron for the original revs of the 
patch, and Eli for reviewing.

> HA: Support for sharing the namenode state from active to standby.
> --
>
> Key: HDFS-1975
> URL: https://issues.apache.org/jira/browse/HDFS-1975
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Reporter: Suresh Srinivas
>Assignee: Jitendra Nath Pandey
> Fix For: HA branch (HDFS-1623)
>
> Attachments: HDFS-1975-HA.2.patch, HDFS-1975-HA.patch, 
> HDFS-1975-HDFS-1623.patch, HDFS-1975-HDFS-1623.patch, 
> HDFS-1975-HDFS-1623.patch, HDFS-1975-HDFS-1623.patch, hdfs-1975.txt, 
> hdfs-1975.txt
>
>
> To enable hot standby namenode, the standby node must have current 
> information for - namenode state (image + edits) and block location 
> information. This jira addresses keeping the namenode state current in the 
> standby node. To do this, the proposed solution in this jira is to use a 
> shared storage to store the namenode state. 
> Note one could also build an alternative solution by augmenting the backup 
> node. A separate jira could explore this.





[jira] [Resolved] (HDFS-2591) HA: MiniDFSCluster support to mix and match federation with HA

2011-11-29 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2591.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

Committed v3 to HA branch

> HA: MiniDFSCluster support to mix and match federation with HA
> --
>
> Key: HDFS-2591
> URL: https://issues.apache.org/jira/browse/HDFS-2591
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, test
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2591-v2.txt, hdfs-2591-v3.txt, hdfs-2591.txt
>
>
> Right now the MiniDFS builder is somewhat inflexible - it just takes a 
> "numNameNodes" parameter, which is used to specify federated nameservices. In 
> order to add HA support, we need to be able to be more specific when 
> configuring the NNs, e.g. to test the case where there is one nameservice 
> that is HA and another which is not.





[jira] [Resolved] (HDFS-2582) Scope dfs.ha.namenodes config by nameservice

2011-11-28 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2582.
---

  Resolution: Fixed
Hadoop Flags: Reviewed

Committed to HDFS-1623 branch. Thanks for the reviews, Eli. If there are 
further review comments I'll be happy to address post-commit.

> Scope dfs.ha.namenodes config by nameservice
> 
>
> Key: HDFS-2582
> URL: https://issues.apache.org/jira/browse/HDFS-2582
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: data-node, name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2582-v2.txt, hdfs-2582-v3.txt, hdfs-2582-v4.txt, 
> hdfs-2582.txt
>
>
> HDFS-2231 started the process of adding configuration for HA, but one piece 
> is missing. The current state of the configuration is, I believe:
> {{dfs.ha.namenodes}} - a list of identifiers for HA namenodes
> {{dfs.federation.nameservices}} - a list of federated nameservices
> {{dfs.namenode.rpc-address[.nameservice-id][.namenode-id]}} - some specific 
> config for the given namenode. If HA or federation is disabled, the extra 
> components can be elided for backwards compatibility.
> The issue here is that there is no easy way to discern which NN is paired 
> with which other NN. Additionally, adding a new federated nameservice to a 
> config will require changes to {{dfs.ha.namenodes}} which makes templating 
> harder. It would be simpler to change {{dfs.ha.namenodes}} to be 
> nameservice-scoped: {{dfs.ha.namenodes.}}.
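
The key layout described above can be sketched in Java as follows. This is an illustration only: HaConfigKeySketch and its methods are hypothetical, the configuration is modeled as a plain Map rather than Hadoop's Configuration class, and the sketch assumes the nameservice-scoped key is formed by appending the nameservice id to dfs.ha.namenodes, which is one reading of the proposal rather than something the text spells out. The ids "ns1", "nn1" and "nn2" in the usage below are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Hadoop.
class HaConfigKeySketch {
    // Appends the optional nameservice and namenode ids to a base key.
    // Null or empty ids are skipped, mirroring the "can be elided" rule.
    static String addSuffixes(String baseKey, String nameserviceId, String namenodeId) {
        StringBuilder key = new StringBuilder(baseKey);
        if (nameserviceId != null && !nameserviceId.isEmpty()) {
            key.append('.').append(nameserviceId);
        }
        if (namenodeId != null && !namenodeId.isEmpty()) {
            key.append('.').append(namenodeId);
        }
        return key.toString();
    }

    // Returns namenode id -> RPC address for one nameservice, reading the
    // namenode list from the (assumed) nameservice-scoped key.
    static Map<String, String> rpcAddresses(Map<String, String> conf, String nameserviceId) {
        Map<String, String> result = new LinkedHashMap<>();
        String nnList = conf.get(addSuffixes("dfs.ha.namenodes", nameserviceId, null));
        if (nnList == null) {
            return result;
        }
        for (String raw : nnList.split(",")) {
            String nnId = raw.trim();
            String key = addSuffixes("dfs.namenode.rpc-address", nameserviceId, nnId);
            result.put(nnId, conf.get(key));
        }
        return result;
    }
}
```

With the list scoped per nameservice, the pairing of namenodes is read directly from one key, and adding another nameservice only adds new keys instead of editing a shared list.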





[jira] [Resolved] (HDFS-2577) HA: NN fails to start since it tries to start secret manager in safemode

2011-11-23 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2577.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

> HA: NN fails to start since it tries to start secret manager in safemode
> 
>
> Key: HDFS-2577
> URL: https://issues.apache.org/jira/browse/HDFS-2577
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: name-node
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2577.txt
>
>
> After HDFS-2301, the NN fails to start with the following:
> Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot 
> log master key update in safe mode. Name node is in safe mode.
> The reported blocks 0 needs additional 5 blocks to reach the threshold 1. 
> of total blocks 4. Safe mode will be turned off automatically.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.logUpdateMasterKey(FSNamesystem.java:4259)
> at 
> org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.logUpdateMasterKey(DelegationTokenSecretManager.java:285)
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.updateCurrentKey(AbstractDelegationTokenSecretManager.java:143)
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.startThreads(AbstractDelegationTokenSecretManager.java:98)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startSecretManager(FSNamesystem.java:386)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:440)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:937)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:57)





[jira] [Resolved] (HDFS-2080) Speed up DFS read path by lessening checksum overhead

2011-11-03 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2080.
---

   Resolution: Fixed
Fix Version/s: 0.23.1
 Hadoop Flags: Reviewed

All of the subtasks have now been completed and committed for 0.23.1 and 0.24. 
Thanks to those that helped, especially Nathan, Kihwal, Eli, and Nicholas for 
the many reviews.

> Speed up DFS read path by lessening checksum overhead
> -
>
> Key: HDFS-2080
> URL: https://issues.apache.org/jira/browse/HDFS-2080
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs client, performance
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Fix For: 0.24.0, 0.23.1
>
> Attachments: hdfs-2080.txt, hdfs-2080.txt
>
>
> I've developed a series of patches that speeds up the HDFS read path by a 
> factor of about 2.5x (~300M/sec to ~800M/sec for localhost reading from 
> buffer cache) and also will make it easier to allow for advanced users (eg 
> hbase) to skip a buffer copy. 





[jira] [Resolved] (HDFS-2379) 0.20: Allow block reports to proceed without holding FSDataset lock

2011-11-01 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2379.
---

   Resolution: Fixed
Fix Version/s: 0.20.206.0
 Hadoop Flags: Reviewed

Committed to 0.20-security. Thanks for the reviews, Suresh.

> 0.20: Allow block reports to proceed without holding FSDataset lock
> ---
>
> Key: HDFS-2379
> URL: https://issues.apache.org/jira/browse/HDFS-2379
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.206.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.20.206.0
>
> Attachments: hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt, 
> hdfs-2379.txt, hdfs-2379.txt, hdfs-2379.txt
>
>
> As disks are getting larger and more plentiful, we're seeing DNs with 
> multiple millions of blocks on a single machine. When page cache space is 
> tight, block reports can take multiple minutes to generate. Currently, during 
> the scanning of the data directories to generate a report, the FSVolumeSet 
> lock is held. This causes writes and reads to block, time out, etc., causing 
> big problems especially for clients like HBase.
> This JIRA is to explore some of the ideas originally discussed in HADOOP-4584 
> for the 0.20.20x series.
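
One common way to avoid holding a lock during a slow scan is to copy the state under the lock and scan the copy afterwards. The Java sketch below illustrates that pattern only; it is not the committed patch, and VolumeSketch, addDir, generateReport and scan are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "snapshot under lock, scan outside lock".
class VolumeSketch {
    private final Object lock = new Object();
    private final List<String> blockDirs = new ArrayList<>();

    // Writers take the lock only for the short list update.
    void addDir(String dir) {
        synchronized (lock) {
            blockDirs.add(dir);
        }
    }

    // Holds the lock just long enough to copy the directory list.
    // The slow scan then runs on the copy with no lock held, so
    // concurrent reads and writes are not blocked by the report.
    List<String> generateReport() {
        List<String> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(blockDirs);
        }
        List<String> report = new ArrayList<>();
        for (String dir : snapshot) {
            report.add(scan(dir));
        }
        return report;
    }

    // Placeholder for the slow on-disk directory scan.
    private String scan(String dir) {
        return "blocks-in:" + dir;
    }
}
```

The trade-off is that a report built this way can be slightly stale, since blocks may be added or removed while the scan runs, so the result has to be reconciled with any changes made in the meantime.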





[jira] [Resolved] (HDFS-2523) NameNode needs to add HAServiceProtocol to its RPC Server

2011-10-31 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-2523.
---

   Resolution: Fixed
Fix Version/s: HA branch (HDFS-1623)
 Hadoop Flags: Reviewed

committed to branch, thanks for quick review atm.

> NameNode needs to add HAServiceProtocol to its RPC Server
> -
>
> Key: HDFS-2523
> URL: https://issues.apache.org/jira/browse/HDFS-2523
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Affects Versions: HA branch (HDFS-1623)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2523.txt
>
>
> When the new RPC protocol stuff was merged to 0.23, we didn't add 
> HAServiceProtocol to NameNodeRpcServer. Trivial JIRA to fix this. Also fixing 
> a small bug where, if startup failed, HA code would trigger an NPE.





[jira] [Resolved] (HDFS-1266) Missing license headers in branch-20-append

2011-10-20 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1266.
---

Resolution: Invalid

branch-0.20-append is abandoned. License headers in 0.20-security should be OK.

> Missing license headers in branch-20-append
> ---
>
> Key: HDFS-1266
> URL: https://issues.apache.org/jira/browse/HDFS-1266
> Project: Hadoop HDFS
>  Issue Type: Task
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Trivial
> Fix For: 0.20-append
>
>
> We appear to have some files without license headers, we should do a quick 
> pass through and fix them.





[jira] [Resolved] (HDFS-1896) Additional QA tasks for Edit Log branch

2011-10-03 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1896.
---

Resolution: Not A Problem

The items needing manual testing have been tested, and the branch is long since 
merged. Marking resolved.

> Additional QA tasks for Edit Log branch
> ---
>
> Key: HDFS-1896
> URL: https://issues.apache.org/jira/browse/HDFS-1896
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: Edit log branch (HDFS-1073)
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>
> As we close out tasks in the HDFS-1073 branch, there are a few places where 
> I've noticed that we lack some test coverage. Creating this ticket just as a 
> place to jot down some notes on things that we ought to make sure are tested, 
> preferably by automated (unit) tests.





[jira] [Resolved] (HDFS-1218) 20 append: Blocks recovered on startup should be treated with lower priority during block synchronization

2011-10-03 Thread Todd Lipcon (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved HDFS-1218.
---

   Resolution: Fixed
Fix Version/s: (was: 0.20-append)
 Hadoop Flags: Reviewed

Suresh committed this to 0.20.205

> 20 append: Blocks recovered on startup should be treated with lower priority 
> during block synchronization
> -
>
> Key: HDFS-1218
> URL: https://issues.apache.org/jira/browse/HDFS-1218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20-append
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 0.20.205.0
>
> Attachments: HDFS-1218.20s.2.patch, hdfs-1281.txt
>
>
> When a datanode experiences power loss, it can come back up with truncated 
> replicas (due to local FS journal replay). Those replicas should not be 
> allowed to truncate the block during block synchronization if there are other 
> replicas from DNs that have _not_ restarted.
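
The rule described above can be sketched in Java as a filter applied before the synchronization length is chosen. This is an illustration under stated assumptions, not the committed patch: ReplicaSketch and BlockSyncSketch are hypothetical, and the sketch assumes synchronization truncates the block to the shortest participating replica.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one replica reported for block synchronization.
class ReplicaSketch {
    final String datanode;
    final long length;
    final boolean recoveredOnStartup; // true if the DN restarted and recovered it

    ReplicaSketch(String datanode, long length, boolean recoveredOnStartup) {
        this.datanode = datanode;
        this.length = length;
        this.recoveredOnStartup = recoveredOnStartup;
    }
}

class BlockSyncSketch {
    // Replicas recovered on startup take part only when no replica from a
    // datanode that has not restarted is available.
    static List<ReplicaSketch> candidates(List<ReplicaSketch> all) {
        List<ReplicaSketch> notRestarted = new ArrayList<>();
        for (ReplicaSketch r : all) {
            if (!r.recoveredOnStartup) {
                notRestarted.add(r);
            }
        }
        return notRestarted.isEmpty() ? new ArrayList<>(all) : notRestarted;
    }

    // Assumed policy: synchronize to the shortest participating replica.
    static long syncLength(List<ReplicaSketch> all) {
        if (all.isEmpty()) {
            throw new IllegalArgumentException("no replicas to synchronize");
        }
        long min = Long.MAX_VALUE;
        for (ReplicaSketch r : candidates(all)) {
            min = Math.min(min, r.length);
        }
        return min;
    }
}
```

With this filter, a truncated replica from a restarted datanode cannot shorten the block while a longer replica from a datanode that stayed up is present.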




