[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-24 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183724#comment-14183724
 ] 

Allen Wittenauer commented on HDFS-7231:


bq. (unable to continue)

This is the part I've also been trying to get more information on as well. :D  
From what I've been told, the first few directories were converted to the new 
format but not all of them.  This seems to imply that the DN process was 
brought down at some point in time.  

bq. This as you know is a recipe for disaster 

I can neither confirm or deny this statement. ;)

bq. Before you go on to 2.4.1, if you do finalize of upgrade what happens?

It appears to work as expected.  So this whole condition appears to trigger 
with non-finalized systems.  

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Allen Wittenauer
>Priority: Blocker
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-24 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183013#comment-14183013
 ] 

Suresh Srinivas commented on HDFS-7231:
---

[~aw], can you please respond to the comments?

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Allen Wittenauer
>Priority: Blocker
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-22 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180312#comment-14180312
 ] 

Suresh Srinivas commented on HDFS-7231:
---

Allen, I just rewrote the steps with additional details to clarify:
# Upgrade 2.0.5 cluster to 2.2
# Do not -finalizeUpgrade
# Install 2.4.1 binaries on the cluster machines. Start the datanodes on 2.4.1.
# Start namenode -upgrade option.
# Namenode start fails because 2.0.5 to 2.2 upgrade is still in progress
# Leave 2.4.1 DNs running
# Install binaries on NN to 2.2
# Start NN on 2.2 with no upgrade related options

So far things are clear. Then you go on to say, the following:
bq. DNs now do a partial roll-forward, rendering them unable to continue
What do you mean by this?

bq. admins manually repair version files on those broken directories
This is as you know is a recipe for disaster.

Let me ask you a question. Before you go on to 2.4.1, if you do finalize of 
upgrade what happens?

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Allen Wittenauer
>Priority: Blocker
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-13 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169634#comment-14169634
 ] 

Allen Wittenauer commented on HDFS-7231:


Verified the same crappy experience exists in the 2.6 branch.  Marking this as 
a blocker since this will be the last release for everyone's precious JDK 1.6 
support.

I'd love to hear some options from the peanut gallery on how to improve this so 
users aren't left with a potential time bomb on their hands.  Alias -upgrade to 
-rollingupgrade?  Bring nn -finalize back?  Auto-finalize?

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Allen Wittenauer
>Priority: Blocker
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167620#comment-14167620
 ] 

Allen Wittenauer commented on HDFS-7231:


Argh: another point:  with namenode -finalize being taken away, this scenario 
is pretty much unsolvable.

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Allen Wittenauer
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167545#comment-14167545
 ] 

Allen Wittenauer commented on HDFS-7231:


One other thing about rolling while there is a finalize dir there:  it's worth 
pointing out that -rollback becomes a "please shoot me now" command. At least, 
I don't want to think about the consequences

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Allen Wittenauer
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7231) rollingupgrade needs some guard rails

2014-10-10 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167535#comment-14167535
 ] 

Allen Wittenauer commented on HDFS-7231:


We had an admin perform an upgrade that went south due to rolling upgrade 
interfering with the previous method of upgrading.  The series of events as 
given to me (I was out of town, so didn't witness firsthand) was:

# Build a 2.0.5/2.1.0/etc cluster where rolling upgrade is not an option.
# Upgrade it to 2.2
# Do not \-finalizeUpgrade
# Upgrade binaries to 2.4.1 
# Run namenode \-upgrade
# watch it fail.
# Leave 2.4.1 DNs running
# Downgrade binaries on NN to 2.2
# Start NN
# DNs now do a partial roll-forward, rendering them unable to continue
# admins manually repair version files on those broken directories

...

There were clearly a few mistakes made in the above procedure, most of which 
were driven by a belief that the NN *had* to be out of safemode to do a 
finalize.  So they attempted to do that, which of course led to other things 
going wrong. I'm not sure what triggered the DNs to basically render their 
VERSION files broken.  I haven't been able to duplicate it, but I've only tried 
on a much smaller scale so that might be related.  I also suspect there was an 
attempt to rollback the binaries on the DNs and I haven't tried that yet 
either. 

My own testing of this scenario has given me a few insights.

* DNs should not start rolling while there is a directory there to finalize.  

If you do what appears (to me, at least) to get the system back up the proper 
thing here: (bring down the namenode, namenode \-finalize, bring up namenode), 
hdfs dfsadmin \-finalizeUpgrade afterward doesn't appear to send the message to 
the DNs to clean up their space, requiring manual intervention. 

* Suprise! DNs exit if the 'proper' NN is brought up with \-upgrade.

Doing the 2.2 NN \-finalize and then bringing 2.4.1 NN up with \-upgrade, 
results in the 2.4.1 DNs all coming down.  This was a bit of a surprise given 
they were perfectly happy staying up with a broken 2.2 NN in \-upgrade mode 
before.

I'm sure there are other things here, but these are the two big ones that stuck 
out. I'm doing some other manual testing using the above procedures with a few 
other changes to see what else sticks out.

> rollingupgrade needs some guard rails
> -
>
> Key: HDFS-7231
> URL: https://issues.apache.org/jira/browse/HDFS-7231
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Allen Wittenauer
>
> See first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)