[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219349#comment-13219349
]

stack commented on HBASE-5270:
--

@Prakash

Presume we pass list from splitLogAfterStartup to joinClusters as you suggest
and presume list of servers included the server that had been hosting .META.

Allow that during or just after splitLogAfterStartup, .META. server 'crashes'
-- it becomes unresponsive. Also allow that somehow, just before it hung up,
during a long running log split, .META. took on a couple of edits saying
regions A, B, and C had split.

In assignRootAndMeta, we'll notice the unresponsiveness, force the expiration
of the server that was carrying .META. (this will queue a ServerShutdownHandler
but will not wait on its completion), and we'll then reassign .META. It's
very likely that .META. will go to one of the other 'good' servers. It's also
likely that the SSH will not have completed its processing before this assign
happens. Thus, on deploy, the .META. will be missing the above A, B, and C
split edits.

Handle potential data loss due to concurrent processing of processFaileOver
and ServerShutdownHandler
-

Key: HBASE-5270
URL: https://issues.apache.org/jira/browse/HBASE-5270
Project: HBase
Issue Type: Sub-task
Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
Fix For: 0.92.1, 0.94.0

Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch,
5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch,
5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch,
hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch,
hbase-5270v7.patch, hbase-5270v8.patch, sampletest.txt

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219354#comment-13219354
]

stack commented on HBASE-5270:
--

@Chunhui You have a new method splitLogIfOnline which will split the log if the
server was online. Why do you not expire the server? (You remove the
expireIfOnline method).

Now we have this initializing state, do you think we should also stop the
processing of expired servers during this startup phase and instead queue them
up for processing after the master is up? Could do that in another issue maybe
since this issue has been going on too long and your patch is at least an
improvement on what we currently have (This startup sequence needs a big
refactor IMO -- it is way too complicated figuring the sequence in which stuff
runs).

Are there still holes? For example, say the .META. server crashes AFTER we've
verified it up in assignRootAndMeta but before we get to do a scan of .META. to
rebuild user regions list. Could .META. be assigned w/o log splitting
finishing? (I don't think so... .META. would be offline until the
servershutdown handler ran and it would first split logs).

Good stuff.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-29 Thread Prakash Khemani (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219413#comment-13219413
]

Prakash Khemani commented on HBASE-5270:

@Stack

If we presume that the list of servers that joinClusters received contains
the server hosting .META., then the next step, that you outlined in your
scenario, cannot be allowed. If we are splitting logs for .META. then we
have determined that meta-server was not running and therefore it cannot
be taking edits. The problem you are outlining is probably still there but
the scenario has to be refined.

Anyway my point was - at startup master should determine once what servers
are up and what are not. This should include whether ROOT and META are
assigned or not. And then it should initialize everything based on that
knowledge which must not change during initialization. Anything that
changes during initialization should be taken care of by the normal
Server-handlers. But I have to admit, I don't understand the assignment
complexities very well ... I will read up some more.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-29 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219428#comment-13219428
]

stack commented on HBASE-5270:
--

@Prakash

bq. then the next step, that you outlined in your scenario, cannot be
allowed

How should we do this boss?

bq. The problem you are outlining is probably still there but the scenario has
to be refined.

What should I add? If we allow that the split could take a long time, it's
possible that on entry to the log splitting the server was good but by the end
it could have gone AWOL.

bq. And then it should initialize everything based on that knowledge which must
not change during initialization.

I think the root issue is that it needs to scan .META. and -ROOT- as part of
the startup; they need to be assigned and up w/ all edits in place. That's
what's proving to be a little tough to ensure.

(Thanks for the review P).

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-29 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219727#comment-13219727
]

chunhui shen commented on HBASE-5270:
-

@stack
I think we could introduce initializing state first for HBASE-5454.

bq.Are there still holes? For example, say the .META. server crashes AFTER
we've verified it up in assignRootAndMeta but before we get to do a scan of
.META. to rebuild user regions list. Could .META. be assigned w/o log splitting
finishing?

Yes, it's another hole, but it's easy to solve: we could stop the processing
of expired servers until the master has finished assigning ROOT and META
(rather than until it is fully initialized).

I have thought about this issue for a long time, and I think preventing
processing of SSH is a clear and simple solution; otherwise we would have to
consider many cases where the meta server dies at different times during
master initialization.
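
For illustration only, the idea of holding back SSH processing could be sketched as below. This is not code from any of the patches: the class `DeferredExpirationSketch` and its methods are hypothetical, plain strings stand in for server names, and `processShutdown` is a stand-in for the log split and reassignment a real handler would do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not HBase's ServerManager API.
class DeferredExpirationSketch {
  private boolean initializing = true;
  private final List<String> deferred = new ArrayList<>();
  private final List<String> processed = new ArrayList<>();

  /** Called when a server is found dead, for example via ZooKeeper. */
  synchronized void expireServer(String server) {
    if (initializing) {
      // Do not run shutdown handling yet; remember the server for later.
      deferred.add(server);
    } else {
      processShutdown(server);
    }
  }

  /** Called once ROOT and META are assigned; drains deferred servers in order. */
  synchronized void finishInitialization() {
    initializing = false;
    for (String server : deferred) {
      processShutdown(server);
    }
    deferred.clear();
  }

  // Stand-in for the log split and region reassignment of a real handler.
  private void processShutdown(String server) {
    processed.add(server);
  }

  /** Returns a copy of the servers whose shutdown has been processed so far. */
  synchronized List<String> processedServers() {
    return new ArrayList<>(processed);
  }
}
```

With this shape, a meta server that dies at any point during initialization is handled by one code path after the master is up, instead of case by case.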

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219787#comment-13219787
]

ramkrishna.s.vasudevan commented on HBASE-5270:
---

bq.I think preventing processing of SSH is a clear and simple solution

I too think it is a good and simple idea.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219858#comment-13219858
]

stack commented on HBASE-5270:
--

bq. I have thought about this issue for a long time, and I think preventing
processing of SSH is a clear and simple solution; otherwise we would have to
consider many cases where the meta server dies at different times during
master initialization.

Would we do the above as part of another issue Chunhui?

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219861#comment-13219861
]

stack commented on HBASE-5270:
--

Also, do you need to make a new version of this patch now hbase-5454 has gone
in? Thanks.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-28 Thread Prakash Khemani (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218407#comment-13218407
]

Prakash Khemani commented on HBASE-5270:

Assuming that the master uses the saved region-server list in joinCluster,
can you then please outline the scenario where problems can still happen?
There is some handling of META and ROOT not being available in
ServerShutdownHandler and I am wondering why that is not sufficient.

On 2/27/12 11:17 PM, chunhui shen (Commented) (JIRA) j...@apache.org

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-28 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218820#comment-13218820
]

chunhui shen commented on HBASE-5270:
-

@Prakash
In a live cluster, do the following steps:
1. Kill the master.
2. Start the master; the master is initializing.
3. The master completes splitLog.
4. Kill the META server.
5. The master starts assigning ROOT and META.
6. Now META region data will be lost, because we can't ensure that the log of
the META server killed in step 4 has been split.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread stack (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217347#comment-13217347
]

stack commented on HBASE-5270:
--

Do you think we should check to see if we have already split this server's log
for the case where the server was carrying root and meta?

{code}
+ splitLogIfOnline(currentMetaServer);
{code}

Or will the above call become a noop because we just split it before we
assigned root?

Is this a 'safe mode' or is it the master 'initializing'? I think 'safe mode'
makes folks think of hdfs. It is a little similar in that master is trying to
make sense of the cluster but initializing might be a better name for this
state.

BTW, I think this is an improvement over previous versions of this patch. It's
easier to reason about. Good stuff Chunhui.

Make a method and put this duplicated code into it and call it from the two
places it's repeated:

{code}
+if (!deadNotExpiredServers.isEmpty()) {
+  for (final ServerName server : deadNotExpiredServers) {
+    LOG.debug("Removing dead but not expired server: " + server
+      + " from eligible server pool.");
+    servers.remove(server);
+  }
+}
{code}
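
One way the extracted method could look is sketched below. This is only an illustration, not the patch's code: the class name `DeadServerFilter` and the method name `removeDeadNotExpiredServers` are hypothetical, and plain `String` names stand in for HBase's `ServerName`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class DeadServerFilter {
  private DeadServerFilter() {}

  /**
   * Removes servers that are known dead but not yet expired from the pool
   * of servers eligible for assignment. Returns the servers actually removed.
   */
  static List<String> removeDeadNotExpiredServers(
      List<String> servers, Collection<String> deadNotExpiredServers) {
    List<String> removed = new ArrayList<>();
    for (String server : deadNotExpiredServers) {
      // List.remove(Object) returns true only if the element was present.
      if (servers.remove(server)) {
        removed.add(server);
      }
    }
    return removed;
  }
}
```

Both call sites would then shrink to a single call, and the debug logging could live inside the helper.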

Fix this bit of javadoc '... but not are expired now.'

You don't need this:

{code}
+ * Copyright 2007 The Apache Software Foundation
{code}

I think MasterInSafeModeException becomes MasterInitializingException?

Good stuff Chunhui

Regarding Jimmy's comment:

bq. Instead of introducing safe mode, can we add something to the RPC server
and don't allow it to serve traffic before the actual server is ready, for
example, fully initialized?

We have a ServerNotRunningYetException down in the ipc. It's thrown by
HBaseServer if RPC has not started yet. It seems a little related to this
MasterInitializing. We also have a PleaseHoldException. Perhaps the Master
should throw this instead of the MasterInitializing? We'd throw a
PleaseHoldException and the message would detail that the master is
initializing?

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread Jimmy Xiang (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217364#comment-13217364
]

Jimmy Xiang commented on HBASE-5270:

@Stack, I agree. I think we should reuse the existing exception if we can.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217377#comment-13217377
]

Zhihong Yu commented on HBASE-5270:
---

If we reuse PleaseHoldException, the javadoc for that exception should be
modified:
{code}
* This exception is thrown by the master when a region server was shut down
* and restarted so fast that the master still hasn't processed the server
* shutdown of the first instance.
{code}

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread stack (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217732#comment-13217732
]

stack commented on HBASE-5270:
--

@Ted Yes. We can keep the prefix and change the rest of the sentence to be
more generic. If Chunhui reuses it here, it'll be an exception the master
throws when it wants the client to come back in a while.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217863#comment-13217863
]

chunhui shen commented on HBASE-5270:
-

bq.Do you think we should check to see if we have already split this server's
log for the case where the server was carrying root and meta?
I think it's not a problem: the server's log dir is deleted after the split, so
the second split will do nothing. Of course, we could do the check to avoid
an unnecessary call to splitLogIfOnline.

Thanks stack for the review, I will make a new patch following the above advice.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217874#comment-13217874
]

chunhui shen commented on HBASE-5270:
-

The v7 patch uses PleaseHoldException.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217876#comment-13217876
]

Zhihong Yu commented on HBASE-5270:
---

@Chunhui:
Can you upload patch v7 onto review board ?

A test run through Hadoop QA would be helpful as well.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217888#comment-13217888
]

Zhihong Yu commented on HBASE-5270:
---

{code}
+ * of the first instance, or when master is initializing and client call admin's
+ * operations
{code}
should read:
{code}
+ * of the first instance, or when master is initializing and client calls admin
+ * operations
{code}
Please fill javadoc for the following method:
{code}
+ public RegionServerTracker createRegionServerTracker(final ZooKeeperWatcher zkw,
{code}
Is the above used for testing ? I don't see it called in other classes.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217891#comment-13217891
]

Zhihong Yu commented on HBASE-5270:
---

Please remove space after the dot:
{code}
+mfs. splitLogAfterStartup(sm.getOnlineServers().keySet());
{code}
I see the following code in several methods:
{code}
+if (isInitializing()) {
+ throw new PleaseHoldException("Master is initializing");
+}
{code}
Does creating a new method wrapping the above code make sense ?
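
A minimal sketch of such a wrapper follows. It is an assumption about shape only, not the patch's code: `MasterSketch`, `checkInitialized`, `finishInitialization` and `createTable` are hypothetical names, and the small `PleaseHoldException` class here is a local stand-in so the example compiles by itself.

```java
// Local stand-in for HBase's PleaseHoldException so the sketch compiles alone.
class PleaseHoldException extends java.io.IOException {
  PleaseHoldException(String message) {
    super(message);
  }
}

// Hypothetical, stripped-down master that holds only the initializing flag.
class MasterSketch {
  private volatile boolean initializing = true;

  boolean isInitializing() {
    return initializing;
  }

  void finishInitialization() {
    initializing = false;
  }

  /** Single home for the check that was repeated in several methods. */
  void checkInitialized() throws PleaseHoldException {
    if (isInitializing()) {
      throw new PleaseHoldException("Master is initializing");
    }
  }

  // Example admin operation that refuses to run during initialization.
  String createTable(String name) throws PleaseHoldException {
    checkInitialized();
    return "created " + name;
  }
}
```

Each admin operation then starts with one `checkInitialized()` call, and the message text is defined in a single place.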

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217915#comment-13217915
]

chunhui shen commented on HBASE-5270:
-

@Ted
I have made the above changes in patch v8.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread Prakash Khemani (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217968#comment-13217968
]

Prakash Khemani commented on HBASE-5270:

(I haven't read through the comments carefully and I am sorry for the noise if
I am way off the mark)

The problem as I see it is that the Master's understanding of which region
servers are online changes from the time that it calls splitLogAfterStartup()
to the time it calls rebuildUserRegions() in joinCluster().

I feel that it might be a lot simpler if the master saves the list of
region-servers that it had given to splitLogAfterStartup(), and later uses the
same list for rebuilding user regions. That should fix this issue, shouldn't it?
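
To make the idea concrete, here is a rough sketch. All names in it (`StartupSnapshotSketch`, `snapshotOnlineServers`, `serversForRebuild`) are made up for illustration and plain strings stand in for server names; it only shows taking the server list once and reusing it, not how the master actually tracks servers.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the master's real bookkeeping.
class StartupSnapshotSketch {
  // Live view that keeps changing as servers come and go.
  private final Set<String> liveServers = new HashSet<>();
  // Frozen view taken once at startup.
  private Set<String> onlineAtStartup = Collections.emptySet();

  void serverUp(String server) {
    liveServers.add(server);
  }

  void serverDown(String server) {
    liveServers.remove(server);
  }

  /** Takes the snapshot once; this is what log splitting would be given. */
  Set<String> snapshotOnlineServers() {
    onlineAtStartup = Collections.unmodifiableSet(new HashSet<>(liveServers));
    return onlineAtStartup;
  }

  /** The later rebuild phase reuses the same snapshot, not the live view. */
  Set<String> serversForRebuild() {
    return onlineAtStartup;
  }
}
```

A server that dies after the snapshot is still in the rebuild list, so its death would have to be picked up by the normal server shutdown handling.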

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217973#comment-13217973
]

chunhui shen commented on HBASE-5270:
-

bq.I feel that it might be a lot simpler if the master saves the list of
region-servers that it had given to splitLogAfterStartup(), and later uses the
same list for rebuilding user regions. That should fix this issue, shouldn't it?

Yes, that's right in most cases. But if the server carrying ROOT or META dies
while the master is initializing, that's another problem. So we should fix
both cases.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-26 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216989#comment-13216989
]

chunhui shen commented on HBASE-5270:
-

bq. don't allow it to serve traffic before the actual server is ready.
I think it's inconvenient. For example, before the master is fully
initialized, we need to allow RegionserverReport but not admin operations.
Also, server death is detected through ZK, not RPC.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-26 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216994#comment-13216994
]

chunhui shen commented on HBASE-5270:
-

@stack
Could you take a look at introducing safe mode to delay SSH until after the
master is initialized?
I think this solution is easier for this issue.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-24 Thread Jimmy Xiang (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216114#comment-13216114
]

Jimmy Xiang commented on HBASE-5270:

Instead of introducing safe mode, can we add something to the RPC server so
that it does not serve traffic before the actual server is ready, for example,
fully initialized?

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214475#comment-13214475
]

chunhui shen commented on HBASE-5270:
-

@Ted
I have submitted patch v5.

bq. So a server could be in both deadNotExpiredServers and deadservers ? I
don't see return statement in the if block.
Sorry, I made a mistake and missed the return statement in the if block.

Also, we now check that we're not in safe mode in expireDelayedServers().

And the master is now in safe mode only while it is initializing.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214476#comment-13214476
]

chunhui shen commented on HBASE-5270:
-

I can't add a review request; it throws an error: The file
'https://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java'
(r1292711) could not be found in the repository.
Why?

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-23 Thread Zhihong Yu (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214793#comment-13214793
]

Zhihong Yu commented on HBASE-5270:
---

I was able to create a new request.
Select hbase for Repository.
Enter '/' for Base Directory.

Leave the Bugs field blank.
Enter hbase in the Groups field.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215298#comment-13215298
]

chunhui shen commented on HBASE-5270:
-

@Ted
I have created the review request:
https://reviews.apache.org/r/4021/

Thank you.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-23 Thread Zhihong Yu (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215355#comment-13215355
 ] 

Zhihong Yu commented on HBASE-5270:
---

There seems to be some compilation error:
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) 
on project hbase: Compilation failure: Compilation failure:
[ERROR] 
/home/hduser/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:[55,27]
 package org.apache.mina.util does not exist
...
[ERROR] 
/home/hduser/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:[107,54]
 cannot find symbol
[ERROR] symbol  : class ConcurrentHashSet
[ERROR] location: class org.apache.hadoop.hbase.master.ServerManager
{code}

 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 This method param names are not right 'definitiveRootServer'; what is meant 
 by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's carrying 
 root and meta?
 What is difference between asking assignment manager isCarryingRoot and this 
 variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed split log is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in code why we would do this. Why do we think master log splitting 
 is 'high priority' when it could very well be slower? Should we only go this 
 route if distributed splitting is not going on? Do we know if concurrent 
 distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215369#comment-13215369
]

chunhui shen commented on HBASE-5270:
-

I built the project again and didn't find any compilation errors.

bq.[ERROR]
/home/hduser/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:[55,27]
package org.apache.mina.util does not exist
Why package org.apache.mina.util? Is there some mistake?

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-23 Thread Zhihong Yu (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215370#comment-13215370
]

Zhihong Yu commented on HBASE-5270:
---

The compilation error was caused by the following in
src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
{code}
+import org.apache.mina.util.ConcurrentHashSet;
{code}

See
http://www.onkarjoshi.com/blog/201/concurrenthashset-in-java-from-concurrenthashmap/
for how to get a ConcurrentHashSet.
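
For reference, a concurrent set can also be built from the JDK alone, without
the Mina dependency. This is a minimal sketch, assuming Java 6 or later; the
names ConcurrentSetSketch and newConcurrentSet are made up for illustration and
do not come from the patch, and the server name is a placeholder.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the HBase patch.
class ConcurrentSetSketch {
  // Collections.newSetFromMap returns a Set view backed by the given map,
  // so the set inherits ConcurrentHashMap's thread-safety.
  static <T> Set<T> newConcurrentSet() {
    return Collections.newSetFromMap(new ConcurrentHashMap<T, Boolean>());
  }

  public static void main(String[] args) {
    Set<String> deadNotExpiredServers = newConcurrentSet();
    // Placeholder server name, for illustration only.
    deadNotExpiredServers.add("rs1.example.org,60020,1");
    System.out.println(deadNotExpiredServers);
  }
}
```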

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214394#comment-13214394
]

chunhui shen commented on HBASE-5270:
-

@Stack @Ted
In hbase-5270v4.patch,
I introduce a safe mode for the master while it is stopping or initializing.
In safe mode, the master delays processing of ServerShutdownHandler and refuses
many admin operations (see HBASE-5454).
With safe mode, we can ensure data safety and fix this issue much more easily.
Could you review it again?
Thanks.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread Zhihong Yu (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214396#comment-13214396
]

Zhihong Yu commented on HBASE-5270:
---

@Chunhui:
Since review board isn't used, do you mind highlighting the new changes in
hbase-5270v4.patch ?

Thanks

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread Zhihong Yu (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214398#comment-13214398
]

Zhihong Yu commented on HBASE-5270:
---

@Chunhui:
Have you verified patch v4 on a real, decent-sized cluster?
My concern is that safe mode would make cluster startup take longer, especially
after a critical issue has caused a cluster shutdown.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214405#comment-13214405
]

chunhui shen commented on HBASE-5270:
-

I define the master to be in safe mode while it is stopping or initializing.

If a region server dies while the master is in safe mode, ServerManager adds
its ServerName to a set (Set<ServerName> deadNotExpiredServers) but does not
expire it until the master is initialized.

So if that server was carrying META or ROOT, we will split its log when
assigning ROOT and META (assignRootAndMeta).

Also, when assigning regions, we will remove this dead server from the
destinations.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread chunhui shen (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214410#comment-13214410
]

chunhui shen commented on HBASE-5270:
-

@Ted
bq. safe mode would make cluster startup longer, especially after a critical
issue caused cluster shutdown.

I think it just makes some admin operations unavailable during safe mode; it
does not affect the data read and write service.
It will make SSH take longer, but only in the low-probability case where a
server dies while the master is initializing.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread Zhihong Yu (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214418#comment-13214418
 ] 

Zhihong Yu commented on HBASE-5270:
---

{code}
   public synchronized void expireServer(final ServerName serverName) {
+if (services.isSafeMode()) {
+  LOG.info("Master is in safe mode, delay expiring server " + serverName);
+  this.deadNotExpiredServers.add(serverName);
+}
{code}
So a server could be in both deadNotExpiredServers and deadservers? I don't 
see a return statement in the if block.

In expireDelayedServers(), should we check that we're not in safe mode ?

I recommend creating a review on review board. See an example in my first 
comment of this JIRA.
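
The two points above can be sketched as follows. This is a self-contained
illustration and not the patch itself: the class SafeModeExpirySketch, its
safeMode flag, the helper accessors, and the use of plain strings for server
names are all assumptions. Only expireServer, expireDelayedServers,
deadNotExpiredServers and the dead-servers collection are names taken from the
discussion.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the ServerManager logic under review.
class SafeModeExpirySketch {
  private final Set<String> deadNotExpiredServers = new LinkedHashSet<String>();
  private final Set<String> deadServers = new LinkedHashSet<String>();
  private boolean safeMode = true;

  synchronized void setSafeMode(boolean safeMode) {
    this.safeMode = safeMode;
  }

  synchronized void expireServer(String serverName) {
    if (safeMode) {
      // Park the server and return, so it never lands in both sets.
      deadNotExpiredServers.add(serverName);
      return;
    }
    deadServers.add(serverName);
  }

  synchronized void expireDelayedServers() {
    if (safeMode) {
      return; // still in safe mode: keep the expirations delayed
    }
    // Drain the parked servers, then expire each one for real.
    Set<String> delayed = new LinkedHashSet<String>(deadNotExpiredServers);
    deadNotExpiredServers.clear();
    for (String serverName : delayed) {
      expireServer(serverName);
    }
  }

  synchronized boolean isDelayed(String serverName) {
    return deadNotExpiredServers.contains(serverName);
  }

  synchronized boolean isDead(String serverName) {
    return deadServers.contains(serverName);
  }

  public static void main(String[] args) {
    SafeModeExpirySketch manager = new SafeModeExpirySketch();
    manager.expireServer("rs1");
    manager.setSafeMode(false);
    manager.expireDelayedServers();
    System.out.println(manager.isDead("rs1"));
  }
}
```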

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-18 Thread Zhihong Yu (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211068#comment-13211068
]

Zhihong Yu commented on HBASE-5270:
---

w.r.t. Chunhui's comment @ 18/Feb/12 02:52
We should correlate the 10s sleep after log splitting with the 20s sleep in the
test through some constant. Otherwise the test would easily break.
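
One way to tie the two sleeps together, as a sketch only: the class
SplitLogTestTimings and both constant names are hypothetical, and the 10s and
20s values are the ones mentioned in this thread.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical holder for timings shared by the master-side hook and the test.
final class SplitLogTestTimings {
  private SplitLogTestTimings() {}

  // Sleep injected after log splitting (10s in the discussion above).
  static final long SLEEP_AFTER_SPLIT_LOG_MS = TimeUnit.SECONDS.toMillis(10);

  // The test waits twice as long; derived from the first value, not hard-coded.
  static final long TEST_WAIT_MS = 2 * SLEEP_AFTER_SPLIT_LOG_MS;

  public static void main(String[] args) {
    System.out.println(SLEEP_AFTER_SPLIT_LOG_MS + " ms, " + TEST_WAIT_MS + " ms");
  }
}
```

With this shape, changing the sleep after log splitting automatically moves the
test's wait with it.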

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210772#comment-13210772
]

chunhui shen commented on HBASE-5270:
-

{code}
+// We set serverLoad with one region, it could differentiate with
+// regionserver which is started just now
+HServerLoad serverLoad = new HServerLoad();
+serverLoad.setNumberOfRegions(1);
{code}
bq. How you know it has a region?

We do this to mark the RS as one that was already running, as opposed to a
regionserver that has just started.
(A regionserver that has just started has no regions, so the master need not
expire it in assignRootAndMeta. Only the 0.90 version needs this, because there
rootLocation does not contain the startcode, so we cannot tell from the
HServerAddress alone whether it is the root server.)

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210776#comment-13210776
]

chunhui shen commented on HBASE-5270:
-

{code}Can you just do + super.nodeDeleted(path); instead of +
GatedNodeDeleteRegionServerTracker.super.nodeDeleted(path);?
{code}
If we block inside nodeDeleted(path) in GatedNodeDeleteRegionServerTracker, it
will block all ZK events.
So I delay the RS node-deleted event through a separate thread instead. Inside
that thread's run(), a plain super would refer to the thread's own superclass,
so we need to call
GatedNodeDeleteRegionServerTracker.super.nodeDeleted(path);
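
The qualified-super call can be shown in isolation. The sketch below uses
hypothetical stand-ins: Tracker plays the role of the real tracker class,
GatedTracker the role of the test subclass, and a CountDownLatch replaces the
gate flag from the test case. It only illustrates the Java mechanics and is not
the test code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical base class standing in for the real tracker.
class Tracker {
  final List<String> deleted = new CopyOnWriteArrayList<String>();

  void nodeDeleted(String path) {
    deleted.add(path);
  }
}

// Hypothetical gated subclass: delivery is deferred until the gate opens.
class GatedTracker extends Tracker {
  final CountDownLatch gate = new CountDownLatch(1);

  @Override
  void nodeDeleted(final String path) {
    // Hand off to a thread so the caller (the event thread) is not blocked.
    Thread worker = new Thread() {
      @Override
      public void run() {
        try {
          gate.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // Plain "super" here would mean Thread, which has no nodeDeleted;
        // the outer class name selects the enclosing instance's superclass.
        GatedTracker.super.nodeDeleted(path);
      }
    };
    worker.setDaemon(true);
    worker.start();
  }

  public static void main(String[] args) {
    GatedTracker tracker = new GatedTracker();
    tracker.nodeDeleted("/hbase/rs/rs1");
    tracker.gate.countDown(); // open the gate; delivery happens on the worker
  }
}
```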

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210777#comment-13210777
]

chunhui shen commented on HBASE-5270:
-

{code}
Why the need for this timeout:

+Thread.sleep(1 * 2);
+((GatedNodeDeleteRegionServerTracker) master.getRegionServerTracker()).gate
+.set(false);
{code}
Because we sleep 10s after splitLog, the test sleeps 20s to make sure that the
master is assigning ROOT and META or has already assigned them. After that we
start processing the RS node-deleted event.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210778#comment-13210778
]

chunhui shen commented on HBASE-5270:
-

Because the bug in this issue can leave root unassigned, the master would block
waiting for root during initialization.
So we set a timeout for the test case.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210783#comment-13210783
]

chunhui shen commented on HBASE-5270:
-

{code}
+ * Dead servers under processing by the ServerShutdownHander.
Whats that mean? Its while the server is being processed by
ServerShutdownHandler exclusively -- these are the inProgress servers?
{code}
Yes, these are the inProgress servers.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210785#comment-13210785
]

chunhui shen commented on HBASE-5270:
-

So, what happens if a server had root and meta and its not expired when we do
failover? We'll expire it processing root. Will we expire it a second time
processing meta? Perhaps the answer is no because the first expiration will
clear the meta state in master?
{code}
if (metaServerLoad != null && metaServerLoad.getNumberOfRegions() > 0
+ && !catalogTracker.getRootLocation().equals(metaServerAddress)) {
+ // If metaServer is online not start just now, we expire it
+ this.serverManager.expireServer(metaServerInfo);
+}
{code}
If a server had both root and meta, the check
catalogTracker.getRootLocation().equals(metaServerAddress) ensures we do not
expire it a second time.

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210788#comment-13210788
]

chunhui shen commented on HBASE-5270:
-

I will address the other suggestions later.
Thanks to Stack for the review!

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210787#comment-13210787
]

chunhui shen commented on HBASE-5270:
-

{code}
+// Remove regions in RIT, they are may being processed by the SSH.
+synchronized (regionsInTransition) {
+ nodes.removeAll(regionsInTransition.keySet());
+}
{code}
Perhaps SSH has put up something in RIT because its done an assign and here we
are blanket removing them all?

Yes, SSH and the master's initializing thread may assign the same regions, so
we need to guard against double assignment.
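
The guard in the snippet above can be illustrated on its own. In this sketch
the class RitFilterSketch, the method assignable, and the use of strings for
region names and states are assumptions; only the removeAll call on the keys of
regionsInTransition mirrors the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of skipping regions that are already in transition.
class RitFilterSketch {
  // Returns the candidate regions minus those another handler is processing.
  static Set<String> assignable(Set<String> candidates,
      Map<String, String> regionsInTransition) {
    Set<String> nodes = new TreeSet<String>(candidates);
    // Lock the shared map so its key set is stable during the removal.
    synchronized (regionsInTransition) {
      nodes.removeAll(regionsInTransition.keySet());
    }
    return nodes;
  }

  public static void main(String[] args) {
    Set<String> candidates = new TreeSet<String>();
    candidates.add("regionA");
    candidates.add("regionB");
    Map<String, String> rit = new HashMap<String, String>();
    rit.put("regionB", "PENDING_OPEN");
    System.out.println(assignable(candidates, rit));
  }
}
```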

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

[
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210791#comment-13210791
]

chunhui shen commented on HBASE-5270:
-

{code}So, what happens if a server had root and meta and it's not expired when
we do failover? We'll expire it when processing root. Will we expire it a second
time when processing meta? Perhaps the answer is no because the first expiration
will clear the meta state in master?
{code}
I'm sorry, I was wrong about the comment above.

If a server had root and meta, it will be expired when processing root,
and we will not expire it a second time when processing meta, because of the
following code (metaServerInfo == null at that point):
{code}
+ HServerInfo metaServerInfo = this.serverManager
+     .getHServerInfo(metaServerAddress);
+ if (metaServerInfo != null) {
+   HServerLoad metaServerLoad = metaServerInfo.getLoad();
+   if (metaServerLoad != null && metaServerLoad.getNumberOfRegions() > 0
+       && !catalogTracker.getRootLocation().equals(metaServerAddress)) {
+     // If metaServer is online not start just now, we expire it
+     this.serverManager.expireServer(metaServerInfo);
+   }
+ }
{code}
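
The reasoning above can be sketched as follows. This is a simplified model and
not the HBase ServerManager: ExpireOnceSketch and its methods are hypothetical,
and it assumes that expiring a server removes it from the online-server list, so
that a later lookup returns null. The root-location check from the patch is
left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "expire at most once"; not the HBase ServerManager.
class ExpireOnceSketch {
  // Server address -> number of regions it reports, for online servers only.
  private final Map<String, Integer> onlineServers = new HashMap<>();
  private int expirations = 0;

  void register(String address, int numRegions) {
    onlineServers.put(address, numRegions);
  }

  // Assumption: expiring a server drops it from the online list.
  void expireServer(String address) {
    if (onlineServers.remove(address) != null) {
      expirations++;
    }
  }

  // Mirrors the null and region-count checks in the quoted patch.
  boolean expireIfOnline(String address) {
    Integer load = onlineServers.get(address);
    if (load != null && load > 0) {
      expireServer(address);
      return true;
    }
    return false; // already expired, unknown, or carrying no regions
  }

  int expirations() {
    return expirations;
  }
}
```

Calling expireIfOnline twice for the same address expires the server only on
the first call; the second call finds no entry and does nothing.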

[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler