[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-20 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
@ShawnWalker can you close this PR? Its in 1.8 in commit 
3e5524c3c391d2556492d070710a789510be3532.  Locally, I squashed these commits 
into one,  cherry picked that single commit to 1.8, and resolved conflicts.  I 
forgot to put the right message in the commit to close the PR.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-19 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
I looked in the code.  The suspension time is obtained from a table config 
object.  I am going to push these changes to 1.8 and master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-18 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
I ran some manual test on EC2 w/ 10 nodes.   It worked nicely.  I created 
1000 tablets and ran continuous ingest.   One thing I tried is setting the 
suspend duration to 10m, admin stop, waiting for that come back, setting 
suspend duration to 5, killing a tserver.   It seems like this 2nd tserver took 
10 min to come back (but this is from looking at monitor plot and guessing).  
Any, I want to review code for case where duration is changed and make sure 
thats picked up.  Also I will try testing this again tomorrow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-14 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
I cherry picked this back to 1.8 in my local fork.  My intent is to play 
around with this on a 10 node EC2 cluster.  I will report back after I do that. 
 If anyone has any particular experiments they would like me to run let me know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-08 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> Did you test running stop-here.sh?
I couldn't get stop-here.sh to do anything in my setup.  Running `accumulo 
admin stop ...` brought a flaw to my attention, which I've fixed: I was mixing 
up milliseconds and nanoseconds.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-07 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> Indeed, I found this out when I went to implement my idea.

I just noticed that commit.  I can look over those changes today.  I like 
the idea of changing save to an enum.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-07 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> I took a quick look and it seemed a tserver did not know why it was 
unloading. It seems like the tserver would need to know why it was unloading so 
it could only suspend when being stopped (would not want to set suspend for 
migration). Alternatively the unload thrift message could possibly be changed 
to include an option to suspension.

Indeed, I found this out when I went to implement my idea.  So in my most 
recent push, I modified the the thrift method 
`TabletClientService.unloadTablet(...)`.  Whereas it used to take a `boolean 
save` parameter, I instead have it take an enumerated value describing the 
desired end state of the tablet:
* UNASSIGNED: Upon unloading, leave the tablet in the UNASSIGNED state.
* SUSPENDED: Upon unloading, leave the tablet in the SUSPENDED state.
* DELETED: Upon unloading, leave the tablet in the UNASSIGNED state.  This 
tablet is being unloaded so it can be deleted soon; as such, don't bother minor 
compacting before unloading.  This is the behavior that setting `save==false` 
would accomplish previously.
* UNKNOWN: Well, ugh.  The enumerated value needed to correspond with 
values from `master.Master.TabletGoalState`, and I had no idea how to cleanly 
map `TabletGoalState.HOSTED`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-07 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> My current thought would be to make unloading a tablet this way suspend 
the tablet instead of unassigning

I took a quick look and it seemed a tserver did not know why it was 
unloading.   It seems like the tserver would need to know why it was unloading 
so it could only suspend when being stopped (would not want to set suspend for 
migration).  Alternatively the unload thrift message could possibly be changed 
to include an option to suspension.

Would need to analyze the master and tserver code for concurrency issues if 
taking this approach.  Currently only the master is writing suspend.

>  Hence, running a tserver with tserv.port.client==0 will render 
Admin.stopTabletServer(...) ineffective, and stop-here.sh will fall back to 
sending SIGTERM/SIGKILL to stop the tserver. Similarly, running a tserver with 
tserv.port.search==true risks the same behavior.

That's an interesting find.  Can you open a new issue for this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-05 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> The stop-here.sh command has the master unload the tablets I think. How 
will this patch handle that case?
This patch won't handle such a case at all.  I'm sure it shows my 
inexperience with Accumulo, but I was unaware of this script.  I'm more 
familiar with engineering and dealing with [crash-only 
software](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf).
  I had assumed that a tserver would be stopped by SIGTERM or SIGKILL.

I'm open to suggestions on how to handle this use case.  My current thought 
would be to make unloading a tablet this way suspend the tablet instead of 
unassigning it.  I.e. in `tserver.TabletServer.UnloadTabletHandler.run()` at 
line 2012, call `TabletStateStore.suspend(...)` instead of 
`TabletStateStore.unassign(...)`.

> When a tablet server is suspended, all queries will block right?
When a *tablet* is suspended, all queries against that tablet do seem to 
block (or possibly time out).

> I see you are suspending the metadata tablets too.
By default, metadata tablets won't be suspended, even if the metadata table 
(or global configuration) has `tablet.suspend.duration` set.  One must also set 
the option `master.metadata.suspendable` to true (default false). The check for 
this is handled at Master.java:1154. 

Note to self: Looking back at that code, I realize that this check is made 
only once (at startup), instead of rechecking for updated configuration.  
Should probably make that check repeatedly.

> I see you are storing the host and port in the metadata for a suspended 
tablet. Sometimes we have tservers come up with a different host or port. In 
that case, I guess the tablets will wait until the suspend duration to be 
reassigned.
This is correct.  Tablet suspension is essentially incompatible with 
dynamic port assignment.  Of course, this wouldn't be the only part of Accumulo 
to suffer under random/dynamic port assignment.  Specifying 
`tserv.port.client==0` or `tserv.port.search==true` breaks assumptions in other 
places too.  Some I know of:
* I decided to match host+port based on code in 
`server.master.balance.DefaultLoadBalancer.getAssignment()`.  That code uses 
host+port to match a tablet's `last` column, for preserving locality.  If the 
tserver's port changes, the `last` column is effectively ignored, reducing 
locality.
* Having walked the logic path for `stop-here.sh`, my read is that 
`server.util.Admin.stopTabletServer(...)` (used by stop-here.sh) assumes 
tserver(s) on the specified  host (resp. localhost) will be on port(s) 
specified by `tserv.port.client`.  Hence, running a tserver with 
`tserv.port.client`==0 will render `stop-here.sh` ineffective.  Similarly, 
running a tserver with `tserv.port.search==true` risks rendering `stop-here.sh` 
ineffective.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-07-01 Thread mjwall
Github user mjwall commented on the issue:

https://github.com/apache/accumulo/pull/121
  
@ShawnWalker I reviewed this and talked with @keith-turner some.  I imagine 
during a rolling upgrade the process would be.

- Stage the new install on the box
- Run stop-here.sh
- Reconfigure the node to use the new install
- Run start-here.sh

The stop-here.sh command has the master unload the tablets I think.  How 
will this patch handle that case?

When a tablet server is suspended, all queries will block right?  I see you 
are suspending the metadata tablets too.  That will affect more than just a 
user queries though.  Bulk imports trying to get hosted locations comes to 
mind.  I see you are not suspending the root tablet though.

I see you are storing the host and port in the metadata for a suspended 
tablet.  Sometimes we have tservers come up with a different host or port.  In 
that case, I guess the tablets will wait until the suspend duration to be 
reassigned.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-30 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> Did you have suggestions for additional tests (manual or automated) that 
you would like to see?

No it sounds like you did the manual testing I would have done.  The case 
where there are multiple tservers w/ different host names.   In the IT the 
hostnames are all the same.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-29 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
Some testing on a small development cluster before I wrote 
`SuspendedTabletsIT`, but nothing significantly different from what I has 
`SuspendedTabletsIT` perform: Start Accumulo, create table, add split points to 
have multiple tablets, kill tserver, observe tablets being suspended, restart 
tserver, observe tablets being reassigned to previously dead tserver.

Did you have suggestions for additional tests (manual or automated) that 
you would like to see?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-29 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
@ShawnWalker did you do any testing besides the IT?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-27 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> If the tablet server returns before master notices it's gone, master will 
see it as a new empty tablet server. 

One possible beginning of solution is to change the behavior of 
`Master.gatherTabletInformation()` to do the following :

 * Check if the instance matches the tablet server its talking to get 
status.  This could be done by adding the instance id to the returned Thrift 
struct.
 * If the instance id does not match, then add tserver to `badServers`.  
This will prevent balancing (not sure if it will have other undesired 
consequences).

This will detect a new tsever instance after obtaining info from zookeeper. 
 I am still looking at code and thinking about what else may need to be done.

@ShawnWalker offline you were wondering if it was possible to have info 
about the same host+port twice in the master memory.  I looked at how the 
current code works and it attempts to prevent this.  Below is what I found :
 
  * The master uses `tserverStatus` for balancing which is populated at 
[Master line 
977](https://github.com/apache/accumulo/blob/97fdfc5912ce07615b9e85899675adc3d1b12578/server/master/src/main/java/org/apache/accumulo/master/Master.java#L977)
 by calling `gatherTabletInformation()`
 * `gatherTabletInformation()` calls `LiveTserverSet.getCurrentServers()`  
which returns `currentInstances` which us updated by 
[LiveTserverSet.checkServer()](https://github.com/apache/accumulo/blob/97fdfc5912ce07615b9e85899675adc3d1b12578/server/base/src/main/java/org/apache/accumulo/server/master/LiveTServerSet.java#L273).

The implementation of `checkServer()` uses the `current` map to ensure a 
tserver+port only has one entry in `currentInstances`.  The `current` map key 
is based on the zookeeper lock name which is host+port as set on [TabletServer 
line 
2309](https://github.com/apache/accumulo/blob/97fdfc5912ce07615b9e85899675adc3d1b12578/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2309)




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-27 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
If the tablet server returns before master notices it's gone, master will 
see it as a new empty tablet server.  This will (usually) cause balancing to be 
run, to give that tserver some work.  Once master notices that the original 
tserver has died, the original tserver's tablets will move to 
`TabletState.SUSPENDED`, and (assuming a sufficiently large 
`tablet.suspend.duration`), those suspended tablets will shortly be assigned to 
the "new" tserver.  This will again cause imbalance, as the "new" tserver will 
have approximately double it's expected load, causing yet more round(s) of 
balancing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-27 Thread keith-turner
Github user keith-turner commented on the issue:

https://github.com/apache/accumulo/pull/121
  
> I haven't accounted for the possibility that a tablet server might return 
before the master notices it had died. 

What do you think will happen in this situation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-24 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
Upon further thought, I haven't accounted for the possibility that a tablet 
server might return before the master notices it had died.  Such a situation 
would likely happen during a rolling restart, and might also happen in the 
other types of transient failures that have been discussed.

One solution which suggests itself to me is to block balancing until after 
the next assignment run whenever the `LiveTServerSet` contains two distinct 
tservers at the same location (host+port).  But that would still leave a minor 
race condition.

Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] accumulo issue #121: ACCUMULO-4353: Stabilize tablet assignment during trans...

2016-06-24 Thread ShawnWalker
Github user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121
  
Well, Jenkins doesn't seem to like my PR, but its console output suggests a 
problem in the build server.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---