GitHub user ShawnWalker commented on the issue:

https://github.com/apache/accumulo/pull/121

> The stop-here.sh command has the master unload the tablets I think. How will this patch handle that case?

This patch won't handle such a case at all. I'm sure it shows my inexperience with Accumulo, but I was unaware of this script. I'm more familiar with engineering and dealing with [crash-only software](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf). I had assumed that a tserver would be stopped by SIGTERM or SIGKILL.

I'm open to suggestions on how to handle this use case. My current thought would be to make unloading a tablet this way suspend the tablet instead of unassigning it. I.e. in `tserver.TabletServer.UnloadTabletHandler.run()` at line 2012, call `TabletStateStore.suspend(...)` instead of `TabletStateStore.unassign(...)`.

> When a tablet server is suspended, all queries will block right?

When a *tablet* is suspended, all queries against that tablet do seem to block (or possibly time out).

> I see you are suspending the metadata tablets too.

By default, metadata tablets won't be suspended, even if the metadata table (or global configuration) has `tablet.suspend.duration` set. One must also set the option `master.metadata.suspendable` to true (default false). The check for this is handled at Master.java:1154.

Note to self: Looking back at that code, I realize that this check is made only once (at startup), instead of rechecking for updated configuration. Should probably make that check repeatedly.

> I see you are storing the host and port in the metadata for a suspended tablet. Sometimes we have tservers come up with a different host or port. In that case, I guess the tablets will wait until the suspend duration to be reassigned.

This is correct. Tablet suspension is essentially incompatible with dynamic port assignment. Of course, this wouldn't be the only part of Accumulo to suffer under random/dynamic port assignment.
Specifying `tserv.port.client==0` or `tserv.port.search==true` breaks assumptions in other places too. Some I know of:

* I decided to match host+port based on code in `server.master.balance.DefaultLoadBalancer.getAssignment()`. That code uses host+port to match a tablet's `last` column, for preserving locality. If the tserver's port changes, the `last` column is effectively ignored, reducing locality.
* Having walked the logic path for `stop-here.sh`, my read is that `server.util.Admin.stopTabletServer(...)` (used by `stop-here.sh`) assumes tserver(s) on the specified host (resp. localhost) will be on port(s) specified by `tserv.port.client`. Hence, running a tserver with `tserv.port.client==0` will render `stop-here.sh` ineffective. Similarly, running a tserver with `tserv.port.search==true` risks rendering `stop-here.sh` ineffective.
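
To make the host+port matching discussed above concrete, here is a minimal sketch of the decision as I understand it. It is not code from the patch: `SuspendSketch`, `Decision`, and `decide(...)` are hypothetical names, servers are modelled as plain `host:port` strings, and the ordering of the two checks is my assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: none of these names exist in Accumulo. */
final class SuspendSketch {

    enum Decision { RETURN_TO_PREVIOUS_SERVER, KEEP_WAITING, ASSIGN_ELSEWHERE }

    /**
     * Decide what to do with one suspended tablet.
     *
     * @param suspendedAt           "host:port" recorded when the tablet was suspended
     * @param suspendedAtMillis     time at which the tablet was suspended
     * @param liveServers           "host:port" of every tserver currently alive
     * @param nowMillis             current time
     * @param suspendDurationMillis how long a tablet may stay suspended
     */
    static Decision decide(String suspendedAt, long suspendedAtMillis,
                           Set<String> liveServers, long nowMillis,
                           long suspendDurationMillis) {
        // Exact host+port match: the same server is back, so locality is preserved.
        if (liveServers.contains(suspendedAt)) {
            return Decision.RETURN_TO_PREVIOUS_SERVER;
        }
        // No match yet (for example, the tserver returned on a different port):
        // the tablet stays suspended until the duration runs out.
        if (nowMillis - suspendedAtMillis < suspendDurationMillis) {
            return Decision.KEEP_WAITING;
        }
        // Duration elapsed: treat the tablet like any unassigned tablet.
        return Decision.ASSIGN_ELSEWHERE;
    }

    public static void main(String[] args) {
        // The tserver came back on port 9998 instead of 9997.
        Set<String> live = new HashSet<>(Arrays.asList("host1:9998"));
        System.out.println(decide("host1:9997", 0L, live, 1_000L, 60_000L));  // KEEP_WAITING
        System.out.println(decide("host1:9997", 0L, live, 61_000L, 60_000L)); // ASSIGN_ELSEWHERE
    }
}
```

The example in `main` shows why dynamic ports hurt: a tserver that restarts on a new port never matches the recorded location, so its tablets sit out the full suspend duration before being assigned anywhere.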