Github user ShawnWalker commented on the issue:
https://github.com/apache/accumulo/pull/121

> The stop-here.sh command has the master unload the tablets I think. How
> will this patch handle that case?

This patch won't handle such a case at all. I'm sure it shows my
inexperience with Accumulo, but I was unaware of this script. I'm more
familiar with engineering and dealing with [crash-only
software](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf).
I had assumed that a tserver would be stopped by SIGTERM or SIGKILL.

I'm open to suggestions on how to handle this use case. My current thought
is to make unloading a tablet this way suspend the tablet instead of
unassigning it. That is, in `tserver.TabletServer.UnloadTabletHandler.run()` at
line 2012, call `TabletStateStore.suspend(...)` instead of
`TabletStateStore.unassign(...)`.
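
A minimal sketch of that idea, under stated assumptions. The `StateStore` interface, the `RecordingStore` test double, and the `suspendOnUnload` flag are hypothetical stand-ins for illustration; the real `TabletStateStore.suspend(...)` and `unassign(...)` signatures are elided above and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the state store; not the real Accumulo API.
interface StateStore {
    void unassign(String tablet);
    void suspend(String tablet, String hostAndPort, long suspendedAtMillis);
}

// Test double that records which call the unload path made.
class RecordingStore implements StateStore {
    final List<String> calls = new ArrayList<>();

    public void unassign(String tablet) {
        calls.add("unassign:" + tablet);
    }

    public void suspend(String tablet, String hostAndPort, long suspendedAtMillis) {
        calls.add("suspend:" + tablet + "@" + hostAndPort);
    }
}

class UnloadSketch {
    // Runs after the tablet has been closed on the tserver.
    // suspendOnUnload is an assumed switch, e.g. set for a graceful stop.
    static void recordUnload(StateStore store, String tablet, String hostAndPort,
                             boolean suspendOnUnload, long nowMillis) {
        if (suspendOnUnload) {
            // Remember where the tablet was hosted so it can return there.
            store.suspend(tablet, hostAndPort, nowMillis);
        } else {
            // Behaviour described for the current patch: plain unassignment.
            store.unassign(tablet);
        }
    }
}
```

Whether the switch should be a per-request flag or derived from configuration is an open design question; the sketch only shows where the branch would sit.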

> When a tablet server is suspended, all queries will block right?

When a *tablet* is suspended, all queries against that tablet do seem to
block (or possibly time out).

> I see you are suspending the metadata tablets too.

By default, metadata tablets won't be suspended, even if the metadata table
(or global configuration) has `tablet.suspend.duration` set. One must also set
the option `master.metadata.suspendable` to true (default false). The check for
this is handled at Master.java:1154.

Note to self: Looking back at that code, I realize that this check is made
only once (at startup), instead of rechecking for updated configuration.
The check should probably be repeated so that configuration changes take
effect without a restart.
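
To illustrate the difference between checking once and rechecking, here is a small sketch. It assumes configuration can be modelled as a plain `Map<String,String>`; the `SuspendableCheck` class and both helper methods are hypothetical and do not reflect how Master.java actually reads its configuration.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

class SuspendableCheck {
    static final String KEY = "master.metadata.suspendable";

    // Reads the flag once, at creation time; later updates are never seen.
    static BooleanSupplier readOnce(Map<String, String> conf) {
        final boolean value = Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
        return () -> value;
    }

    // Re-reads the flag on every call, so updated configuration is picked up.
    static BooleanSupplier readEachTime(Map<String, String> conf) {
        return () -> Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }
}
```

With the first variant, setting the option to true after startup has no effect until a restart; with the second, the next check sees the new value.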

> I see you are storing the host and port in the metadata for a suspended
> tablet. Sometimes we have tservers come up with a different host or port. In
> that case, I guess the tablets will wait until the suspend duration to be
> reassigned.

This is correct. Tablet suspension is essentially incompatible with
dynamic port assignment. Of course, this wouldn't be the only part of Accumulo
to suffer under random/dynamic port assignment. Specifying
`tserver.port.client=0` or `tserver.port.search=true` breaks assumptions in
other places too. Some I know of:

* I decided to match on host+port based on the code in
  `server.master.balance.DefaultLoadBalancer.getAssignment()`. That code uses
  host+port to match a tablet's `last` column, for preserving locality. If the
  tserver's port changes, the `last` column is effectively ignored, reducing
  locality.
* Having walked the logic path for `stop-here.sh`, my read is that
  `server.util.Admin.stopTabletServer(...)` (used by `stop-here.sh`) assumes
  tserver(s) on the specified host (resp. localhost) will be on port(s)
  specified by `tserver.port.client`. Hence, running a tserver with
  `tserver.port.client=0` will render `stop-here.sh` ineffective. Similarly,
  running a tserver with `tserver.port.search=true` risks rendering
  `stop-here.sh` ineffective.
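
The host+port matching discussed above can be sketched as a small decision function. This is an illustration of the behaviour described in this thread, not the master's actual logic; the `SuspendDecision` class, the `Action` names, and the choice of `>=` for the timeout comparison are all assumptions.

```java
import java.util.Set;

class SuspendDecision {
    enum Action { REASSIGN_TO_SUSPENDED_SERVER, KEEP_WAITING, UNASSIGN }

    // Decides what to do with a suspended tablet.
    // suspendedHostPort is the "host:port" recorded when the tablet was suspended.
    static Action decide(String suspendedHostPort, long suspendedAtMillis,
                         long nowMillis, long suspendDurationMillis,
                         Set<String> liveServers) {
        // An exact host+port match means the original tserver is back.
        if (liveServers.contains(suspendedHostPort)) {
            return Action.REASSIGN_TO_SUSPENDED_SERVER;
        }
        // No match: once the suspend duration has elapsed, give up waiting.
        if (nowMillis - suspendedAtMillis >= suspendDurationMillis) {
            return Action.UNASSIGN;
        }
        // Otherwise the tablet stays suspended (and unavailable) for now.
        return Action.KEEP_WAITING;
    }
}
```

A tserver that returns on the same host but a different port never matches, so its tablets wait out the full suspend duration, which is the cost of dynamic port assignment noted above.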