GitHub user ShawnWalker commented on the issue:

    https://github.com/apache/accumulo/pull/121
  
    > The stop-here.sh command has the master unload the tablets I think. How 
will this patch handle that case?
    This patch won't handle that case at all.  I'm sure it shows my 
inexperience with Accumulo, but I was unaware of this script.  I'm more 
familiar with building and operating [crash-only 
software](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf), 
so I had assumed that a tserver would be stopped by SIGTERM or SIGKILL.
    
    I'm open to suggestions on how to handle this use case.  My current thought 
would be to make unloading a tablet this way suspend the tablet instead of 
unassigning it.  I.e. in `tserver.TabletServer.UnloadTabletHandler.run()` at 
line 2012, call `TabletStateStore.suspend(...)` instead of 
`TabletStateStore.unassign(...)`.
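
A minimal sketch of that idea, not the actual Accumulo code: `UnloadSketch`, `InMemoryStateStore`, and the `suspendDurationMillis` parameter are hypothetical stand-ins, and the rule "suspend only when a positive suspend duration is configured" is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class UnloadSketch {
    enum State { HOSTED, SUSPENDED, UNASSIGNED }

    /** Hypothetical in-memory stand-in for a tablet state store (not Accumulo's API). */
    static final class InMemoryStateStore {
        private final Map<String, State> states = new HashMap<>();

        void host(String tablet) { states.put(tablet, State.HOSTED); }
        void suspend(String tablet) { states.put(tablet, State.SUSPENDED); }
        void unassign(String tablet) { states.put(tablet, State.UNASSIGNED); }
        State stateOf(String tablet) { return states.get(tablet); }
    }

    /**
     * Unload path: suspend the tablet when suspension is enabled,
     * otherwise fall back to unassigning it.
     * Assumption: a duration > 0 means suspension is enabled.
     */
    static void unload(InMemoryStateStore store, String tablet, long suspendDurationMillis) {
        if (suspendDurationMillis > 0) {
            store.suspend(tablet);
        } else {
            store.unassign(tablet);
        }
    }

    public static void main(String[] args) {
        InMemoryStateStore store = new InMemoryStateStore();
        store.host("tablet-a");
        store.host("tablet-b");
        unload(store, "tablet-a", 60_000L); // suspension enabled
        unload(store, "tablet-b", 0L);      // suspension disabled
        System.out.println("tablet-a: " + store.stateOf("tablet-a"));
        System.out.println("tablet-b: " + store.stateOf("tablet-b"));
    }
}
```

With a change along these lines, a tablet unloaded through the master-driven path would end up in the same state as one whose tserver simply died.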
    
    > When a tablet server is suspended, all queries will block right?
    When a *tablet* is suspended, all queries against that tablet do seem to 
block (or possibly time out).
    
    > I see you are suspending the metadata tablets too.
    By default, metadata tablets won't be suspended, even if the metadata table 
(or global configuration) has `tablet.suspend.duration` set.  One must also set 
the option `master.metadata.suspendable` to true (default false). The check for 
this is handled at Master.java:1154. 
    
    Note to self: Looking back at that code, I realize that this check is made 
only once (at startup), so later configuration updates are never picked up.  
I should probably make that check repeatedly.
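
To illustrate the difference, here is a small sketch under stated assumptions: the configuration lookup is modeled as a plain `BooleanSupplier`, and the class names are hypothetical rather than anything in `Master.java`. One variant reads the flag once at construction; the other re-reads it on every call.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

final class SuspendableCheckSketch {

    /** Reads the flag once, at construction; later config changes are invisible. */
    static final class CheckedOnce {
        private final boolean suspendable;

        CheckedOnce(BooleanSupplier config) {
            this.suspendable = config.getAsBoolean();
        }

        boolean metadataSuspendable() { return suspendable; }
    }

    /** Re-reads the flag on every call, so config changes take effect. */
    static final class CheckedEachTime {
        private final BooleanSupplier config;

        CheckedEachTime(BooleanSupplier config) {
            this.config = config;
        }

        boolean metadataSuspendable() { return config.getAsBoolean(); }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the live value of the suspendable option.
        AtomicBoolean liveConfig = new AtomicBoolean(false);
        CheckedOnce once = new CheckedOnce(liveConfig::get);
        CheckedEachTime each = new CheckedEachTime(liveConfig::get);

        liveConfig.set(true); // operator updates the configuration after startup

        System.out.println("checked once:      " + once.metadataSuspendable());
        System.out.println("checked each time: " + each.metadataSuspendable());
    }
}
```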
    
    > I see you are storing the host and port in the metadata for a suspended 
tablet. Sometimes we have tservers come up with a different host or port. In 
that case, I guess the tablets will wait until the suspend duration to be 
reassigned.
    This is correct.  Tablet suspension is essentially incompatible with 
dynamic port assignment.  Of course, this wouldn't be the only part of Accumulo 
to suffer under random/dynamic port assignment.  Setting 
`tserv.port.client=0` or `tserv.port.search=true` breaks assumptions in other 
places too.  Two that I know of:
    * I decided to match host+port based on code in 
`server.master.balance.DefaultLoadBalancer.getAssignment()`.  That code uses 
host+port to match a tablet's `last` column, for preserving locality.  If the 
tserver's port changes, the `last` column is effectively ignored, reducing 
locality.
    * Having walked the logic path for `stop-here.sh`, my read is that 
`server.util.Admin.stopTabletServer(...)` (used by `stop-here.sh`) assumes 
that any tserver on the specified host (or on localhost) is listening on 
the port specified by `tserv.port.client`.  Hence, running a tserver with 
`tserv.port.client=0` will render `stop-here.sh` ineffective.  Similarly, 
running a tserver with `tserv.port.search=true` risks rendering `stop-here.sh` 
ineffective.
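
The host+port matching for suspended tablets described above can be sketched roughly as follows. This is an illustration only: the `host:port` strings, the `Action` names, and the `decide` helper are hypothetical and do not mirror the patch's actual classes.

```java
import java.util.Collections;
import java.util.Set;

final class SuspendedTabletSketch {
    enum Action { REASSIGN_TO_PREVIOUS, KEEP_WAITING, ASSIGN_ANYWHERE }

    /**
     * Decide what to do with a suspended tablet.
     *
     * @param suspendedAt      "host:port" recorded when the tablet was suspended
     * @param liveServers      "host:port" of every tserver currently alive
     * @param suspendedMillis  time at which the tablet was suspended
     * @param nowMillis        current time
     * @param durationMillis   configured suspend duration
     */
    static Action decide(String suspendedAt, Set<String> liveServers,
                         long suspendedMillis, long nowMillis, long durationMillis) {
        if (liveServers.contains(suspendedAt)) {
            // Same host AND same port came back: hand the tablet straight back.
            return Action.REASSIGN_TO_PREVIOUS;
        }
        if (nowMillis - suspendedMillis < durationMillis) {
            // No exact match yet; a tserver on a different port does not count.
            return Action.KEEP_WAITING;
        }
        // Suspend duration elapsed: treat the tablet as ordinary unassigned.
        return Action.ASSIGN_ANYWHERE;
    }

    public static void main(String[] args) {
        // The tserver restarted on the same host but picked a different port.
        Set<String> live = Collections.singleton("host1:9998");
        System.out.println(decide("host1:9997", live, 0L, 1_000L, 5_000L));
        System.out.println(decide("host1:9997", live, 0L, 6_000L, 5_000L));
    }
}
```

Because the match is on the exact host+port pair, a tserver that returns on a new port is indistinguishable from one that never returned, which is why the tablet sits out the full suspend duration.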


