On 2/18/15, Christopher <ctubb...@apache.org> wrote:

> To rule out some scenarios, is it possible that your clients are writing to
> the wrong tables?
That was my first idea, so a few days ago I added assert() checks to
the writer code. No assertion has fired, but invalid values appeared
again after another tserver failure.

> Have you ever seen a failure affecting a table which does
> not exist (like what might happen if there's an off-by-one error in the WAL
> code)? Or affecting the metadata tables?
No.
Also, no tables have been created or deleted in the last two months.

> Can you reproduce this error reliably, or can you share the relevant ingest
> code which can reproduce this failure?

I will think about how to reproduce it.
What might be unusual about the code: inserts go to several (5-8)
tables at once (one data table plus a few index tables), but no
MultiTableBatchWriter is used. Instead, several BatchWriters (one per
table) are created and flushed sequentially, in the same thread. In
Accumulo 1.4 this was a performance optimization: it was faster than
MultiTableBatchWriter. I'm not sure that still holds for 1.6.1; this
code was not changed after the migration to 1.6.1.
In every case with invalid values, the index tables were affected (one
of the index tables contained values typical of another index table).
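
A minimal, self-contained Java sketch of the layout described above,
for anyone trying to reproduce it: one writer per table (data table
plus index tables), each receiving its values and then flushed one
after another on the calling thread. TableWriter is a hypothetical
in-memory stand-in for an Accumulo BatchWriter (not the real API), and
the per-table value check is an assumption about what the writer-side
assert() checks verify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical model of the ingest layout: one writer per table, flushed sequentially. */
class PerTableWriters {

    /** Hypothetical in-memory stand-in for an Accumulo BatchWriter bound to one table. */
    static final class TableWriter {
        final String table;
        final Predicate<String> valueCheck; // assumed per-table validity rule
        final List<String> pending = new ArrayList<>();
        final List<String> committed = new ArrayList<>();

        TableWriter(String table, Predicate<String> valueCheck) {
            this.table = table;
            this.valueCheck = valueCheck;
        }

        /** Buffers a value, rejecting it first if it does not look like it belongs to this table. */
        void add(String value) {
            if (!valueCheck.test(value)) {
                throw new IllegalStateException(
                        "value '" + value + "' is not valid for table " + table);
            }
            pending.add(value);
        }

        /** Moves buffered values to the committed list (stands in for BatchWriter.flush()). */
        void flush() {
            committed.addAll(pending);
            pending.clear();
        }
    }

    // Insertion order is kept so the writers are always flushed in the same order.
    final Map<String, TableWriter> writers = new LinkedHashMap<>();

    void open(String table, Predicate<String> valueCheck) {
        writers.put(table, new TableWriter(table, valueCheck));
    }

    /** Adds one value per table, then flushes every writer in turn on the calling thread. */
    void write(Map<String, String> valueByTable) {
        for (Map.Entry<String, String> e : valueByTable.entrySet()) {
            TableWriter w = writers.get(e.getKey());
            if (w == null) {
                throw new IllegalArgumentException("no writer open for table " + e.getKey());
            }
            w.add(e.getValue());
        }
        for (TableWriter w : writers.values()) {
            w.flush(); // sequential flushes; no MultiTableBatchWriter involved
        }
    }
}
```

In real Accumulo 1.6 client code, each per-table writer would typically
come from Connector.createBatchWriter(table, new BatchWriterConfig());
the alternative is a single Connector.createMultiTableBatchWriter(...)
whose getBatchWriter(table) returns per-table writers backed by one
shared underlying writer. A check like the one above only catches bad
values produced on the client side; it cannot see anything that goes
wrong on the servers after the write is sent.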

> Also, what kind of tablet server failures are you experiencing when this 
> happens?
Spontaneous power-offs. Something is wrong with the power supply units,
so every 2-3 days one of the servers suddenly turns off and reboots.
