I forgot to mention that the HDFS Datanodes and the Accumulo Tablet Servers share the same machines. When a machine powers off, one Tablet Server and one Datanode become unavailable at the same time.
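
For reference, the ingest pattern described in the quoted message below (one writer per table, flushed one after another in the same thread, with a write-side check that each entry goes to the table it was built for) can be sketched roughly as follows. This is only an illustration, not the actual ingest code: `TableWriter`, `Entry`, `PerTableWriters` and the table names are hypothetical in-memory stand-ins, and nothing here uses Accumulo's `BatchWriter` API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerTableWriters {

    /** One key/value pair destined for a specific table (hypothetical type). */
    static final class Entry {
        final String table, row, value;

        Entry(String table, String row, String value) {
            this.table = table;
            this.row = row;
            this.value = value;
        }
    }

    /** Hypothetical in-memory stand-in for a per-table writer; NOT Accumulo's BatchWriter. */
    static final class TableWriter {
        final String table;
        final List<Entry> pending = new ArrayList<>();  // buffered, not yet flushed
        final List<Entry> flushed = new ArrayList<>();  // handed off by flush()

        TableWriter(String table) {
            this.table = table;
        }

        void add(Entry e) {
            // Write-side guard: reject an entry that was built for a different table.
            if (!table.equals(e.table)) {
                throw new IllegalStateException(
                    "entry for '" + e.table + "' given to writer for '" + table + "'");
            }
            pending.add(e);
        }

        void flush() {
            flushed.addAll(pending);
            pending.clear();
        }
    }

    // Insertion-ordered so that flushAll() visits the tables in a fixed sequence.
    private final Map<String, TableWriter> writers = new LinkedHashMap<>();

    PerTableWriters(String... tables) {
        for (String t : tables) {
            writers.put(t, new TableWriter(t));
        }
    }

    TableWriter writer(String table) {
        TableWriter w = writers.get(table);
        if (w == null) {
            throw new IllegalArgumentException("unknown table: " + table);
        }
        return w;
    }

    /** Flush every writer one after another, in the calling thread. */
    void flushAll() {
        for (TableWriter w : writers.values()) {
            w.flush();
        }
    }

    public static void main(String[] args) {
        // Table names are made up for the example: one data table plus two index tables.
        PerTableWriters out = new PerTableWriters("data", "index_a", "index_b");
        out.writer("data").add(new Entry("data", "row1", "payload"));
        out.writer("index_a").add(new Entry("index_a", "a-term", "row1"));
        out.writer("index_b").add(new Entry("index_b", "b-term", "row1"));
        out.flushAll();
        System.out.println("index_a entries flushed: " + out.writer("index_a").flushed.size());
    }
}
```

A guard of this kind only catches mix-ups on the client side, before the data leaves the writer. It cannot detect values that end up in the wrong table during recovery after a tablet server failure, which matches what I see: no assert was triggered, yet invalid values appear after a failure.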

On 2/19/15, Eric Newton <eric.new...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/ACCUMULO-3603
>
> -Eric
>
>
> On Wed, Feb 18, 2015 at 7:12 PM, Denis <de...@camfex.cz> wrote:
>
>> On 2/18/15, Christopher <ctubb...@apache.org> wrote:
>>
>> > To rule out some scenarios, is it possible that your clients are
>> > writing to the wrong tables?
>> That was the first idea, so I added assert()s to the code of the
>> writers a few days ago. No assert was triggered, but some invalid
>> values appeared after a new tserver failure.
>>
>> > Have you ever seen a failure affecting a table which does
>> > not exist (like what might happen if there's an off-by-one error
>> > in the WAL code)? Or affecting the metadata tables?
>> No.
>> Also, no tables were created or deleted during the last two months.
>>
>> > Can you reproduce this error reliably, or can you share the
>> > relevant ingest code which can reproduce this failure?
>>
>> I will think about how to reproduce it.
>> What could be special about the code: inserts are performed to
>> several (5..8) tables at once (one data table + a few index tables),
>> but no MultiTableBatchWriter is used. Several BatchWriters (one per
>> table) are created and flushed sequentially, in the same thread. For
>> Accumulo 1.4 this was a performance optimization; it worked faster
>> than MultiTableBatchWriter. I am not sure whether that still holds
>> for 1.6.1; this code was not changed after the migration to 1.6.1.
>> In all cases with invalid values the index tables were affected (one
>> of the index tables had values typical for another of the index
>> tables).
>>
>> > Also, what kind of tablet server failures are you experiencing
>> > when this happens?
>> Spontaneous power-offs. There is something wrong with the power units,
>> so every 2-3 days one of the servers suddenly turns off and reboots.
>>
>