I forgot to mention that the HDFS Datanodes and Accumulo Tablet Servers share
the same machines.
When a machine powers off, one Tablet Server and one Datanode become
unavailable.

On 2/19/15, Eric Newton <eric.new...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/ACCUMULO-3603
>
> -Eric
>
>
> On Wed, Feb 18, 2015 at 7:12 PM, Denis <de...@camfex.cz> wrote:
>
>> On 2/18/15, Christopher <ctubb...@apache.org> wrote:
>>
>> > To rule out some scenarios, is it possible that your clients are
>> > writing to the wrong tables?
>> That was the first idea, so I added assert()s to the writers' code a
>> few days ago. No assert was triggered, but some invalid values
>> appeared after a new tserver failure.
>>
>> > Have you ever seen a failure affecting a table which does
>> > not exist (like what might happen if there's an off-by-one error in the
>> > WAL code)? Or affecting the metadata tables?
>> No.
>> Also, no tables were created or deleted during the last two months.
>>
>> > Can you reproduce this error reliably, or can you share the relevant
>> > ingest code which can reproduce this failure?
>>
>> I will think about how to reproduce it.
>> What could be special about the code: inserts go to several (5..8)
>> tables at once (one data table plus a few index tables), but no
>> MultiTableBatchWriter is used. Several BatchWriters (one per table) are
>> created and flushed consecutively, in the same thread. For Accumulo
>> 1.4 this was a performance optimization: it worked faster than
>> MultiTableBatchWriter. I am not sure that still holds for 1.6.1; this
>> code was not changed after the migration to 1.6.1.
>> In all cases with invalid values the index tables were affected (one
>> of the index tables had values typical for another of the index
>> tables).
>>
>> > Also, what kind of tablet server failures are you experiencing when
>> > this happens?
>> Spontaneous power-offs. There is something wrong with the power units
>> so every 2-3 days one of the servers suddenly turns off and reboots.
>>
>
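
For reference, the write pattern described in the quoted text (one writer per table, each guarded by a check on the target table, all flushed one after another in the same thread) can be modelled roughly as below. This is only a sketch in plain Java: TableWriter and SequentialIngest are hypothetical stand-ins invented for illustration, not the Accumulo BatchWriter API and not the actual ingest code, and the table names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a per-table writer (NOT Accumulo's BatchWriter).
final class TableWriter {
    private final String table;
    private final List<String> buffered = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();

    TableWriter(String table) {
        this.table = table;
    }

    // Buffer one entry; reject it if it is addressed to a different table.
    // This plays the role of the assert()s mentioned in the thread.
    void add(String targetTable, String entry) {
        if (!table.equals(targetTable)) {
            throw new IllegalStateException(
                "entry for " + targetTable + " sent to writer for " + table);
        }
        buffered.add(entry);
    }

    // Move everything buffered so far into the "stored" list.
    void flush() {
        flushed.addAll(buffered);
        buffered.clear();
    }

    List<String> flushed() {
        return flushed;
    }
}

// Holds one writer per table and flushes them in order, in the calling thread.
final class SequentialIngest {
    private final Map<String, TableWriter> writers = new LinkedHashMap<>();

    SequentialIngest(List<String> tables) {
        for (String t : tables) {
            writers.put(t, new TableWriter(t));
        }
    }

    void write(String table, String entry) {
        TableWriter w = writers.get(table);
        if (w == null) {
            throw new IllegalArgumentException("unknown table " + table);
        }
        w.add(table, entry);
    }

    // Flush the writers one after another, in table creation order.
    void flushAll() {
        for (TableWriter w : writers.values()) {
            w.flush();
        }
    }

    List<String> contents(String table) {
        return writers.get(table).flushed();
    }
}
```

In this model an entry can only reach the table its writer was created for, which is the property the asserts in the real writers were meant to check on the client side.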
