<snip>

Keith Turner wrote:
>  Assuming batches were isolated from each other, and all batch/mutation
>  flushes were controlled and done once per batch, is it difficult because
>  the writes could be going to different tablet servers? Couldn't we keep
>  track of which failed and have a choice of having a configurable internal
>  retry (transient errors) or return the subset of mutations which failed and
>  leave it up to the caller? This could work for us. We might want need some
>  guarantees for a given row on the same server though - would have to think
>  about that.

The batch writer does retry on network errors (until the timeout is
reached, which defaults to max long or int).  I think the only things
that percolate up to the user are unexpected exceptions in the batch
writer or tserver, and constraint violations.  Are you interested in
knowing which mutations failed because of a timeout?  I don't think
this can be done w/o introducing a more expensive multi-step protocol
for writing data.  Currently, when the batch writer sends data, it's
possible that the tserver received it and wrote it, but could not
report success to the client.  The client may then either time out or
send the data again.
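The lost-ack ambiguity described above can be sketched with a toy model (this is an illustrative simulation under my own assumptions, not Accumulo code): the server durably writes the batch but the success ack is lost, so the client's only safe move is to resend the whole batch, and the writes get applied twice.

```java
import java.util.ArrayList;
import java.util.List;

public class LostAckDemo {
    // A toy tablet server: applies mutations, but may fail to ack.
    static class TabletServer {
        final List<String> written = new ArrayList<>();
        boolean dropNextAck;  // simulate a network failure after the write

        boolean write(List<String> batch) {
            written.addAll(batch);  // data is durably written...
            if (dropNextAck) {
                dropNextAck = false;
                return false;       // ...but the ack never reaches the client
            }
            return true;
        }
    }

    // A toy client that resends the whole batch when it sees no ack,
    // mirroring the batch writer's retry-until-timeout behavior.
    static List<String> send(TabletServer ts, List<String> batch) {
        while (!ts.write(batch)) {
            // The client cannot tell "written but unacked" from
            // "never written", so it must resend everything.
        }
        return ts.written;
    }

    public static void main(String[] args) {
        TabletServer ts = new TabletServer();
        ts.dropNextAck = true;
        List<String> result = send(ts, List.of("row1:a", "row2:b"));
        // The batch was applied twice (4 entries, not 2), which is why
        // mutations need to be idempotent under this protocol.
        System.out.println(result.size());
    }
}
```

Reporting exactly which mutations failed would require the server to track and acknowledge per-mutation state across retries, i.e. the more expensive multi-step protocol mentioned above.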


It's trickier because, server-side, we're also doing group commits to the WAL. Your update session (started by the BatchWriter) will make some updates to the WAL and then block until they are sync'ed. That sync may cover WAL updates from sessions other than your own.
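To make the group-commit point concrete, here is a minimal sketch (my own toy model, not Accumulo's WAL implementation): several sessions append to a shared pending buffer, and a single sync makes all of them durable at once, so a sync error could not be cleanly attributed to any one session's mutations.

```java
import java.util.ArrayList;
import java.util.List;

// Toy group-commit WAL: one sync persists appends from many sessions.
public class GroupCommitWal {
    final List<String> pending = new ArrayList<>();
    final List<String> durable = new ArrayList<>();

    // Each session appends its updates; nothing is durable yet.
    void append(String sessionId, String update) {
        pending.add(sessionId + ":" + update);
    }

    // One sync flushes everything pending, including other sessions'
    // updates. If this sync failed, every waiting session would see
    // the error, not just the one whose mutation triggered it.
    void sync() {
        durable.addAll(pending);
        pending.clear();
    }
}
```

In the real server the sessions block concurrently on the sync; the single-threaded version just shows why failure attribution is coarse-grained.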

That said, I'm not sure under what conditions Accumulo would "normally" throw you such an error (i.e., one not related to HDFS being hosed or something). Maybe the HoldTimeoutException (tserver being too busy)? I'd have to lock myself in a room and really take a good look at this stuff again to refresh the cases where Accumulo might actually apply an update but still send you an error... Maybe this isn't as big a concern as I'm making it, either :)
