[ https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253947#comment-15253947 ]
Shawn Walker commented on ACCUMULO-4229:
----------------------------------------

You say that every hour the collection of memoized {{TabletLocator}} s clears itself. While I see that tservers clear these client caches periodically, I can't see anywhere that the client code itself does so. So I wouldn't think that your analysis applies to a general client doing some sort of bulk insertion via {{BatchWriter}} s.

> BatchWriter writes to old, closed tablets leading to degraded write rates
> -------------------------------------------------------------------------
>
> Key: ACCUMULO-4229
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4229
> Project: Accumulo
> Issue Type: Bug
> Components: client
> Affects Versions: 1.7.1
> Reporter: Dylan Hutchison
>
> BatchWriters that run a long time have write rates that sometimes mysteriously decrease after the table they are writing to goes through a major compaction or a split. The decrease can be as bad as reducing throughput to 0.
> This was first mentioned in this [email thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E] for major compactions.
> I discovered this in this [email thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E] for splits. See the thread for some log messages.
> I turned on TRACE logs and I think I pinned it down: the TabletLocator cached by a BatchWriter gets out of sync with the static cache of TabletLocators.
> # The TabletServerBatchWriter caches a TabletLocator from the static collection of TabletLocators when it starts writing. Suppose it is writing to tablet T1.
> # The TabletServerBatchWriter uses its locally cached TabletLocator inside its `binMutations` method for its entire lifespan; this cache is never refreshed or updated to sync up with the static collection of TabletLocators.
> # Every hour, the static collection of TabletLocators clears itself. The next call to get a TabletLocator from the static collection allocates a new TabletLocator. Unfortunately, the TabletServerBatchWriter does not reflect this change and continues to use the old, locally cached TabletLocator.
> # Tablet T1 splits into T2 and T3, which closes T1. T1 no longer exists, so every entry meant for T1 that reaches the tablet server fails to write because T1 is closed.
> # The TabletServerBatchWriter receives the response from the tablet server that all entries failed to write. It invalidates the cache of the *new* TabletLocator obtained from the static collection of TabletLocators. The old TabletLocator that is cached locally does not get invalidated.
> # The TabletServerBatchWriter re-queues the failed entries and tries to write them to the same closed tablet T1, because it is still looking up tablets using the old TabletLocator.
> This behavior subsumes the circumstances William wrote about in the thread he mentioned. The problem would occur as a result of either splits or major compactions. It would only stop the BatchWriter if its entire memory filled up with writes to the same tablet that was closed as a result of a majc or split; otherwise it would just slow down the BatchWriter by failing to write some number of entries with every RPC.
> There are a few solutions we can think of.
> # Not have the MutationWriter inside the TabletServerBatchWriter locally cache TabletLocators. I suspect this was done for performance reasons, so it's probably not a good solution.
> # Have all the MutationWriters clear their cache at the same time the static TabletLocator cache clears. I like this one.
> We could store a reference to the Map that each MutationWriter has inside a static synchronized WeakHashMap. The only time the weak map needs to be accessed is:
> ## When a MutationWriter is constructed (from constructing a TabletServerBatchWriter), add its new local TabletLocator cache to the weak map.
> ## When the static TabletLocator cache is cleared, also clear every map in the weak map.
> # Another solution is to make the invalidate calls on the local TabletLocator cache rather than the global static one. If we go this route we should double check the idea to make sure it does not impact the correctness of any other pieces of code that use the cache. I like the previous idea better.
> The TimeoutTabletLocator does not help when no timeout is set on the BatchWriter (the default behavior).
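
For illustration, the second proposed solution in the quoted description (clearing every writer-local cache whenever the static cache clears) could look roughly like the sketch below. This is not Accumulo code: {{Locator}}, {{GlobalLocators}} and {{Writer}} are hypothetical stand-ins for TabletLocator, the static collection of TabletLocators, and the MutationWriter. One deliberate difference from the description is an assumption on my part: the weak registry is keyed on the writer object rather than on its Map, because java.util maps compare by content, so two empty local caches would collide as WeakHashMap keys. Thread safety of the caches themselves is ignored here; only the registry is synchronized.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;

public class LocatorCacheSketch {

    /** Hypothetical stand-in for a TabletLocator; not the Accumulo class. */
    static final class Locator {
        final int generation;
        Locator(int generation) { this.generation = generation; }
    }

    /** Hypothetical stand-in for the static collection of locators, keyed by table name. */
    static final class GlobalLocators {
        private final Map<String, Locator> locators = new HashMap<>();
        // Weak registry of live writers; an entry disappears once its writer
        // is no longer strongly reachable, so the registry does not leak writers.
        private final Set<Writer> writers = Collections.synchronizedSet(
            Collections.newSetFromMap(new WeakHashMap<Writer, Boolean>()));
        private int generation = 0;

        Locator get(String table) {
            return locators.computeIfAbsent(table, t -> new Locator(generation));
        }

        void register(Writer writer) { writers.add(writer); }

        /** Clears the shared cache and, at the same moment, every writer-local cache. */
        void clear() {
            locators.clear();
            generation++;
            synchronized (writers) { // needed when iterating a synchronized set
                for (Writer writer : writers) {
                    writer.clearLocal();
                }
            }
        }
    }

    /** Hypothetical stand-in for the writer that caches locators locally. */
    static final class Writer {
        private final GlobalLocators global;
        private final Map<String, Locator> local = new HashMap<>();

        Writer(GlobalLocators global) {
            this.global = global;
            global.register(this); // step 1 of the proposal: register on construction
        }

        Locator locatorFor(String table) {
            // Falls back to the shared collection only when nothing is cached locally.
            return local.computeIfAbsent(table, global::get);
        }

        void clearLocal() { local.clear(); }
    }

    public static void main(String[] args) {
        GlobalLocators global = new GlobalLocators();
        Writer writer = new Writer(global);
        Locator before = writer.locatorFor("t");
        global.clear(); // step 2 of the proposal: also clears the writer's cache
        Locator after = writer.locatorFor("t");
        System.out.println(before != after && after == global.get("t"));
    }
}
```

With the registry in place, the writer and the shared collection agree on the same locator after a clear, so a later invalidation made through the shared collection would reach the instance the writer actually uses. Whether the periodic clear described in the issue really happens on the client side is the open question raised in the comment above; the sketch does not settle that.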