[ 
https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253947#comment-15253947
 ] 

Shawn Walker commented on ACCUMULO-4229:
----------------------------------------

You say that every hour the collection of memoized {{TabletLocator}} s clears 
itself.  While I see that tservers clear these client caches periodically, I 
can't see anywhere that the client code itself does so.  So I wouldn't think 
that your analysis applies to a general client doing some sort of bulk 
insertion via {{BatchWriter}} s.

> BatchWriter writes to old, closed tablets leading to degraded write rates
> -------------------------------------------------------------------------
>
>                 Key: ACCUMULO-4229
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4229
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.7.1
>            Reporter: Dylan Hutchison
>
> BatchWriters that run a long time have write rates that sometimes 
> mysteriously decrease after the table it is writing to goes through a major 
> compaction or a split.  The decrease can be as bad as reducing throughput to 
> 0.
> This was first first mentioned in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E]
>  for major compactions.  
> I discovered this in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E]
>  for splits.  See the thread for some log messages.
> I turned on TRACE logs and I think I pinned it down: the TabletLocator cached 
> by a BatchWriter gets out of sync with the static cache of TabletLocators.
> # The TabletServerBatchWriter caches a TabletLocator from the static 
> collection of TabletLocators when it starts writing.  Suppose it is writing 
> to tablet T1.
> # The TabletServerBatchWriter uses its locally cached TabletLocator inside 
> its `binMutations` method for its entire lifespan; this cache is never 
> refreshed or updated to sync up with the static collection of TabletLocators.
> # Every hour, the static collection of TabletLocators clears itself.  The 
> next call to get a TabletLocator from the static collection allocates a new 
> TabletLocator.  Unfortunately, the TabletServerBatchWriter does not reflect 
> this change and continues to use the old, locally cached TabletLocator.
> # Tablet T1 splits into T2 and T3, which closes T1.  As such, it no longer 
> exists and the tablet server that receives the entries meant to go to T1 all 
> fail to write because T1 is closed.
> # The TabletServerBatchWriter receives the response from the tablet server 
> that all entries failed to write.  It invalidates the cache of the *new* 
> TabletLocator obtained from the static collection of TabletLocators.  The old 
> TabletLocator that is cached locally does not get invalidated.
> # The TabletServerBatchWriter re-queues the failed entries and tries to write 
> them to the same closed tablet T1, because it is still looking up tablets 
> using the old TabletLocator.
> This behavior subsumes the circumstances William wrote about in the thread he 
> mentioned.  The problem would occur as a result of either splits or major 
> compactions.  It would only stop the BatchWriter if its entire memory filled 
> up with writes to the same tablet that was closed as a result of a majc or 
> split; otherwise it would just slow down the BatchWriter by failing to write 
> some number of entries with every RPC.
> There are a few solutions we can think of.  
> # Not have the MutationWriter inside the TabletServerBatchWriter locally 
> cache TabletLocators.  I suspect this was done for performance reasons, so 
> it's probably not a good solution. 
> # Have all the MutationWriters clear their cache at the same time the static 
> TabletLocator cache clears.  I like this one.  We could store a reference to 
> the Map that each MutationWriter has inside a static synchronized 
> WeakHashMap.  The only time the weak map needs to be accessed is:
> ## When a MutationWriter is constructed (from constructing a 
> TabletServerBatchWriter), add its new local TabletLocator cache to the weak 
> map.
> ## When the static TabletLocator cache is cleared, also clear every map in 
> the weak map.
> # Another solution is to make the invalidate calls on the local TabletLocator 
> cache rather than the global static one.  If we go this route we should 
> double check the idea to make sure it does not impact the correctness of any 
> other pieces of code that use the cache. I like the previous idea better.
> The TimeoutTabletLocator does not help when no timeout is set on the 
> BatchWriter (the default behavior).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to