[ https://issues.apache.org/jira/browse/PHOENIX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274822#comment-15274822 ]
Josh Elser commented on PHOENIX-2883:
-------------------------------------

Right, the hypothesis: the hunch is that there is no coordination between the region closing and the async index rebuilding coming back online. Perhaps there needs to be coordination between the two to prevent this?

> Region close during automatic disabling of index for rebuilding can lead to RS abort
> ------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-2883
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2883
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>
> (disclaimer: still performing due-diligence on this one)
> I've been helping a user this week with what is thought to be a race
> condition in secondary index updates. This user has a relatively heavy
> write-based workload with a few tables that each have at least one index.
> When the region distribution is changing (concretely, we were doing a
> rolling restart of the cluster without the load balancer disabled, in the
> hopes of retaining as much availability as possible), I've seen the
> following general outline in the logs:
> * An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata
> cache expired or is just missing)
> * The index is taken offline to be asynchronously rebuilt
> * A flush on the data table's region is queued for quite some time
> * RS is asked to close a region (due to a move, commonly)
> * RS aborts because the memstore for the data table's region is in an
> inconsistent state (e.g. {{Assertion failed while closing store <region>
> <colfam> flushableSize expected=0, actual= 193392. Current
> memstoreSize=-552208. Maybe a coprocessor operation failed and left the
> memstore in a partially updated state.}})
> Some relevant HBase issues include HBASE-10514 and HBASE-10844.
> Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found
> anything definitively conclusive yet. Will dump findings here.
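
To make the "coordination" idea in the comment concrete, here is a minimal toy sketch in Java. It is not Phoenix or HBase code: `RegionGate`, `tryUpdate`, and `close` are hypothetical names, and the counter only stands in for memstore accounting. It assumes the race has the shape "an update lands while (or after) the region is closing", which has not been confirmed for this issue. Updates take a shared lock and the close takes the exclusive lock, so the close waits for in-flight updates and later updates are rejected instead of touching a closed region.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model only: not Phoenix or HBase code. All names are hypothetical. */
public class RegionGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean closed = false;                    // guarded by lock
    private final AtomicLong memstoreSize = new AtomicLong();

    /** Applies an update under the shared lock; returns false once the region is closed. */
    public boolean tryUpdate(long bytes) {
        lock.readLock().lock();
        try {
            if (closed) {
                return false;                          // caller has to retry elsewhere
            }
            memstoreSize.addAndGet(bytes);             // stand-in for the real write
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Waits for in-flight updates, then marks the region closed; returns the bytes "flushed". */
    public long close() {
        lock.writeLock().lock();
        try {
            closed = true;
            return memstoreSize.getAndSet(0);          // stand-in for the final flush
        } finally {
            lock.writeLock().unlock();
        }
    }

    public long memstoreSize() {
        return memstoreSize.get();
    }

    public static void main(String[] args) {
        RegionGate gate = new RegionGate();
        System.out.println(gate.tryUpdate(100));       // update before close
        System.out.println(gate.close());              // close flushes accepted bytes
        System.out.println(gate.tryUpdate(50));        // update after close
        System.out.println(gate.memstoreSize());       // accounting after close
    }
}
```

The point of the sketch is only that the size accounting cannot change once `close()` holds the exclusive lock. Whether the real fix belongs in the index rebuild path, in the region close path, or somewhere else is still open.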