Unfortunately I lost the stack trace overnight. But it does seem related to compaction, because now that the compaction tool is done, I don't see the issue anymore. I will run our incremental major compaction tool again and see if I can reproduce the issue.
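For context, stripped of its scheduling and throttling logic, the per-region step of that tool is roughly just a major compaction request through the Admin API. A simplified sketch (not the actual tool code; the table name is a placeholder) looks like this:

// Simplified sketch only: request a major compaction of a single region of a
// table via the HBase 1.x Admin API. "my_table" is a placeholder; the real
// tool adds per-region-server scheduling, throttling and retries.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactOneRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("my_table");  // placeholder
            List<HRegionInfo> regions = admin.getTableRegions(table);
            if (!regions.isEmpty()) {
                // Asynchronous request; the region server does the actual work.
                admin.majorCompactRegion(regions.get(0).getRegionName());
            }
        }
    }
}

The call just queues the compaction on the region server and returns immediately; the tool waits for it to finish before picking the next region.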
On the plus side, the system stayed stable and eventually recovered,
although it did suffer all those timeouts.

----
Saad

On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <[email protected]> wrote:

> I'll paste a thread dump later, writing this from my phone :-)
>
> So the same issue has happened at different times for different regions,
> but I couldn't see that the region in question was the one being
> compacted, either this time or earlier. Although I might have missed an
> earlier correlation in the logs where the issue started just after the
> compaction completed.
>
> Usually a compaction for this table's regions takes around 5-10 minutes:
> much less for its smaller column family, which is block cache enabled
> (around a minute or less), and 5-10 minutes for the much larger one, for
> which we have block cache disabled in the schema because we don't ever
> read it in the primary cluster. So the only impact on reads would be from
> that smaller column family, which takes less than a minute to compact.
>
> But the issue, once started, doesn't seem to recover for a long time,
> long past when any compaction on the region itself could impact anything.
> The compaction tool, which is our own code, has long since moved on to
> other regions.
>
> Cheers.
>
> ----
> Saad
>
>
> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <[email protected]> wrote:
>
>> bq. timing out trying to obtain write locks on rows in that region.
>>
>> Can you confirm that the region under contention was the one being major
>> compacted?
>>
>> Can you pastebin the thread dump so that we can have a better idea of
>> the scenario?
>>
>> For the region being compacted, how long would the compaction take (just
>> want to see if there was correlation between this duration and timeout)?
>>
>> Cheers
>>
>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > We are running on Amazon EMR based HBase 1.4.0. We are currently
>> > seeing a situation where sometimes a particular region gets into a
>> > state where a lot of write requests to any row in that region time out
>> > saying they failed to obtain a lock on a row in the region, and
>> > eventually they experience an IPC timeout. This causes the IPC queue
>> > to blow up in size as requests get backed up, and that region server
>> > experiences a much higher than normal timeout rate for all requests,
>> > not just those timing out for failing to obtain the row lock.
>> >
>> > The strange thing is the rows are always different but the region is
>> > always the same. So the question is, is there a region component to
>> > how long a row write lock would be held? I looked at the debug dump,
>> > and the RowLocks section shows a long list of write row locks held,
>> > all of them from the same region but for different rows.
>> >
>> > Will trying to obtain a write row lock experience delays if no one
>> > else holds a lock on the same row, but the region itself is
>> > experiencing read delays? We do have an incremental compaction tool
>> > running that major compacts one region per region server at a time,
>> > so that will drive out pages from the bucket cache. But for most
>> > regions the impact is transient until the bucket cache gets populated
>> > by pages from the new HFile. But for this one region we start timing
>> > out trying to obtain write locks on rows in that region.
>> >
>> > Any insight anyone can provide would be most welcome.
>> >
>> > Cheers.
>> >
>> > ----
>> > Saad
>> >
>>
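P.S. For anyone reading along, the block cache setting mentioned above is just the per-column-family flag in the table schema. In HBase 1.x Java terms it looks roughly like this (family and table names are placeholders, not our real schema):

// Illustrative only: how a column family ends up with the block cache
// disabled in the table schema (names are placeholders).
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class SchemaSketch {
    public static HTableDescriptor buildDescriptor() {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));

        HColumnDescriptor smallCf = new HColumnDescriptor("small_cf");
        smallCf.setBlockCacheEnabled(true);   // reads served through the block/bucket cache

        HColumnDescriptor bigCf = new HColumnDescriptor("big_cf");
        bigCf.setBlockCacheEnabled(false);    // never read in the primary cluster

        desc.addFamily(smallCf);
        desc.addFamily(bigCf);
        return desc;
    }
}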
