Actually it happened again while some minor compactions were running, so I don't think it's related to our major compaction tool, which isn't even running right now. I will try to capture a debug dump of threads and everything while the event is ongoing. The episodes seem to last at least half an hour or so, sometimes longer.
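In case it helps, the plan is just to pull the region server's debug dump (the same page that contains the thread dump and the RowLocks section) while the event is in progress. Below is a minimal sketch of that, assuming the default region server info port 16030 and the /dump endpoint; the host name and output file naming are placeholders, not anything from our setup:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    /**
     * Saves the region server's debug dump (thread dump, RowLocks, queues)
     * to a timestamped local file so it can be captured while the event is
     * still in progress. Host, the default info port 16030, and the /dump
     * path are assumptions -- adjust for the cluster.
     */
    public class CaptureRsDebugDump {
        public static void main(String[] args) throws Exception {
            String rsHost = args.length > 0 ? args[0] : "my-regionserver-host";
            URL dumpUrl = new URL("http://" + rsHost + ":16030/dump");
            Path out = Paths.get("rs-dump-" + System.currentTimeMillis() + ".txt");
            try (InputStream in = dumpUrl.openStream()) {
                Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
            }
            System.out.println("Wrote " + out.toAbsolutePath());
        }
    }

The same dump is also reachable from the "Debug dump" link on the region server web UI, so a script like this is only useful for grabbing it repeatedly while the timeouts are happening.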
----
Saad

On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Unfortunately I lost the stack trace overnight. But it does seem related
> to compaction, because now that the compaction tool is done, I don't see
> the issue anymore. I will run our incremental major compaction tool again
> and see if I can reproduce the issue.
>
> On the plus side, the system stayed stable and eventually recovered,
> although it did suffer all those timeouts.
>
> ----
> Saad
>
> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> I'll paste a thread dump later, writing this from my phone :-)
>>
>> The same issue has happened at different times for different regions,
>> but I couldn't see that the region in question was the one being
>> compacted, either this time or earlier, although I might have missed an
>> earlier correlation in the logs where the issue started just after a
>> compaction completed.
>>
>> Usually a compaction of this table's regions takes around 5-10 minutes:
>> much less for its smaller column family, which has the block cache
>> enabled (around a minute or less), and 5-10 minutes for the much larger
>> one, for which we have the block cache disabled in the schema because we
>> never read it in the primary cluster. So the only impact on reads would
>> be from that smaller column family, which takes less than a minute to
>> compact.
>>
>> But once the issue starts, it doesn't seem to recover for a long time,
>> long past when any compaction on the region itself could impact
>> anything. The compaction tool, which is our own code, has long since
>> moved on to other regions.
>>
>> Cheers.
>>
>> ----
>> Saad
>>
>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. timing out trying to obtain write locks on rows in that region.
>>>
>>> Can you confirm that the region under contention was the one being
>>> major compacted?
>>>
>>> Can you pastebin a thread dump so that we can have a better idea of
>>> the scenario?
>>>
>>> For the region being compacted, how long would the compaction take
>>> (just want to see if there was a correlation between this duration and
>>> the timeouts)?
>>>
>>> Cheers
>>>
>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We are running HBase 1.4.0 on Amazon EMR. We are currently seeing a
>>> > situation where sometimes a particular region gets into a state where
>>> > a lot of write requests to any row in that region time out saying
>>> > they failed to obtain a lock on a row in the region, and eventually
>>> > they hit an IPC timeout. This causes the IPC queue to blow up in size
>>> > as requests get backed up, and that region server experiences a much
>>> > higher than normal timeout rate for all requests, not just those
>>> > timing out for failing to obtain the row lock.
>>> >
>>> > The strange thing is the rows are always different, but the region is
>>> > always the same. So the question is: is there a region-level
>>> > component to how long a row write lock is held? I looked at the debug
>>> > dump, and the RowLocks section shows a long list of write row locks
>>> > held, all of them from the same region but on different rows.
>>> >
>>> > Will trying to obtain a write row lock experience delays if no one
>>> > else holds a lock on the same row, but the region itself is
>>> > experiencing read delays?
>>> >
>>> > We do have an incremental compaction tool running that major compacts
>>> > one region per region server at a time, so that will drive pages out
>>> > of the bucket cache. For most regions the impact is transient, until
>>> > the bucket cache gets repopulated with pages from the new HFile. But
>>> > for this one region we start timing out trying to obtain write locks
>>> > on rows in that region.
>>> >
>>> > Any insight anyone can provide would be most welcome.
>>> >
>>> > Cheers.
>>> >
>>> > ----
>>> > Saad
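For anyone curious about the "one region per region server at a time" approach mentioned in the quoted message, the sketch below shows roughly how such a tool can be built on the HBase 1.x Admin API. This is not our actual production code; the table-name argument, the 30-second poll interval, and the wave-based scheduling are illustrative assumptions only:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.protobuf.generated.AdminProtos.GetRegionInfoResponse.CompactionState;

    /**
     * Rolling major compaction sketch: request a major compaction for at
     * most one region per region server at a time, waiting for each wave
     * to finish before starting the next, so only a slice of each server's
     * bucket cache is churned at once.
     */
    public class RollingMajorCompactor {
        public static void main(String[] args) throws Exception {
            TableName table = TableName.valueOf(args[0]);
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin();
                 RegionLocator locator = conn.getRegionLocator(table)) {

                // Group the table's regions by the server currently hosting them.
                Map<ServerName, Deque<byte[]>> byServer = new HashMap<>();
                for (HRegionLocation loc : locator.getAllRegionLocations()) {
                    byServer.computeIfAbsent(loc.getServerName(), s -> new ArrayDeque<>())
                            .add(loc.getRegionInfo().getRegionName());
                }

                // Each wave takes at most one region from every server's queue.
                while (byServer.values().stream().anyMatch(q -> !q.isEmpty())) {
                    List<byte[]> wave = new ArrayList<>();
                    for (Deque<byte[]> queue : byServer.values()) {
                        if (!queue.isEmpty()) {
                            wave.add(queue.poll());
                        }
                    }
                    for (byte[] regionName : wave) {
                        admin.majorCompactRegion(regionName);  // asynchronous request
                    }
                    // Give the requests a moment to be picked up, then poll until
                    // every region in this wave reports no compaction running.
                    Thread.sleep(30_000L);
                    for (byte[] regionName : wave) {
                        while (admin.getCompactionStateForRegion(regionName)
                                   != CompactionState.NONE) {
                            Thread.sleep(30_000L);
                        }
                    }
                }
            }
        }
    }

A real tool would also have to deal with regions moving or splitting between waves, retries on failed requests, and pacing between waves; the sketch only illustrates the per-server scheduling idea.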