Actually, it happened again while some minor compactions were running, so I
don't think it's related to our major compaction tool, which isn't even
running right now. I will try to capture a debug dump of threads and
everything while the event is ongoing. It seems to last at least half an
hour, sometimes longer.
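
For reference, this is roughly the kind of capture I have in mind (just a
sketch, not our actual tooling): the full thread stacks from the region
server JVM, either via jstack -l <pid> against the region server process,
or programmatically through the standard JMX API as below, plus the region
server UI's debug dump, which is where the RowLocks section comes from.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    // Minimal sketch: dumps the current JVM's live threads with lock info;
    // against the region server itself we would just run jstack -l <pid>.
    public class ThreadDumper {
        public static void main(String[] args) {
            ThreadInfo[] threads =
                ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
            for (ThreadInfo ti : threads) {
                // ThreadInfo.toString() includes thread state, held locks
                // and the top stack frames.
                System.out.print(ti);
            }
        }
    }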

----
Saad


On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Unfortunately I lost the stack trace overnight. But it does seem related
> to compaction, because now that the compaction tool is done, I don't see
> the issue anymore. I will run our incremental major compaction tool again
> and see if I can reproduce the issue.
>
> On the plus side, the system stayed stable and eventually recovered,
> although it did suffer all those timeouts.
>
> ----
> Saad
>
>
> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> I'll paste a thread dump later, writing this from my phone  :-)
>>
>> So the same issue has happened at different times for different regions,
>> but I couldn't see that the region in question was the one being compacted,
>> either this time or earlier, although I might have missed a correlation in
>> the logs where the issue started just after a compaction completed.
>>
>> Usually a compaction of one of this table's regions takes around 5-10
>> minutes: under a minute for the smaller column family, which has the block
>> cache enabled, and 5-10 minutes for the much larger one, for which we have
>> the block cache disabled in the schema because we never read it in the
>> primary cluster. So the only impact on reads should come from that smaller
>> column family, which takes less than a minute to compact.
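>>
>> (For reference, a minimal sketch of that schema setting using the HBase 1.x
>> API; the family names here are placeholders, not our real ones:)
>>
>>     import org.apache.hadoop.hbase.HColumnDescriptor;
>>
>>     // Large family we never read in the primary cluster: keep its blocks
>>     // out of the block cache entirely.
>>     HColumnDescriptor bigCf = new HColumnDescriptor("big_cf");
>>     bigCf.setBlockCacheEnabled(false);
>>
>>     // Smaller, read-heavy family keeps the default (block cache enabled).
>>     HColumnDescriptor smallCf = new HColumnDescriptor("small_cf");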
>>
>> But once the issue starts, it doesn't seem to recover for a long time,
>> long past the point where any compaction on the region itself could impact
>> anything. The compaction tool, which is our own code, has long since moved
>> on to other regions.
>>
>> Cheers.
>>
>> ----
>> Saad
>>
>>
>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. timing out trying to obtain write locks on rows in that region.
>>>
>>> Can you confirm that the region under contention was the one being major
>>> compacted?
>>>
>>> Can you pastebin a thread dump so that we can get a better idea of the
>>> scenario?
>>>
>>> For the region being compacted, how long would the compaction take (just
>>> want to see if there was a correlation between that duration and the
>>> timeouts)?
>>>
>>> Cheers
>>>
>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We are running HBase 1.4.0 on Amazon EMR. We are currently seeing a
>>> > situation where, every so often, a particular region gets into a state
>>> > where a lot of write requests to any row in that region time out
>>> > reporting that they failed to obtain the row lock, and eventually they
>>> > hit an IPC timeout. This causes the IPC queue to blow up in size as
>>> > requests get backed up, and that region server experiences a much higher
>>> > than normal timeout rate for all requests, not just those timing out on
>>> > the row lock.
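>>> >
>>> > (As far as I can tell from the 1.4 code, so treat this as my assumption:
>>> > the server-side wait for a row lock is bounded by
>>> > hbase.rowlock.wait.duration, 30 seconds by default, and a write that
>>> > cannot take the lock within that window fails, which would match these
>>> > timeouts. A sketch of reading the knob, which is normally set in
>>> > hbase-site.xml on the region servers:)
>>> >
>>> >     import org.apache.hadoop.conf.Configuration;
>>> >     import org.apache.hadoop.hbase.HBaseConfiguration;
>>> >
>>> >     Configuration conf = HBaseConfiguration.create();
>>> >     // Server-side bound on how long a write waits for a row lock.
>>> >     int rowLockWaitMs = conf.getInt("hbase.rowlock.wait.duration", 30000);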
>>> >
>>> > The strange thing is that the rows are always different but the region
>>> > is always the same. So the question is: is there a region-level component
>>> > to how long a row write lock is held? I looked at the debug dump, and the
>>> > RowLocks section shows a long list of write row locks held, all of them
>>> > from the same region but for different rows.
>>> >
>>> > Will an attempt to obtain a write row lock experience delays if no one
>>> > else holds a lock on the same row, but the region itself is experiencing
>>> > read delays? We do have an incremental compaction tool running that major
>>> > compacts one region per region server at a time, which will evict that
>>> > region's blocks from the bucket cache. For most regions the impact is
>>> > transient, lasting only until the bucket cache gets repopulated with
>>> > blocks from the new HFile. But for this one region we start timing out
>>> > trying to obtain write locks on its rows.
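>>> >
>>> > (For context, per region the tool essentially does the equivalent of the
>>> > following; just a sketch with error handling and the wait-for-completion
>>> > loop omitted, and conf and regionNameBytes are placeholders:)
>>> >
>>> >     import org.apache.hadoop.hbase.client.Admin;
>>> >     import org.apache.hadoop.hbase.client.Connection;
>>> >     import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> >
>>> >     try (Connection conn = ConnectionFactory.createConnection(conf);
>>> >          Admin admin = conn.getAdmin()) {
>>> >         // Major compact exactly one region; the tool walks the region
>>> >         // servers and does this for one region per server at a time.
>>> >         admin.majorCompactRegion(regionNameBytes);
>>> >     }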
>>> >
>>> > Any insight anyone can provide would be most welcome.
>>> >
>>> > Cheers.
>>> >
>>> > ----
>>> > Saad
>>> >
>>>
>>
>
