Unfortunately I lost the stack trace overnight. But it does seem related to compaction, because now that the compaction tool is done, I don't see the issue anymore. I will run our incremental major compaction tool again and see if I can reproduce the issue.
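For context, stripped of its scheduling and throttling logic, the per-region step of that tool is roughly just a major compaction request through the Admin API. A simplified sketch (not the actual tool code; the table name is a placeholder) looks like this:

// Simplified sketch only: request a major compaction of a single region of a
// table via the HBase 1.x Admin API. "my_table" is a placeholder; the real
// tool adds per-region-server scheduling, throttling and retries.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactOneRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("my_table");  // placeholder
            List<HRegionInfo> regions = admin.getTableRegions(table);
            if (!regions.isEmpty()) {
                // Asynchronous request; the region server does the actual work.
                admin.majorCompactRegion(regions.get(0).getRegionName());
            }
        }
    }
}

The call just queues the compaction on the region server and returns immediately; the tool waits for it to finish before picking the next region.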
On the plus side, the system stayed stable and eventually recovered,
although it did suffer all those timeouts.

----
Saad

On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <[email protected]> wrote:

> I'll paste a thread dump later, writing this from my phone :-)
>
> So the same issue has happened at different times for different regions,
> but I couldn't see that the region in question was the one being
> compacted, either this time or earlier. Although I might have missed an
> earlier correlation in the logs where the issue started just after the
> compaction completed.
>
> Usually a compaction for this table's regions takes around 5-10 minutes:
> much less for its smaller column family, which is block cache enabled
> (around a minute or less), and 5-10 minutes for the much larger one, for
> which we have block cache disabled in the schema because we don't ever
> read it in the primary cluster. So the only impact on reads would be from
> that smaller column family, which takes less than a minute to compact.
>
> But the issue, once started, doesn't seem to recover for a long time,
> long past when any compaction on the region itself could impact anything.
> The compaction tool, which is our own code, has long since moved on to
> other regions.
>
> Cheers.
>
> ----
> Saad
>
>
> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <[email protected]> wrote:
>
>> bq. timing out trying to obtain write locks on rows in that region.
>>
>> Can you confirm that the region under contention was the one being major
>> compacted?
>>
>> Can you pastebin the thread dump so that we can have a better idea of
>> the scenario?
>>
>> For the region being compacted, how long would the compaction take (just
>> want to see if there was correlation between this duration and timeout)?
>>
>> Cheers
>>
>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > We are running on Amazon EMR based HBase 1.4.0. We are currently
>> > seeing a situation where sometimes a particular region gets into a
>> > state where a lot of write requests to any row in that region time out
>> > saying they failed to obtain a lock on a row in the region, and
>> > eventually they experience an IPC timeout. This causes the IPC queue
>> > to blow up in size as requests get backed up, and that region server
>> > experiences a much higher than normal timeout rate for all requests,
>> > not just those timing out for failing to obtain the row lock.
>> >
>> > The strange thing is the rows are always different but the region is
>> > always the same. So the question is, is there a region component to
>> > how long a row write lock would be held? I looked at the debug dump,
>> > and the RowLocks section shows a long list of write row locks held,
>> > all of them from the same region but for different rows.
>> >
>> > Will trying to obtain a write row lock experience delays if no one
>> > else holds a lock on the same row, but the region itself is
>> > experiencing read delays? We do have an incremental compaction tool
>> > running that major compacts one region per region server at a time,
>> > so that will drive out pages from the bucket cache. But for most
>> > regions the impact is transient until the bucket cache gets populated
>> > by pages from the new HFile. But for this one region we start timing
>> > out trying to obtain write locks on rows in that region.
>> >
>> > Any insight anyone can provide would be most welcome.
>> >
>> > Cheers.
>> >
>> > ----
>> > Saad
>> >
>>
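P.S. For anyone reading along, the block cache setting mentioned above is just the per-column-family flag in the table schema. In HBase 1.x Java terms it looks roughly like this (family and table names are placeholders, not our real schema):

// Illustrative only: how a column family ends up with the block cache
// disabled in the table schema (names are placeholders).
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class SchemaSketch {
    public static HTableDescriptor buildDescriptor() {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));

        HColumnDescriptor smallCf = new HColumnDescriptor("small_cf");
        smallCf.setBlockCacheEnabled(true);   // reads served through the block/bucket cache

        HColumnDescriptor bigCf = new HColumnDescriptor("big_cf");
        bigCf.setBlockCacheEnabled(false);    // never read in the primary cluster

        desc.addFamily(smallCf);
        desc.addFamily(bigCf);
        return desc;
    }
}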
