Also, for now we have mitigated this problem by using the new setting in
HBase 1.4.0 that prevents one slow region server from blocking all client
requests. Of course it causes some timeouts but our overall ecosystem
contains Kafka queues for retries, so we can live with that. From what I
can see, it looks like this setting also has the good effect of preventing
clients from hammering a region server that is slow because its IPC queues
are backed up, allowing it to recover faster.

Does that make sense?

Cheers.

----
Saad


On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> So if I understand correctly, we would mitigate the problem by not
> evicting blocks for archived files immediately? Wouldn't this potentially
> lead to problems later if the LRU algo chooses to evict blocks for active
> files and leave blocks for archived files in there?
>
> I would definitely love to test this!!! Unfortunately we are running on
> EMR and the details of how to patch HBase under EMR are not clear to me :-(
>
> What we would really love would be a setting for actually immediately
> caching blocks for a new compacted file. I have seen in the code that even
> is we have the cache on write setting set to true, it will refuse to cache
> blocks for a file that is a newly compacted one. In our case we have sized
> the bucket cache to be big enough to hold all our data, and really want to
> avoid having to go to S3 until the last possible moment. A config setting
> to test this would be great.
>
> But thanks everyone for your feedback. Any more would also be welcome on
> the idea to let a user cache all newly compacted files.
>
> ----
> Saad
>
>
> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com> wrote:
>
>> >>a) it was indeed one of the regions that was being compacted, major
>> compaction in one case, minor compaction in another, the issue started
>> just
>> after compaction completed blowing away bucket cached blocks for the older
>> HFile's
>>
>> About this part.    Ya after the compaction, there is a step where the
>> compacted away HFile's blocks getting removed from cache. This op takes a
>> write lock for this region (In Bucket Cache layer)..  Every read op which
>> is part of checkAndPut will try read from BC and that in turn need a read
>> lock for this region.  So there is chances that the read locks starve
>> because of so many frequent write locks .  Each block evict will attain
>> the
>> write lock one after other.  Will it be possible for you to patch this
>> evict and test once? We can avoid the immediate evict from BC after
>> compaction. I can help you with a patch if you wish
>>
>> Anoop
>>
>>
>>
>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
>> ramkrishna.s.vasude...@gmail.com> wrote:
>> > Hi Saad
>> >
>> > Your argument here
>> >>> The
>> >>>theory is that since prefetch is an async operation, a lot of the reads
>> in
>> >>>the checkAndPut for the region in question start reading from S3 which
>> is
>> >>>slow. So the write lock obtained for the checkAndPut is held for a
>> longer
>> >>>duration than normal. This has cascading upstream effects. Does that
>> sound
>> >>>plausible?
>> >
>> > Seems very much plausible. So before even the prefetch happens say for
>> > 'block 1' - and you have already issues N checkAndPut calls for the rows
>> in
>> > that 'block 1' -  all those checkAndPut will have to read that block
>> from
>> > S3 to perform the get() and then apply the mutation.
>> >
>> > This may happen for multiple threads at the same time because we are not
>> > sure when the prefetch would have actually been completed. I don know
>> what
>> > are the general read characteristics when a read happens from S3 but you
>> > could try to see how things work when a read happens from S3 and after
>> the
>> > prefetch completes ensure the same checkandPut() is done (from cache
>> this
>> > time) to really know the difference what S3 does there.
>> >
>> > Regards
>> > Ram
>> >
>> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>> >
>> >> So after much investigation I can confirm:
>> >>
>> >> a) it was indeed one of the regions that was being compacted, major
>> >> compaction in one case, minor compaction in another, the issue started
>> just
>> >> after compaction completed blowing away bucket cached blocks for the
>> older
>> >> HFile's
>> >> b) in another case there was no compaction just a newly opened region
>> in
>> a
>> >> region server that hadn't finished perfetching its pages from S3
>> >>
>> >> We have prefetch on open set to true. Our load is heavy on checkAndPut
>> .The
>> >> theory is that since prefetch is an async operation, a lot of the reads
>> in
>> >> the checkAndPut for the region in question start reading from S3 which
>> is
>> >> slow. So the write lock obtained for the checkAndPut is held for a
>> longer
>> >> duration than normal. This has cascading upstream effects. Does that
>> sound
>> >> plausible?
>> >>
>> >> The part I don't understand still is all the locks held are for the
>> same
>> >> region but are all for different rows. So once the prefetch is
>> completed,
>> >> shouldn't the problem clear up quickly? Or does the slow region slow
>> down
>> >> anyone trying to do checkAndPut on any row in the same region even
>> after
>> >> the prefetch has completed. That is, do the long held row locks prevent
>> >> others from getting a row lock on a different row in the same region?
>> >>
>> >> In any case, we trying to use
>> >> https://issues.apache.org/jira/browse/HBASE-16388 support in HBase
>> 1.4.0
>> >> to
>> >> both insulate the app a bit from this situation and hoping that it will
>> >> reduce pressure on the region server in question, allowing it to
>> recover
>> >> faster. I haven't quite tested that yet, any advice in the meantime
>> would
>> >> be appreciated.
>> >>
>> >> Cheers.
>> >>
>> >> ----
>> >> Saad
>> >>
>> >>
>> >>
>> >> On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>> >>
>> >> > Actually it happened again while some minior compactions were
>> running,
>> so
>> >> > don't think it related to our major compaction tool, which isn't even
>> >> > running right now. I will try to capture a debug dump of threads and
>> >> > everything while the event is ongoing. Seems to last at least half an
>> >> hour
>> >> > or so and sometimes longer.
>> >> >
>> >> > ----
>> >> > Saad
>> >> >
>> >> >
>> >> > On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>> >> >
>> >> >> Unfortunately I lost the stack trace overnight. But it does seem
>> related
>> >> >> to compaction, because now that the compaction tool is done, I don't
>> see
>> >> >> the issue anymore. I will run our incremental major compaction tool
>> >> again
>> >> >> and see if I can reproduce the issue.
>> >> >>
>> >> >> On the plus side the system stayed stable and eventually recovered,
>> >> >> although it did suffer all those timeouts.
>> >> >>
>> >> >> ----
>> >> >> Saad
>> >> >>
>> >> >>
>> >> >> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> I'll paste a thread dump later, writing this from my phone  :-)
>> >> >>>
>> >> >>> So the same issue has happened at different times for different
>> >> regions,
>> >> >>> but I couldn't see that the region in question was the one being
>> >> compacted,
>> >> >>> either this time or earlier. Although I might have missed an
>> earlier
>> >> >>> correlation in the logs where the issue started just after the
>> >> compaction
>> >> >>> completed.
>> >> >>>
>> >> >>> Usually a compaction for this table's regions take around 5-10
>> minutes,
>> >> >>> much less for its smaller column family which is block cache
>> enabled,
>> >> >>> around a minute or less, and 5-10 minutes for the much larger one
>> for
>> >> which
>> >> >>> we have block cache disabled in the schema, because we don't ever
>> read
>> >> it
>> >> >>> in the primary cluster. So the only impact on reads would be from
>> that
>> >> >>> smaller column family which takes less than a minute to compact.
>> >> >>>
>> >> >>> But the issue once started doesn't seem to recover for a long time,
>> >> long
>> >> >>> past when any compaction on the region itself could impact
>> anything.
>> >> The
>> >> >>> compaction tool which is our own code has long since moved to other
>> >> >>> regions.
>> >> >>>
>> >> >>> Cheers.
>> >> >>>
>> >> >>> ----
>> >> >>> Saad
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com>
>> wrote:
>> >> >>>
>> >> >>>> bq. timing out trying to obtain write locks on rows in that
>> region.
>> >> >>>>
>> >> >>>> Can you confirm that the region under contention was the one being
>> >> major
>> >> >>>> compacted ?
>> >> >>>>
>> >> >>>> Can you pastebin thread dump so that we can have better idea of
>> the
>> >> >>>> scenario ?
>> >> >>>>
>> >> >>>> For the region being compacted, how long would the compaction take
>> >> (just
>> >> >>>> want to see if there was correlation between this duration and
>> >> timeout)
>> >> >>>> ?
>> >> >>>>
>> >> >>>> Cheers
>> >> >>>>
>> >> >>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com
>> >
>> >> >>>> wrote:
>> >> >>>>
>> >> >>>> > Hi,
>> >> >>>> >
>> >> >>>> > We are running on Amazon EMR based HBase 1.4.0 . We are
>> currently
>> >> >>>> seeing a
>> >> >>>> > situation where sometimes a particular region gets into a
>> situation
>> >> >>>> where a
>> >> >>>> > lot of write requests to any row in that region timeout saying
>> they
>> >> >>>> failed
>> >> >>>> > to obtain a lock on a row in a region and eventually they
>> experience
>> >> >>>> an IPC
>> >> >>>> > timeout. This causes the IPC queue to blow up in size as
>> requests
>> >> get
>> >> >>>> > backed up, and that region server experiences a much higher than
>> >> >>>> normal
>> >> >>>> > timeout rate for all requests, not just those timing out for
>> failing
>> >> >>>> to
>> >> >>>> > obtain the row lock.
>> >> >>>> >
>> >> >>>> > The strange thing is the rows are always different but the
>> region
>> is
>> >> >>>> always
>> >> >>>> > the same. So the question is, is there a region component to how
>> >> long
>> >> >>>> a row
>> >> >>>> > write lock would be held? I looked at the debug dump and the
>> >> RowLocks
>> >> >>>> > section shows a long list of write row locks held, all of them
>> are
>> >> >>>> from the
>> >> >>>> > same region but different rows.
>> >> >>>> >
>> >> >>>> > Will trying to obtain a write row lock experience delays if no
>> one
>> >> >>>> else
>> >> >>>> > holds a lock on the same row but the region itself is
>> experiencing
>> >> >>>> read
>> >> >>>> > delays? We do have an incremental compaction tool running that
>> major
>> >> >>>> > compacts one region per region server at a time, so that will
>> drive
>> >> >>>> out
>> >> >>>> > pages from the bucket cache. But for most regions the impact is
>> >> >>>> > transitional until the bucket cache gets populated by pages from
>> the
>> >> >>>> new
>> >> >>>> > HFile. But for this one region we start timing out trying to
>> obtain
>> >> >>>> write
>> >> >>>> > locks on rows in that region.
>> >> >>>> >
>> >> >>>> > Any insight anyone can provide would be most welcome.
>> >> >>>> >
>> >> >>>> > Cheers.
>> >> >>>> >
>> >> >>>> > ----
>> >> >>>> > Saad
>> >> >>>> >
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>>
>
>

Reply via email to