Thanks for the detailed background information. I assume your code has already de-duplicated the filters contained in FilterListWithOR.
I took a look at the JIRAs which touched hbase-client/src/main/java/org/apache/hadoop/hbase/filter in branch-1.4. There were a few patches (some were very big) since the release of 1.3.0, so it is not obvious at first glance which one(s) might be related.

I noticed ColumnPrefixFilter.getNextCellHint (and KeyValueUtil.createFirstOnRow) appearing many times in the stack trace. I plan to dig more in this area.

Cheers

On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla <srinid...@trulia.com> wrote:

> Sure thing. For our table schema, each row represents one user and the row key is that user’s unique id in our system. We currently only use one column family in the table. The column qualifiers represent an item that has been surfaced to that user, as well as additional information to differentiate the way the item has been surfaced to the user. Without getting into too many specifics, the qualifier follows the rough format of:
>
> “Channel-itemId-distinguisher”
>
> The channel here is the channel through which the item was previously surfaced to the user. The itemId is the unique id of the item that has been surfaced to the user. A distinguisher is some attribute about how that item was surfaced to the user.
>
> When we run a scan, we currently only ever run it on one row at a time. It was chosen over ‘get’ because (from our understanding) the performance difference is negligible, and down the road using scan would allow us some more flexibility.
>
> The filter list that is constructed with the scan works by using a ColumnPrefixFilter as you mentioned. When a user is being communicated to on a particular channel, we have a list of items that we want to potentially surface for that user. So, we construct a prefix list with the channel and each of the item ids in the form of: “channel-itemId”.
> Then we run a scan on that row with that filter list using “WithOr” to get all of the matching channel-itemId combinations currently in that row/column family in the table. This way we can then know which of the items we want to surface to that user on that channel have already been surfaced on that channel. The reason we query using a prefix filter is so that we don’t need to know the ‘distinguisher’ part of the record when writing the actual query, because the distinguisher is only relevant in certain circumstances.
>
> Let me know if this is the information about our query pattern that you were looking for and if there is anything I can clarify or add.
>
> Thanks,
> Srinidhi
>
> On 9/6/18, 12:24 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
> > From the stack trace, ColumnPrefixFilter is used during scan.
> >
> > Can you illustrate how various filters are formed thru FilterListWithOR? It would be easier for other people to reproduce the problem given your query pattern.
> >
> > Cheers
> >
> > On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla <srinid...@trulia.com> wrote:
> >
> > > Hi Vlad,
> > >
> > > Thank you for the suggestion. I recreated the issue and attached the stack traces I took. Let me know if there’s any other info I can provide. We narrowed the issue down to occurring when upgrading from 1.3.0 to any 1.4.x version.
> > >
> > > Thanks,
> > > Srinidhi
> > >
> > > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodio...@gmail.com> wrote:
> > >
> > > > Hi, Srinidhi
> > > >
> > > > Next time you see this issue, take a jstack of a RS several times in a row. W/o stack traces it is hard to tell what was going on with your cluster after the upgrade.
> > > >
> > > > -Vlad
> > > >
> > > > On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla <srinid...@trulia.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > We are currently running HBase 1.3.0 on an EMR cluster running EMR 5.5.0. Recently, we attempted to upgrade our cluster to using HBase 1.4.4 (along with upgrading our EMR cluster to 5.16). After upgrading, the CPU usage for all of our region servers spiked up to 90%. The load_one for all of our servers spiked from roughly 1-2 to 10 threads. After upgrading, the number of operations to the cluster hasn’t increased. After giving the cluster a few hours, we had to revert the upgrade. From the logs, we are unable to tell what is occupying the CPU resources. Is this a known issue with 1.4.4? Any guidance or ideas for debugging the cause would be greatly appreciated. What are the best steps for debugging CPU usage?
> > > > >
> > > > > Thank you,
> > > > > Srinidhi
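[Editor's note] For anyone trying to reproduce this, the query pattern Srinidhi describes (a single-row scan carrying an OR-combined list of ColumnPrefixFilters) can be modeled in plain Java. The sketch below uses only the standard library: the channel names, item ids, and qualifier strings are hypothetical, and the `startsWith` check stands in for the server-side semantics of an HBase `FilterList(Operator.MUST_PASS_ONE)` of `ColumnPrefixFilter`s. The real client code would build that `FilterList`, call `scan.setFilter(...)`, and restrict the scan to the single user row.

```java
import java.util.List;

// Stdlib-only model of FilterList(MUST_PASS_ONE) over ColumnPrefixFilters:
// a column qualifier passes if it starts with ANY of the "channel-itemId"
// prefixes. Names and values below are hypothetical, for illustration only.
public class PrefixOrFilterModel {

    // OR semantics: pass as soon as one prefix matches.
    static boolean matchesAnyPrefix(String qualifier, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (qualifier.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical prefixes: one channel combined with candidate item ids.
        List<String> prefixes = List.of("email-item42", "email-item99");

        // Hypothetical qualifiers stored in the user's row, following the
        // "channel-itemId-distinguisher" format described in the thread.
        List<String> qualifiers = List.of(
                "email-item42-mobile",  // matches prefix "email-item42"
                "push-item42-web",      // different channel, no match
                "email-item7-mobile");  // item not in the candidate list

        for (String q : qualifiers) {
            System.out.println(q + " -> " + matchesAnyPrefix(q, prefixes));
        }
    }
}
```

Because the prefix omits the distinguisher, a match is found regardless of how the item was originally surfaced, which is exactly why the prefix filter was chosen over an exact-qualifier lookup.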