In the previous stack trace you sent, the shortCompactions and
longCompactions threads were not active.

Was the stack trace captured during a period when the number of client
operations was low?

If not, can you capture a stack trace during off-peak hours?

Cheers

On Mon, Sep 10, 2018 at 12:08 PM Srinidhi Muppalla <srinid...@trulia.com>
wrote:

> Hi Ted,
>
> The highest number of filters used is 10, but the average is generally
> close to 1. Is it possible the CPU usage spike has to do with HBase
> internal maintenance operations? It looks like post-upgrade the spike isn’t
> correlated with the frequency of reads/writes we are making, because the
> high CPU usage persisted when the number of operations went down.
>
> Thank you,
> Srinidhi
>
> On 9/8/18, 9:44 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>     Srinidhi :
>     Do you know the average / highest number of ColumnPrefixFilters in the
>     FilterList?
>
>     Thanks
>
>     On Fri, Sep 7, 2018 at 10:00 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>     > Thanks for detailed background information.
>     >
>     > I assume your code has de-duplicated the filters contained in the
>     > FilterListWithOR.
>     >
>     > I took a look at JIRAs which
>     > touched hbase-client/src/main/java/org/apache/hadoop/hbase/filter in
>     > branch-1.4. There were a few patches (some were very big) since the
>     > release of 1.3.0, so it is not obvious at first glance which one(s)
>     > might be related.
>     >
>     > I noticed ColumnPrefixFilter.getNextCellHint (and
>     > KeyValueUtil.createFirstOnRow) appearing many times in the stack
>     > trace.
>     >
>     > I plan to dig more in this area.
>     >
>     > Cheers
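The de-dup assumed here could, as one hedged sketch, look like the following HBase-free Java fragment (names are illustrative, not from Srinidhi's code). Besides dropping exact duplicates, it also drops any prefix that extends a shorter prefix already in the list, since under OR semantics the shorter prefix matches everything the longer one would:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class PrefixDedup {
    // Removes exact duplicates, then removes prefixes subsumed by a shorter
    // one: after sorting, a covering prefix sorts immediately before its
    // extensions, so comparing against the last kept entry is sufficient.
    static List<String> dedup(List<String> prefixes) {
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(prefixes));
        unique.sort(null); // natural (lexicographic) order
        List<String> kept = new ArrayList<>();
        for (String p : unique) {
            if (kept.isEmpty() || !p.startsWith(kept.get(kept.size() - 1))) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("email-1", "email-12", "email-2", "email-1")));
        // [email-1, email-2]
    }
}
```

Fewer, non-overlapping prefixes mean fewer filters for the region server to evaluate per cell, which matters when getNextCellHint is showing up hot in the stack traces.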
>     >
>     > On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla
>     > <srinid...@trulia.com> wrote:
>     >
>     >> Sure thing. For our table schema, each row represents one user and
>     >> the row key is that user’s unique id in our system. We currently only
>     >> use one column family in the table. The column qualifiers represent
>     >> an item that has been surfaced to that user, as well as additional
>     >> information to differentiate the way the item has been surfaced to
>     >> the user. Without getting into too many specifics, the qualifier
>     >> follows the rough format of:
>     >>
>     >> “Channel-itemId-distinguisher”.
>     >>
>     >> The channel here is the channel through which the item was previously
>     >> surfaced to the user. The itemId is the unique id of the item that
>     >> has been surfaced to the user. A distinguisher is some attribute of
>     >> how that item was surfaced to the user.
>     >>
>     >> When we run a scan, we currently only ever run it on one row at a
>     >> time. It was chosen over ‘get’ because (from our understanding) the
>     >> performance difference is negligible, and down the road using scan
>     >> would allow us some more flexibility.
>     >>
>     >> The filter list that is constructed with the scan works by using a
>     >> ColumnPrefixFilter, as you mentioned. When a user is being
>     >> communicated to on a particular channel, we have a list of items that
>     >> we want to potentially surface for that user. So, we construct a
>     >> prefix list with the channel and each of the item ids in the form of
>     >> “channel-itemId”. Then we run a scan on that row with that filter
>     >> list, using “WithOr” to get all of the matching channel-itemId
>     >> combinations currently in that row/column family in the table. This
>     >> way we can then know which of the items we want to surface to that
>     >> user on that channel have already been surfaced on that channel. The
>     >> reason we query using a prefix filter is so that we don’t need to
>     >> know the ‘distinguisher’ part of the record when writing the actual
>     >> query, because the distinguisher is only relevant in certain
>     >> circumstances.
>     >>
>     >> Let me know if this is the information about our query pattern that
>     >> you were looking for, and if there is anything I can clarify or add.
>     >>
>     >> Thanks,
>     >> Srinidhi
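The query pattern described above can be sketched, minus the HBase client, as a short Java illustration; all names and values are illustrative assumptions, not code from Srinidhi's system. Each prefix in the OR list passes any qualifier that starts with it, so a single-row scan returns exactly the channel-itemId combinations that were already surfaced:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixOrSketch {
    // Mimics a FilterListWithOR of ColumnPrefixFilters: a column qualifier
    // matches when it starts with any of the given "channel-itemId" prefixes.
    static List<String> matching(List<String> qualifiers, List<String> prefixes) {
        List<String> matched = new ArrayList<>();
        for (String qualifier : qualifiers) {
            for (String prefix : prefixes) {
                if (qualifier.startsWith(prefix)) {
                    matched.add(qualifier);
                    break; // OR semantics: one matching prefix is enough
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical qualifiers stored in one user's row.
        List<String> row = List.of(
            "email-123-variantA", "email-456-variantB", "push-123-variantA");
        // Prefixes for the channel/items we are about to surface.
        List<String> prefixes = List.of("email-123", "email-999");
        System.out.println(matching(row, prefixes)); // [email-123-variantA]
    }
}
```

Note that "email-456-variantB" is excluded even though it shares the channel, because the prefix match requires both channel and itemId; the distinguisher suffix never needs to be known at query time.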
>     >>
>     >> On 9/6/18, 12:24 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>     >>
>     >>     From the stack trace, ColumnPrefixFilter is used during scan.
>     >>
>     >>     Can you illustrate how the various filters are formed through
>     >>     FilterListWithOR? It would be easier for other people to reproduce
>     >>     the problem given your query pattern.
>     >>
>     >>     Cheers
>     >>
>     >>     On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla
>     >>     <srinid...@trulia.com> wrote:
>     >>
>     >>     > Hi Vlad,
>     >>     >
>     >>     > Thank you for the suggestion. I recreated the issue and attached
>     >>     > the stack traces I took. Let me know if there’s any other info I
>     >>     > can provide. We narrowed the issue down to occurring when
>     >>     > upgrading from 1.3.0 to any 1.4.x version.
>     >>     >
>     >>     > Thanks,
>     >>     > Srinidhi
>     >>     >
>     >>     > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodio...@gmail.com>
>     >>     > wrote:
>     >>     >
>     >>     >     Hi, Srinidhi
>     >>     >
>     >>     >     Next time you see this issue, take a jstack of a RS several
>     >>     >     times in a row. Without stack traces it is hard to tell what
>     >>     >     was going on with your cluster after the upgrade.
>     >>     >
>     >>     >     -Vlad
>     >>     >
>     >>     >
>     >>     >
>     >>     >     On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla
>     >>     >     <srinid...@trulia.com> wrote:
>     >>     >
>     >>     >     > Hello all,
>     >>     >     >
>     >>     >     > We are currently running HBase 1.3.0 on an EMR cluster
>     >>     >     > running EMR 5.5.0. Recently, we attempted to upgrade our
>     >>     >     > cluster to HBase 1.4.4 (along with upgrading our EMR
>     >>     >     > cluster to 5.16). After upgrading, the CPU usage for all
>     >>     >     > of our region servers spiked up to 90%. The load_one for
>     >>     >     > all of our servers spiked from roughly 1-2 to 10 threads.
>     >>     >     > After upgrading, the number of operations to the cluster
>     >>     >     > hasn’t increased. After giving the cluster a few hours, we
>     >>     >     > had to revert the upgrade. From the logs, we are unable to
>     >>     >     > tell what is occupying the CPU resources. Is this a known
>     >>     >     > issue with 1.4.4? Any guidance or ideas for debugging the
>     >>     >     > cause would be greatly appreciated. What are the best
>     >>     >     > steps for debugging CPU usage?
>     >>     >     >
>     >>     >     > Thank you,
>     >>     >     > Srinidhi
>     >>     >     >
>     >>     >
>     >>     >
>     >>     >
>     >>
>     >>
>     >>
>
>
>
