Thanks for the detailed background information. I assume your code has already de-duplicated the filters contained in FilterListWithOR.
I took a look at the JIRAs which touched hbase-client/src/main/java/org/apache/hadoop/hbase/filter in branch-1.4. There were a few patches (some were very big) since the release of 1.3.0, so it is not obvious at first glance which one(s) might be related.

I noticed ColumnPrefixFilter.getNextCellHint (and KeyValueUtil.createFirstOnRow) appearing many times in the stack trace. I plan to dig more in this area.

Cheers

On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla <srinid...@trulia.com> wrote:

> Sure thing. For our table schema, each row represents one user and the row key is that user’s unique id in our system. We currently only use one column family in the table. The column qualifiers represent an item that has been surfaced to that user, as well as additional information to differentiate the way the item has been surfaced to the user. Without getting into too many specifics, the qualifier follows the rough format of:
>
> “Channel-itemId-distinguisher”
>
> The channel here is the channel through which the item was previously surfaced to the user. The itemId is the unique id of the item that has been surfaced to the user. A distinguisher is some attribute about how that item was surfaced to the user.
>
> When we run a scan, we currently only ever run it on one row at a time. It was chosen over ‘get’ because (from our understanding) the performance difference is negligible, and down the road using scan would allow us some more flexibility.
>
> The filter list that is constructed with the scan works by using a ColumnPrefixFilter as you mentioned. When a user is being communicated to on a particular channel, we have a list of items that we want to potentially surface for that user. So, we construct a prefix list with the channel and each of the item ids in the form of: “channel-itemId”.
> Then we run a scan on that row with that filter list using “WithOr” to get all of the matching channel-itemId combinations currently in that row/column family in the table. This way we can then know which of the items we want to surface to that user on that channel have already been surfaced on that channel. The reason we query using a prefix filter is so that we don’t need to know the ‘distinguisher’ part of the record when writing the actual query, because the distinguisher is only relevant in certain circumstances.
>
> Let me know if this is the information about our query pattern that you were looking for and if there is anything I can clarify or add.
>
> Thanks,
> Srinidhi
>
> On 9/6/18, 12:24 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
> > From the stack trace, ColumnPrefixFilter is used during scan.
> >
> > Can you illustrate how various filters are formed thru FilterListWithOR? It would be easier for other people to reproduce the problem given your query pattern.
> >
> > Cheers
> >
> > On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla <srinid...@trulia.com> wrote:
> >
> > > Hi Vlad,
> > >
> > > Thank you for the suggestion. I recreated the issue and attached the stack traces I took. Let me know if there’s any other info I can provide. We narrowed the issue down to occurring when upgrading from 1.3.0 to any 1.4.x version.
> > >
> > > Thanks,
> > > Srinidhi
> > >
> > > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodio...@gmail.com> wrote:
> > >
> > > > Hi, Srinidhi
> > > >
> > > > Next time you see this issue, take a jstack of a RS several times in a row. W/o stack traces it is hard to tell what was going on with your cluster after the upgrade.
> > > >
> > > > -Vlad
> > > >
> > > > On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla <srinid...@trulia.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > We are currently running HBase 1.3.0 on an EMR cluster running EMR 5.5.0. Recently, we attempted to upgrade our cluster to using HBase 1.4.4 (along with upgrading our EMR cluster to 5.16). After upgrading, the CPU usage for all of our region servers spiked up to 90%. The load_one for all of our servers spiked from roughly 1-2 to 10 threads. After upgrading, the number of operations to the cluster hasn’t increased. After giving the cluster a few hours, we had to revert the upgrade. From the logs, we are unable to tell what is occupying the CPU resources. Is this a known issue with 1.4.4? Any guidance or ideas for debugging the cause would be greatly appreciated. What are the best steps for debugging CPU usage?
> > > > >
> > > > > Thank you,
> > > > > Srinidhi
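[Editor's note] For anyone trying to reproduce this, the query pattern Srinidhi describes (a single-row scan carrying an OR-combined list of ColumnPrefixFilters) can be modeled in plain Java. The sketch below uses only the standard library: the channel names, item ids, and qualifier strings are hypothetical, and the `startsWith` check stands in for the server-side semantics of an HBase `FilterList(Operator.MUST_PASS_ONE)` of `ColumnPrefixFilter`s. The real client code would build that `FilterList`, call `scan.setFilter(...)`, and restrict the scan to the single user row.

```java
import java.util.List;

// Stdlib-only model of FilterList(MUST_PASS_ONE) over ColumnPrefixFilters:
// a column qualifier passes if it starts with ANY of the "channel-itemId"
// prefixes. Names and values below are hypothetical, for illustration only.
public class PrefixOrFilterModel {

    // OR semantics: pass as soon as one prefix matches.
    static boolean matchesAnyPrefix(String qualifier, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (qualifier.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical prefixes: one channel combined with candidate item ids.
        List<String> prefixes = List.of("email-item42", "email-item99");

        // Hypothetical qualifiers stored in the user's row, following the
        // "channel-itemId-distinguisher" format described in the thread.
        List<String> qualifiers = List.of(
                "email-item42-mobile",  // matches prefix "email-item42"
                "push-item42-web",      // different channel, no match
                "email-item7-mobile");  // item not in the candidate list

        for (String q : qualifiers) {
            System.out.println(q + " -> " + matchesAnyPrefix(q, prefixes));
        }
    }
}
```

Because the prefix omits the distinguisher, a match is found regardless of how the item was originally surfaced, which is exactly why the prefix filter was chosen over an exact-qualifier lookup.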