Migrating from Apache Cassandra to Hbase
Hi,

Currently I'm using Apache Cassandra as the backend for my RESTful application. With a cluster of 30 nodes (each with 12 cores, 64 GB of RAM and 6 TB of disk, of which 50% is used), write and read throughput is more than satisfactory for us.

The input is a fixed set of long and int columns, and we need to query it by every column. With 8 columns, that means 8 tables, following Cassandra's query-driven data modeling recommendation. The Cassandra keyspace schema would be something like this:

Table 1 (timebucket, col1, ..., col8, primary key (timebucket, col1))
to handle: select * from input where timebucket = X and col1 = Y

Table 8 (timebucket, col1, ..., col8, primary key (timebucket, col8))

So for each input row there would be 8 inserts in Cassandra (not counting RF), and with a TTL of 12 months the production cluster should keep about 2 petabytes of data. With the recommended node density for a Cassandra cluster (2 TB per node), I would need a cluster of more than 1000 nodes, which I cannot afford.

Long story short: I'm looking for an alternative to Apache Cassandra for this application. How would HBase solve these problems?

1. 8X data redundancy due to the needed queries.
2. Nodes with large data density (30 TB of data on each node if No. 1 cannot be solved in HBase). How would HBase handle compaction and node join/remove while there are only 5 * 6 TB 7200 RPM SATA disks available on each node? How much free space does HBase need for the temporary files of compaction?
3. I have also read in some documents (including DataStax's) that HBase is more of an offline, data-lake backend that is better not used as a web application backend, which needs response times of no more than a few seconds.

Thanks in advance
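The node-count estimate in the message above can be checked with a few lines of arithmetic. This is only a minimal sketch: it assumes decimal units (1 PB = 1000 TB) and treats the 2 PB figure as the total volume to be stored, neither of which the message states. The 30 TB case uses the per-node density mentioned in question 2. `NodeCountSketch` and `nodesNeeded` are illustrative names.

```java
// Back-of-the-envelope node count for the figures quoted in the message.
// Assumptions (not from the original post): decimal units (1 PB = 1000 TB)
// and that 2 PB is the total volume the cluster has to hold.
class NodeCountSketch {

    /** Smallest number of nodes that can hold totalTb at perNodeTb each. */
    static long nodesNeeded(long totalTb, long perNodeTb) {
        if (totalTb < 0 || perNodeTb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (totalTb + perNodeTb - 1) / perNodeTb; // ceiling division
    }

    public static void main(String[] args) {
        long totalTb = 2_000; // about 2 PB
        System.out.println("At 2 TB per node:  " + nodesNeeded(totalTb, 2));
        System.out.println("At 30 TB per node: " + nodesNeeded(totalTb, 30));
    }
}
```

The 2 TB case gives the roughly 1000 nodes quoted above, before any headroom for compaction or growth is added.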
Re: Extremely high CPU usage after upgrading to Hbase 1.4.4
The createFirstOnRow() is used by ColumnXXFilter's getNextCellHint() method.

I am thinking about adding a variant to getNextCellHint() which returns a tuple, representing first on row, consisting of:

Cell - the passed in Cell instance
byte[] - qualifier array
int - qualifier offset
int - qualifier length

This variant doesn't allocate (new) Cell / KeyValue. This way, FilterListWithOR#shouldPassCurrentCellToFilter can use the returned tuple for comparison.

FYI
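A rough illustration of the tuple idea from the top of this message, written without any HBase classes so that it runs on its own. Everything here is an assumption made for illustration: `FirstOnRowHint` and `compareQualifier` are hypothetical names, the type parameter `C` stands in for HBase's `Cell`, and the real change, if made, may look quite different.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: FirstOnRowHint and compareQualifier are illustrative
// names, not HBase API. The type parameter C stands in for HBase's Cell.
final class FirstOnRowHint<C> {
    final C cell;                 // the passed-in cell, reused rather than copied
    final byte[] qualifierArray;  // array that contains the hint qualifier
    final int qualifierOffset;    // where the qualifier starts in that array
    final int qualifierLength;    // number of qualifier bytes

    FirstOnRowHint(C cell, byte[] qualifierArray, int qualifierOffset, int qualifierLength) {
        if (qualifierOffset < 0 || qualifierLength < 0
                || qualifierOffset + qualifierLength > qualifierArray.length) {
            throw new IllegalArgumentException("qualifier slice out of bounds");
        }
        this.cell = cell;
        this.qualifierArray = qualifierArray;
        this.qualifierOffset = qualifierOffset;
        this.qualifierLength = qualifierLength;
    }

    /**
     * Unsigned lexicographic comparison of the hint qualifier against another
     * qualifier slice. Reads both arrays in place and allocates nothing.
     */
    int compareQualifier(byte[] other, int otherOffset, int otherLength) {
        int common = Math.min(qualifierLength, otherLength);
        for (int i = 0; i < common; i++) {
            int a = qualifierArray[qualifierOffset + i] & 0xff;
            int b = other[otherOffset + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return qualifierLength - otherLength; // shorter slice sorts first
    }

    public static void main(String[] args) {
        byte[] backing = "xxemail-42yy".getBytes(StandardCharsets.UTF_8);
        FirstOnRowHint<String> hint = new FirstOnRowHint<>("current-cell", backing, 2, 8);
        byte[] candidate = "email-42-digest".getBytes(StandardCharsets.UTF_8);
        // A negative result means the hint sorts before the candidate qualifier.
        System.out.println(hint.compareQualifier(candidate, 0, candidate.length));
    }
}
```

The point of the holder is that the comparison works on the existing arrays by offset and length, so evaluating a hint does not create a new object per filter.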
Re: Extremely high CPU usage after upgrading to Hbase 1.4.4
Thanks for the detailed background information.

I assume your code has done de-dup for the filters contained in FilterListWithOR.

I took a look at JIRAs which touched hbase-client/src/main/java/org/apache/hadoop/hbase/filter in branch-1.4. There were a few patches (some were very big) since the release of 1.3.0, so it is not obvious at first glance which one(s) might be related.

I noticed ColumnPrefixFilter.getNextCellHint (and KeyValueUtil.createFirstOnRow) appearing many times in the stack trace.

I plan to dig more in this area.

Cheers
Re: Extremely high CPU usage after upgrading to Hbase 1.4.4
Sure thing. For our table schema, each row represents one user and the row key is that user’s unique id in our system. We currently only use one column family in the table. The column qualifiers represent an item that has been surfaced to that user as well as additional information to differentiate the way the item has been surfaced to the user. Without getting into too many specifics, the qualifier follows the rough format of:

“Channel-itemId-distinguisher”.

The channel here is the channel through which the item was previously surfaced to the user. The itemId is the unique id of the item that has been surfaced to the user. A distinguisher is some attribute about how that item was surfaced to the user.

When we run a scan, we currently only ever run it on one row at a time. It was chosen over ‘get’ because (from our understanding) the performance difference is negligible, and down the road using scan would allow us some more flexibility.

The filter list that is constructed with scan works by using a ColumnPrefixFilter as you mentioned. When a user is being communicated to on a particular channel, we have a list of items that we want to potentially surface for that user. So, we construct a prefix list with the channel and each of the item ids in the form of: “channel-itemId”. Then we run a scan on that row with that filter list using “WithOr” to get all of the matching channel-itemId combinations currently in that row/column family in the table. This way we know which of the items we want to surface to that user on that channel have already been surfaced on that channel. The reason we query using a prefix filter is so that we don’t need to know the ‘distinguisher’ part of the record when writing the actual query, because the distinguisher is only relevant in certain circumstances.

Let me know if this is the information about our query pattern that you were looking for and if there is anything I can clarify or add.

Thanks,
Srinidhi

On 9/6/18, 12:24 PM, "Ted Yu" wrote:

> From the stack trace, ColumnPrefixFilter is used during scan.
>
> Can you illustrate how various filters are formed thru FilterListWithOR?
> It would be easier for other people to reproduce the problem given your
> query pattern.
>
> Cheers
>
> On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla wrote:
>
> > Hi Vlad,
> >
> > Thank you for the suggestion. I recreated the issue and attached the
> > stack traces I took. Let me know if there’s any other info I can
> > provide. We narrowed the issue down to occurring when upgrading from
> > 1.3.0 to any 1.4.x version.
> >
> > Thanks,
> > Srinidhi
> >
> > On 9/4/18, 8:19 PM, "Vladimir Rodionov" wrote:
> >
> > > Hi, Srinidhi
> > >
> > > Next time you see this issue, take jstack of a RS several times in a
> > > row. W/o stack traces it is hard to tell what was going on with your
> > > cluster after upgrade.
> > >
> > > -Vlad
> > >
> > > On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla wrote:
> > >
> > > > Hello all,
> > > >
> > > > We are currently running Hbase 1.3.0 on an EMR cluster running EMR
> > > > 5.5.0. Recently, we attempted to upgrade our cluster to using Hbase
> > > > 1.4.4 (along with upgrading our EMR cluster to 5.16). After
> > > > upgrading, the CPU usage for all of our region servers spiked up to
> > > > 90%. The load_one for all of our servers spiked from roughly 1-2 to
> > > > 10 threads. After upgrading, the number of operations to the cluster
> > > > hasn’t increased. After giving the cluster a few hours, we had to
> > > > revert the upgrade. From the logs, we are unable to tell what is
> > > > occupying the CPU resources. Is this a known issue with 1.4.4? Any
> > > > guidance or ideas for debugging the cause would be greatly
> > > > appreciated. What are the best steps for debugging CPU usage?
> > > >
> > > > Thank you,
> > > > Srinidhi
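The query pattern described in this message can be modeled in a few lines of plain Java. This is a stand-alone sketch of the semantics only, not the poster's code and not HBase API: in HBase the same effect comes from one ColumnPrefixFilter per prefix combined in a FilterList with the MUST_PASS_ONE operator, which is what the thread calls FilterListWithOR. The names `SurfacedItemsSketch`, `buildPrefixes` and `alreadySurfaced`, and the sample channel and item values, are made up for illustration. Prefixes are de-duplicated up front, as assumed elsewhere in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Stand-alone model of the single-row prefix query; no HBase classes used.
// All names and sample values are illustrative, not taken from the thread.
class SurfacedItemsSketch {

    /** One "channel-itemId" prefix per distinct item id, in input order. */
    static Set<String> buildPrefixes(String channel, List<String> itemIds) {
        Set<String> prefixes = new LinkedHashSet<>(); // drops duplicate ids
        for (String itemId : itemIds) {
            prefixes.add(channel + "-" + itemId);
        }
        return prefixes;
    }

    /** Qualifiers of one row that start with at least one prefix (OR semantics). */
    static List<String> alreadySurfaced(List<String> rowQualifiers, Set<String> prefixes) {
        List<String> hits = new ArrayList<>();
        for (String qualifier : rowQualifiers) {
            for (String prefix : prefixes) {
                if (qualifier.startsWith(prefix)) {
                    hits.add(qualifier);
                    break; // one matching prefix is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> prefixes = buildPrefixes("email", Arrays.asList("42", "7", "42"));
        List<String> row = Arrays.asList("email-42-digest", "push-42-alert", "email-9-promo");
        System.out.println(alreadySurfaced(row, prefixes));
    }
}
```

Note that plain prefix matching also accepts qualifiers whose item id merely begins with the requested id, so a prefix built from id 4 would match an entry for id 42 as well.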