No, you misunderstand. **Sequential** inserting into HBase is not very efficient for large volumes of data. Basically, if you are inserting a high volume of data with row keys that are all adjacent to each other, all of the load will be focused on a single region server. If the keys of the data being inserted are well distributed over the key space, then the load will be well distributed over the region servers as well.
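For illustration only (this is not code from the thread), here is a minimal sketch of that idea using the classic Java client API (`HTable`/`Put`), assuming a hypothetical "events" table and "data" column family: instead of using a monotonically increasing value directly as the row key, key on a hash of it so consecutive inserts land on different regions. Exact constructors and method names vary between HBase versions.

```java
import java.security.MessageDigest;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedKeyInsert {
  public static void main(String[] args) throws Exception {
    // Hypothetical table name and column family.
    HTable table = new HTable(new HBaseConfiguration(), "events");

    // The "natural" key of the event -- whatever it is in your application.
    byte[] naturalKey = Bytes.toBytes("event-000001");

    // Hash the natural key so that consecutive events do not produce
    // adjacent row keys, spreading the write load across region servers.
    byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(naturalKey);

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("data"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
    table.put(put);

    table.flushCommits();
  }
}
```

The trade-off is that you give up ordered scans by the natural key; if you still need lookups by that key, store it in a column or keep a secondary index.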
If your data is keyed by timestamp and/or you are doing bulk uploading, then the solution you refer to is appropriate. Since you say you are not keying by timestamp, your keying strategy may well be fine as it is. For example, in a Web crawling application of mine, I use the SHA-1 hash of the retrieved content as the row key. Due to the properties of the hash function, inserts are well distributed across the key space. Another example: if you are importing data via a MapReduce job, you can write a trivial partitioner that randomly distributes keys across the set of reducers, which will then store values into HBase in parallel and in random order (a rough sketch follows below the quoted message).

   - Andy


----- Original Message ----
> From: kishore g <[email protected]>
> To: [email protected]
> Sent: Thu, January 7, 2010 3:00:28 PM
> Subject: Insert streamed data into hbase
>
> Hi,
>
> I see that the inserting into hbase is not very efficient for large data.
> For event logging i see the solution explained in
> http://www.mail-archive.com/[email protected]/msg06010.html
>
> If my understanding is correct this is applicable if key is timestamp. Is
> there a solution to achieve the following
>
> --> we get stream of events and we want to insert into a table but our key
> will be something different from timestamp.
>
> Is there any way to achieve this efficiently apart from inserting every
> event using hbase api's.
>
> thanks
> Kg
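As promised above, here is a minimal sketch (again, not code from the thread) of a random partitioner for the `org.apache.hadoop.mapreduce` API, assuming the map output is keyed by `ImmutableBytesWritable` with `Put` values destined for HBase; you would register it with `job.setPartitionerClass(RandomPartitioner.class)`:

```java
import java.util.Random;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Spread map output evenly (and in effectively random order) across the
// reducers, so they write to HBase in parallel across the key space.
public class RandomPartitioner extends Partitioner<ImmutableBytesWritable, Put> {
  private final Random random = new Random();

  @Override
  public int getPartition(ImmutableBytesWritable key, Put value, int numPartitions) {
    return random.nextInt(numPartitions);
  }
}
```

Note this abandons the usual guarantee that all values for a given key reach the same reducer, which is fine here because each reducer only emits its Puts and no per-key aggregation is needed.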
