No, you misunderstand. **Sequential** inserting into HBase is not very efficient for large volumes of data. Basically, if you are inserting a high volume of data with row keys that are all adjacent to each other, all of the load will be focused on a single region server. If the keys of the data being inserted are well distributed over the key space, then the load will be well distributed over the region servers as well.
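For illustration only (this is not code from the thread), here is a minimal sketch of that idea using the classic Java client API (`HTable`/`Put`), assuming a hypothetical "events" table and "data" column family: instead of using a monotonically increasing value directly as the row key, key on a hash of it so consecutive inserts land on different regions. Exact constructors and method names vary between HBase versions.

```java
import java.security.MessageDigest;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedKeyInsert {
  public static void main(String[] args) throws Exception {
    // Hypothetical table name and column family.
    HTable table = new HTable(new HBaseConfiguration(), "events");

    // The "natural" key of the event -- whatever it is in your application.
    byte[] naturalKey = Bytes.toBytes("event-000001");

    // Hash the natural key so that consecutive events do not produce
    // adjacent row keys, spreading the write load across region servers.
    byte[] rowKey = MessageDigest.getInstance("SHA-1").digest(naturalKey);

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("data"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
    table.put(put);

    table.flushCommits();
  }
}
```

The trade-off is that you give up ordered scans by the natural key; if you still need lookups by that key, store it in a column or keep a secondary index.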
If your data is keyed by timestamp and/or you are doing bulk uploading, then the solution you refer to is appropriate. Since you say you are not keying by timestamp, your keying strategy may well be fine as it is. For example, in a Web crawling application of mine, I use the SHA-1 hash of the retrieved content as the row key. Due to the properties of the hash function, inserts are well distributed across the key space. Another example: if you are importing data via a MapReduce job, you can write a trivial partitioner that randomly distributes keys across the set of reducers, which will then store values into HBase in parallel and in random order (a rough sketch follows below the quoted message).

   - Andy


----- Original Message ----
> From: kishore g <[email protected]>
> To: [email protected]
> Sent: Thu, January 7, 2010 3:00:28 PM
> Subject: Insert streamed data into hbase
>
> Hi,
>
> I see that the inserting into hbase is not very efficient for large data.
> For event logging i see the solution explained in
> http://www.mail-archive.com/[email protected]/msg06010.html
>
> If my understanding is correct this is applicable if key is timestamp. Is
> there a solution to achieve the following
>
> --> we get stream of events and we want to insert into a table but our key
> will be something different from timestamp.
>
> Is there any way to achieve this efficiently apart from inserting every
> event using hbase api's.
>
> thanks
> Kg
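As promised above, here is a minimal sketch (again, not code from the thread) of a random partitioner for the `org.apache.hadoop.mapreduce` API, assuming the map output is keyed by `ImmutableBytesWritable` with `Put` values destined for HBase; you would register it with `job.setPartitionerClass(RandomPartitioner.class)`:

```java
import java.util.Random;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Spread map output evenly (and in effectively random order) across the
// reducers, so they write to HBase in parallel across the key space.
public class RandomPartitioner extends Partitioner<ImmutableBytesWritable, Put> {
  private final Random random = new Random();

  @Override
  public int getPartition(ImmutableBytesWritable key, Put value, int numPartitions) {
    return random.nextInt(numPartitions);
  }
}
```

Note this abandons the usual guarantee that all values for a given key reach the same reducer, which is fine here because each reducer only emits its Puts and no per-key aggregation is needed.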
