Hope I'm not too late here... regarding hot spotting with sequential keys,
I'd suggest you read this Sematext blog -
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
They present a nice idea there for this kind of issue.
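
In case it helps, here is a rough sketch of the bare salting technique that
post describes (not the HBaseWD library itself). The table name, column
family and bucket count below are made up, so treat it as an illustration
rather than a drop-in implementation:

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedTaskWriter {
      // Number of salt prefixes; sequential timestamps get spread across
      // this many distinct key ranges (and therefore regions).
      static final int BUCKETS = 16;

      private final HTable table;
      private final Random random = new Random();

      public SaltedTaskWriter(Configuration conf) throws IOException {
        this.table = new HTable(conf, "tasks");   // hypothetical table name
      }

      public void addTask(long scheduledSecond, byte[] taskId, byte[] payload)
          throws IOException {
        // Row key = 1 salt byte + 8-byte timestamp, so writes for the same
        // second land in up to BUCKETS regions instead of one hot region.
        byte salt = (byte) random.nextInt(BUCKETS);
        byte[] row = Bytes.add(new byte[] { salt }, Bytes.toBytes(scheduledSecond));
        Put put = new Put(row);
        put.add(Bytes.toBytes("t"), taskId, payload);   // "t" = hypothetical family
        table.put(put);
      }
    }

The cost shows up on the read side: consumers have to issue one scan (or get)
per salt prefix for a given timestamp and merge the results, which is exactly
the extra work/complexity discussed further down the thread.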

Good Luck!



On Mon, Apr 15, 2013 at 11:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. write performance would be lower
>
> The above means poorer performance.
>
> bq. I could batch them up application side
>
> Please do that.
>
> bq. I guess there is no way to turn that off?
>
> That's right.
>
> On Mon, Apr 15, 2013 at 11:15 AM, Kireet <kir...@feedly.com> wrote:
>
> >
> >
> >
> > Thanks for the reply. "write performance would be lower" -> does this
> > mean better?
> >
> > Also, I think I used the wrong terminology regarding batching. I meant to
> > ask if it uses the client side write buffer. I would think not, since the
> > append() method returns a Result. I could batch them up application side,
> > I suppose. Append also seems to return the updated value. This seems like
> > a lot of unnecessary I/O in my case since I am not immediately interested
> > in the updated value. I guess there is no way to turn that off?
> >
> >
> > On 4/15/13 1:28 PM, Ted Yu wrote:
> >
> >> I assume you would select HBase 0.94.6.1 (the latest release) for this
> >> project.
> >>
> >> For #1, write performance would be lower if you choose to use Append
> >> (vs. using Put).
> >>
> >> bq. Can appends be batched by the client or do they execute immediately?
> >> This depends on your use case. Take a look at the following method in
> >> HTable where you can send a list of actions (Appends):
> >>
> >>    public void batch(final List<? extends Row> actions, final Object[] results)
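> >>
> >> A minimal sketch of buffering Appends on the application side and
> >> submitting them through batch() (imports omitted; "t", ScheduledTask and
> >> its accessors are made-up names for illustration, not an existing API):
> >>
> >>    void flushTasks(HTable table, List<ScheduledTask> pending)
> >>        throws IOException, InterruptedException {
> >>      List<Row> actions = new ArrayList<Row>();
> >>      for (ScheduledTask task : pending) {
> >>        Append append = new Append(task.rowKey());   // e.g. salted timestamp key
> >>        append.add(Bytes.toBytes("t"), task.qualifier(), task.payload());
> >>        actions.add(append);
> >>      }
> >>      Object[] results = new Object[actions.size()];
> >>      table.batch(actions, results);   // actions are grouped per region server
> >>    }
> >>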
> >> For #2
> >> bq. The other would be to prefix the timestamp row key with a random
> >> leading byte.
> >>
> >> This technique has been used elsewhere and is better than the first one.
> >>
> >> Cheers
> >>
> >> On Mon, Apr 15, 2013 at 6:09 AM, Kireet Reddy wrote:
> >>
> >>> We are planning to create a "scheduled task list" table in our hbase
> >>> cluster. Essentially we will define a table keyed by timestamp, and the
> >>> row contents will be all the tasks that need to be processed within that
> >>> second (or whatever time period). I am trying to do the "reasonably wide
> >>> rows" design mentioned in the hbasecon opentsdb talk. A couple of
> >>> questions:
> >>>
> >>> 1. Should we use append or put to create tasks? Since these rows will not
> >>> live forever, storage space is not a concern; read/write performance is
> >>> more important. As concurrency increases, I would guess the row lock may
> >>> become an issue with append? Can appends be batched by the client or do
> >>> they execute immediately?
> >>>
> >>> 2. I am a little worried about hotspots. This basic design may cause
> >>> issues in terms of the table's performance. Many tasks will execute and
> >>> reschedule themselves using the same interval, t + 1 hour for example, so
> >>> many of the writes may all go to the same block. Also, we have a lot of
> >>> other data, so I am worried it may impact performance of unrelated data
> >>> if the region server gets too busy servicing the task list table. I can
> >>> think of 2 strategies to avoid this. One would be to create N different
> >>> tables and read/write tasks to them randomly. This may spread load across
> >>> servers, but there is no guarantee hbase will place the tables on
> >>> different region servers, correct? The other would be to prefix the
> >>> timestamp row key with a random leading byte. Then when reading from the
> >>> task list table, consumers could scan from any/all possible values of the
> >>> random byte + current timestamp to obtain tasks. Both strategies seem
> >>> like they could spread out load, but at the cost of more work/complexity
> >>> to read tasks from the table. Does either of those approaches make sense?
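> >>>
> >>> To make the second strategy concrete, I picture the consumer side looking
> >>> roughly like this (only a sketch; the bucket count, "table" and the
> >>> surrounding method/imports are assumed, not code we actually have):
> >>>
> >>>    // Scan every salt bucket for the current second and merge the tasks.
> >>>    for (int salt = 0; salt < BUCKETS; salt++) {
> >>>      byte[] start = Bytes.add(new byte[] { (byte) salt },
> >>>          Bytes.toBytes(nowSeconds));
> >>>      byte[] stop = Bytes.add(new byte[] { (byte) salt },
> >>>          Bytes.toBytes(nowSeconds + 1));
> >>>      Scan scan = new Scan(start, stop);    // stop row is exclusive
> >>>      ResultScanner scanner = table.getScanner(scan);
> >>>      try {
> >>>        for (Result r : scanner) {
> >>>          // each column in this row is one task scheduled for this second
> >>>        }
> >>>      } finally {
> >>>        scanner.close();
> >>>      }
> >>>    }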
> >>>
> >>> On the read side, it seems like a similar problem exists in that all
> >>> consumers will be reading rows based on the current timestamp. Is this
> >>> good because the block will very likely be cached, or bad because the
> >>> region server may become overloaded? I have a feeling the answer is
> >>> going to be "it depends". :)
> >>>
> >>> I did see the previous posts on queues and the tips there - use zookeeper
> >>> for coordination, schedule major compactions, etc. Sorry if these
> >>> questions are basic; I am pretty new to hbase. Thanks!
> >>>
> >>
> >>
> >
> >
>
