Thanks for the reply. Regarding "write performance would be lower" -- just to confirm, does that mean Put would perform better?

Also, I think I used the wrong terminology regarding batching. I meant to ask whether Append uses the client-side write buffer. I would think not, since the append() method returns a Result. I could batch them up application side, I suppose (roughly as in the sketch below). Append also seems to return the updated value, which looks like a lot of unnecessary I/O in my case since I am not immediately interested in the updated value. I guess there is no way to turn that off?
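Concretely, this is roughly what I had in mind for application-side batching (just a sketch against the 0.94 client API; the "tasks" table name and "t" family are made up):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Append;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Row;
  import org.apache.hadoop.hbase.util.Bytes;

  public class AppendBatch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "tasks"); // hypothetical table name

      // Accumulate Appends application side instead of issuing one
      // append() RPC per task.
      List<Row> actions = new ArrayList<Row>();
      long second = System.currentTimeMillis() / 1000L;
      for (int i = 0; i < 100; i++) {
        Append a = new Append(Bytes.toBytes(second));
        a.add(Bytes.toBytes("t"), Bytes.toBytes("tasks"), Bytes.toBytes("task-" + i));
        actions.add(a);
      }

      // Send the whole list in one call; results[i] holds each
      // action's outcome.
      Object[] results = new Object[actions.size()];
      table.batch(actions, results);
      table.close();
    }
  }

If batch() really does group the actions by region server, that would amortize most of the per-append round trips.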

On 4/15/13 1:28 PM, Ted Yu wrote:
I assume you would select HBase 0.94.6.1 (the latest release) for this
project.

For #1, write performance would be lower if you choose to use Append (vs.
using Put).

bq. Can appends be batched by the client or do they execute immediately?
This depends on your use case. Take a look at the following method in
HTable where you can send a list of actions (Appends):

   public void batch(final List<? extends Row> actions, final Object[] results)

For #2:
bq. The other would be to prefix the timestamp row key with a random
leading byte.

This technique has been used elsewhere and is better than the first one.
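Something along these lines (a sketch only; the bucket count of 8 is illustrative):

  import java.util.Random;

  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedKey {
    private static final int BUCKETS = 8; // illustrative; tune to your cluster
    private static final Random RANDOM = new Random();

    // Prefixing one random byte spreads consecutive timestamps across
    // up to BUCKETS row key ranges instead of one hot range.
    public static byte[] saltedRowKey(long epochSecond) {
      byte salt = (byte) RANDOM.nextInt(BUCKETS);
      return Bytes.add(new byte[] { salt }, Bytes.toBytes(epochSecond));
    }
  }

If you pre-split the table on the salt byte boundaries, the buckets can be served by different region servers from the start.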

Cheers

On Mon, Apr 15, 2013 at 6:09 AM, Kireet Reddy 
<kireet-teh5dpvpl8nqt0dzr+a...@public.gmane.org> wrote:

We are planning to create a "scheduled task list" table in our HBase cluster. Essentially we will define a table keyed by timestamp, and the row contents will be all the tasks that need to be processed within that second (or whatever time period). I am trying to do the "reasonably wide rows" design mentioned in the HBaseCon OpenTSDB talk. A couple of questions:

1. Should we use Append or Put to create tasks? Since these rows will not live forever, storage space is not a concern; read/write performance is more important. As concurrency increases, I would guess the row lock may become an issue with Append? Can appends be batched by the client or do they execute immediately? To make this concrete, I sketched both options below.
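Here is how I picture the two options (sketch only; the column family and qualifier names are made up):

  import org.apache.hadoop.hbase.client.Append;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutVsAppend {
    private static final byte[] FAMILY = Bytes.toBytes("t");

    // Option A: one column per task. No read-modify-write, and Puts can
    // ride the client-side write buffer when autoflush is off.
    static void addWithPut(HTable table, long second, String taskId, byte[] task)
        throws Exception {
      Put put = new Put(Bytes.toBytes(second));
      put.add(FAMILY, Bytes.toBytes(taskId), task);
      table.put(put);
    }

    // Option B: append every task onto a single cell. Takes the row lock
    // and returns the updated value in a Result on every call.
    static void addWithAppend(HTable table, long second, byte[] task)
        throws Exception {
      Append append = new Append(Bytes.toBytes(second));
      append.add(FAMILY, Bytes.toBytes("tasks"), task);
      table.append(append);
    }
  }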

2. I am a little worried about hotspots. This basic design may cause issues in terms of the table's performance. Many tasks will execute and reschedule themselves using the same interval, t + 1 hour for example, so many of the writes may all go to the same block. Also, we have a lot of other data, so I am worried it may impact performance of unrelated data if the region server gets too busy servicing the task list table. I can think of 2 strategies to avoid this. One would be to create N different tables and read/write tasks to them randomly. This may spread load across servers, but there is no guarantee HBase will place the tables on different region servers, correct? The other would be to prefix the timestamp row key with a random leading byte. Then when reading from the task list table, consumers could scan from any/all possible values of the random byte + current timestamp to obtain tasks (roughly as in the read-side sketch below). Both strategies seem like they could spread out load, but at the cost of more work/complexity to read tasks from the table. Does either of those approaches make sense?
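For the random leading byte option, I imagine the read side looking roughly like this (sketch; BUCKETS has to match whatever the writers use):

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedRead {
    private static final int BUCKETS = 8; // must match the writers

    // One scan per salt value to cover a single second -- this is the
    // extra read cost the salting approach introduces.
    static void readTasks(HTable table, long second) throws Exception {
      for (int salt = 0; salt < BUCKETS; salt++) {
        byte[] start = Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes(second));
        byte[] stop  = Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes(second + 1));
        ResultScanner scanner = table.getScanner(new Scan(start, stop));
        try {
          for (Result r : scanner) {
            // hand the tasks in row r to a worker...
          }
        } finally {
          scanner.close();
        }
      }
    }
  }

So each consumer pays one scan per bucket per time slice, which is the extra work/complexity I mentioned.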

On the read side, it seems like a similar problem exists in that all consumers will be reading rows based on the current timestamp. Is this good because the block will very likely be cached, or bad because the region server may become overloaded? I have a feeling the answer is going to be "it depends". :)

I did see the previous posts on queues and the tips there - use ZooKeeper for coordination, schedule major compactions, etc. Sorry if these questions are basic; I am pretty new to HBase. Thanks!


