I have also considered this approach. But what about the other columns
that have no default value, e.g. depth, insertTime, ...? (The default
value of status is 0, so I can treat absence as 0.)
Anyway, if using put instead of checkAndPut makes it much faster,
I will consider this approach.
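
To make the concern concrete, here is a minimal in-memory sketch of the two write styles. It is not the HBase client API: the class UrlTableSketch and all of its methods are hypothetical names made up for illustration, a ConcurrentHashMap stands in for the table, and putIfColumnAbsent only imitates the "write if the cell is absent" behaviour I get from checkAndPut. It shows that treating absence as "not read" protects status, but a plain put on a second insertion still overwrites a column such as depth.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the url table; NOT the HBase client API.
class UrlTableSketch {
    // row key (url) -> column name -> value
    private final Map<String, Map<String, Long>> rows = new ConcurrentHashMap<>();

    private Map<String, Long> row(String url) {
        return rows.computeIfAbsent(url, k -> new ConcurrentHashMap<>());
    }

    // Plain put: always overwrites whatever value the column had.
    void put(String url, String column, long value) {
        row(url).put(column, value);
    }

    // Imitates a check-and-put against absence: writes only if the column
    // has no value yet. Returns true if the write happened.
    boolean putIfColumnAbsent(String url, String column, long value) {
        return row(url).putIfAbsent(column, value) == null;
    }

    // Returns the stored value, or null if the row or column is absent.
    Long get(String url, String column) {
        Map<String, Long> r = rows.get(url);
        return r == null ? null : r.get(column);
    }

    // Convention from this thread: an absent status column means "not read".
    boolean isUnread(String url) {
        return get(url, "status") == null;
    }

    public static void main(String[] args) {
        UrlTableSketch t = new UrlTableSketch();
        t.put("url", "depth", 1);    // first insertion (url1 -> url), status not written
        t.put("url", "status", 1);   // worker marks the url as read
        t.put("url", "depth", 5);    // second insertion (url2 -> url), plain put
        System.out.println(t.isUnread("url"));      // false: status was not reset
        System.out.println(t.get("url", "depth"));  // 5: depth was overwritten
    }
}
```

So in this model status is safe without any check, while depth and insertTime keep their first value only if they are written with the "absent" check (or if overwriting them is acceptable).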


On Tue, Apr 29, 2014 at 9:44 AM, Jean-Marc Spaggiari
<[email protected]> wrote:
> Simply don't set the status to 0 when you first write the row.
>
> Absence means not read.
> 1 means read.
> So there is no risk that one writer tries to set 0 while another tries to set 1.
>
> Will that be an option?
>
>
> 2014-04-28 21:23 GMT-04:00 Li Li <[email protected]>:
>
>> I am using HBase to store information for a web spider.
>> I have a table that stores information about web pages. The rowkey is the url,
>> and there are other columns such as status (int) and depth (int).
>> In the beginning, the status is 0. A worker thread selects urls
>> whose status is 0, processes them, and sets the status to 1, ...
>> More than one url can link to a given url,
>> e.g. url1->url, url2->url,
>> so url is inserted twice. If I do not use checkAndPut,
>> thread 1 inserts url, and the worker thread processes it
>> and sets its status to 1. Then thread 2 inserts url again and resets
>> the status to 0, so the worker thread processes it again. That's
>> not what I want.
>>
>> On Tue, Apr 29, 2014 at 8:56 AM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>> > Why do you want to make sure the row is only inserted once? If you insert
>> > the same row twice, the second one will simply overwrite the first one and
>> > HBase will take care of the versions.
>> >
>> > Regarding the code fragments, I don't think autoflush is going to make a
>> > big difference compared to the cost of the check & put...
>> >
>> >
>> > 2014-04-28 20:50 GMT-04:00 Li Li <[email protected]>:
>> >
>> >> I must use checkAndPut to ensure a row is only inserted once.
>> >> If I have 1000 checkAndPut calls, will setAutoFlush(false) be useful?
>> >> Is there any performance difference between the following two code fragments?
>> >> 1.
>> >>     table.setAutoFlush(false);
>> >>     for(int i=0;i<1000;i++){
>> >>          Put put=...
>> >>          table.checkAndPut(,....put);
>> >>     }
>> >> 2.
>> >>     table.setAutoFlush(true);
>> >>     for(int i=0;i<1000;i++){
>> >>          Put put=...
>> >>          table.checkAndPut(,....put);
>> >>     }
>> >>
>> >> On Tue, Apr 29, 2014 at 8:36 AM, Jean-Marc Spaggiari
>> >> <[email protected]> wrote:
>> >> > It depends. Batching a list of puts/gets will be way faster than checkAndPut,
>> >> > but the result will not be the same... a batch of puts will not do any
>> >> > check...
>> >> >
>> >> >
>> >> > 2014-04-28 20:17 GMT-04:00 Li Li <[email protected]>:
>> >> >
>> >> >> But I have many checkAndPut operations.
>> >> >> Would using batch be a better solution?
>> >> >>
>> >> >> On Mon, Apr 28, 2014 at 8:01 PM, Jean-Marc Spaggiari
>> >> >> <[email protected]> wrote:
>> >> >> > Hi Li Li,
>> >> >> >
>> >> >> > Yes, threads will impact performance. If you send all your writes with
>> >> >> > a single thread, a single HBase handler will take care of them, etc. HBase
>> >> >> > does not provide a single handler for a single client connection. It's able
>> >> >> > to handle multiple threads and clients.
>> >> >> >
>> >> >> > However, it also depends on the way you send your writes. If you send a
>> >> >> > single puts(<10000>) per second, it will not be better to use 10 000
>> >> >> > threads with a single put each.
>> >> >> >
>> >> >> > I recommend that you run some perf tests on your installation to find a
>> >> >> > good number for your configuration.
>> >> >> >
>> >> >> > JM
>> >> >> >
>> >> >> >
>> >> >> > 2014-04-28 6:27 GMT-04:00 Li Li <[email protected]>:
>> >> >> >
>> >> >> >> hi all,
>> >> >> >>    with the same read/write data, will the thread count affect performance?
>> >> >> >>    e.g. I have 10,000 write requests/second. I don't care about the order very
>> >> >> >> much.
>> >> >> >>    how many writer threads should I use to obtain maximum throughput?
>> >> >> >>
>> >> >>
>> >>
>>
