I am using HBase to store information for a web spider. I have a table that stores information about each web page: the rowkey is the URL, and there are other columns such as status (int) and depth (int). In the beginning, the status is 0. A worker thread selects URLs whose status is 0, processes them, and sets the status to 1. More than one URL can link to a given URL, e.g. url1->url and url2->url, so url is inserted twice. If I do not use checkAndPut, this can happen: thread 1 inserts url, and the worker thread processes it and sets its status to 1. Then thread 2 inserts url again and resets the status to 0, so the worker thread processes it a second time. That's not what I want.
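The race described above can be illustrated with a small in-memory sketch. This is not HBase code: a ConcurrentHashMap stands in for the table, and the names InsertOnceSketch, blindInsert, insertIfAbsent, and markDone are made up for illustration. In HBase, the guarded insert would roughly correspond to calling checkAndPut with a null expected value, which checks that the column does not exist yet.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory model of the "status" column, keyed by URL (hypothetical names).
class InsertOnceSketch {
    static final int PENDING = 0; // not yet processed by the worker
    static final int DONE = 1;    // already processed by the worker

    private final ConcurrentMap<String, Integer> status = new ConcurrentHashMap<>();

    // Models a plain put: always overwrites, so it can reset DONE back to PENDING.
    public void blindInsert(String url) {
        status.put(url, PENDING);
    }

    // Models a check-and-put against "row absent": writes only if the URL
    // is not in the table yet. Returns true if this call inserted the row.
    public boolean insertIfAbsent(String url) {
        return status.putIfAbsent(url, PENDING) == null;
    }

    // Models the worker thread marking a URL as processed.
    public void markDone(String url) {
        status.put(url, DONE);
    }

    public Integer statusOf(String url) {
        return status.get(url);
    }

    public static void main(String[] args) {
        // Plain put: the second insert resets the status.
        InsertOnceSketch plain = new InsertOnceSketch();
        plain.blindInsert("url");  // discovered via url1->url
        plain.markDone("url");     // worker processes it
        plain.blindInsert("url");  // discovered again via url2->url
        System.out.println("plain put status:   " + plain.statusOf("url"));

        // Guarded insert: the second insert is rejected, status stays DONE.
        InsertOnceSketch guarded = new InsertOnceSketch();
        guarded.insertIfAbsent("url");
        guarded.markDone("url");
        boolean insertedAgain = guarded.insertIfAbsent("url");
        System.out.println("guarded put status: " + guarded.statusOf("url")
                + ", inserted again: " + insertedAgain);
    }
}
```

With the plain put the URL ends up pending again after the second discovery, while the guarded insert leaves the processed status untouched. Each guarded insert is its own atomic check, which is why it cannot simply be folded into an unchecked batch of puts.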

On Tue, Apr 29, 2014 at 8:56 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> Why do you want to make sure the row is only inserted once? If you insert
> the same row twice the 2nd one will simply overwrite the first one and
> HBase will take care of the versions.
>
> Regarding the code fragments, I don't think the autoflush is going to make a
> big difference compared to the cost of the check & put...
>
>
> 2014-04-28 20:50 GMT-04:00 Li Li <[email protected]>:
>
>> I must use checkAndPut to ensure a row is only inserted once.
>> If I have 1000 checkAndPut calls, will setAutoFlush(false) be useful?
>> Is there any performance difference between the following two code fragments?
>> 1.
>> table.setAutoFlush(false);
>> for(int i=0;i<1000;i++){
>>     Put put=...
>>     table.checkAndPut(,....put);
>> }
>> 2.
>> table.setAutoFlush(true);
>> for(int i=0;i<1000;i++){
>>     Put put=...
>>     table.checkAndPut(,....put);
>> }
>>
>> On Tue, Apr 29, 2014 at 8:36 AM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>> > It depends. Batching a list of puts/gets will be way faster than checkAndPut,
>> > but the result will not be the same... a batch of puts will not do any
>> > check...
>> >
>> >
>> > 2014-04-28 20:17 GMT-04:00 Li Li <[email protected]>:
>> >
>> >> But I have many checkAndPut operations.
>> >> Will using batch be a better solution?
>> >>
>> >> On Mon, Apr 28, 2014 at 8:01 PM, Jean-Marc Spaggiari
>> >> <[email protected]> wrote:
>> >> > Hi Li Li,
>> >> >
>> >> > Yes, threads will impact the performance. If you send all your writes with
>> >> > a single thread, a single HBase handler will take care of them, etc. HBase
>> >> > does not provide a single handler for a single client connection. It's able
>> >> > to handle multiple threads and clients.
>> >> >
>> >> > However, it also all depends on the way you send your writes. If you send a
>> >> > single puts(<10000>) per second, it will not be better to send 10 000
>> >> > threads with a single put.
>> >> >
>> >> > I would recommend you run some perf tests on your installation to find a
>> >> > good number for your configuration.
>> >> >
>> >> > JM
>> >> >
>> >> >
>> >> > 2014-04-28 6:27 GMT-04:00 Li Li <[email protected]>:
>> >> >
>> >> >> Hi all,
>> >> >> With the same read/write data, will the thread count affect performance?
>> >> >> E.g. I have 10,000 write requests/second. I don't care about the order very
>> >> >> much.
>> >> >> How many writer threads should I use to obtain maximum throughput?
