sorry for all these unclear queries. i turned off WAL on both the doc and index tables.
in my system all documents have a UUID (assigned before they come into the system). i just use this UUID as the rowkey. so duplicates basically means documents with the same id, regardless of their contents.
for a poem like Mary had a little lamb, the whole poem would probably be counted as a single document. if such a document comes in, the counts of the words in the poem would be incremented by the number of times each word appears in the poem.
if multiple docs have the same content but different ids, i just treat them as different docs and do the increments.

Sincerely,
Prakash Kadel

On Feb 20, 2013, at 11:14 PM, Michel Segel <michael_se...@hotmail.com> wrote:

> What happens when you have a poem like Mary had a little lamb?
>
> Did you turn off the WAL on both table inserts, or just the index?
>
> If you want to avoid processing duplicate docs... You could do this a couple
> of ways. The simplest way is to record the doc ID and a check sum for the
> doc. If the doc you are processing matches... You can simply do NOOP for the
> lines in the doc. (This isn't the fastest, but it's easy.)
> The other is to run a preprocess which removes duplicate docs from your
> directory and you then process the docs...
>
> Third thing... Do a code review. Sloppy code will kill performance...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 20, 2013, at 5:26 AM, Prakash Kadel <prakash.ka...@gmail.com> wrote:
>
>> michael,
>> in fact i don't care about latency between doc write and index write.
>> today i did some tests.
>> turns out turning off WAL does speed up the writes by about a factor of 2.
>> interestingly, enabling bloom filter did little to improve the checkAndPut.
>>
>> earlier you mentioned
>>>>>> The OP doesn't really get in to the use case, so we don't know why the
>>>>>> Check and Put in the M/R job.
>>>>>> He should just be using put() and then a postPut().
>>
>>
>> the main reason i use checkAndPut is to make sure the word count index
>> doesn't get duplicate increments when duplicate documents come in.
>> additionally i also need to dump dup-free docs to hdfs for a legacy system
>> that we have in place.
>> is there some way to avoid checkAndPut?
>>
>>
>> Sincerely,
>> Prakash
>>
>> On Feb 20, 2013, at 10:00 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>
>>> I was suggesting removing the write to WAL on your write to the index table
>>> only.
>>>
>>> The thing you have to realize is that true low latency systems use databases
>>> as a sink. It's the end of the line so to speak.
>>>
>>> So if you're worried about a small latency between the writing to your doc
>>> table, and then the write of your index.. You are designing the wrong
>>> system.
>>>
>>> Consider that it takes some time t to write the base record and then to
>>> write the indexes.
>>> For that period, you have a Schrödinger's cat problem as to if the row
>>> exists or not. Since HBase lacks transactions and ACID, trying to write a
>>> solution where you require the low latency... You are using the wrong tool.
>>
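
The duplicate rule discussed in this thread (a document counts as a duplicate only when its UUID has been seen before, and word counts are incremented only for new UUIDs) can be sketched roughly as follows. This is a minimal in-memory illustration and does not use the HBase client API: the class `DedupIndexSketch` and its methods `ingest` and `count` are hypothetical names, the two maps stand in for the doc table and the word-count index table, and `putIfAbsent` plays the part that an atomic check-and-put on the doc rowkey plays in the setup described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for the two tables in the thread.
// NOT the HBase API: putIfAbsent only mimics an atomic check-and-put.
class DedupIndexSketch {
    // "doc table": rowkey = document UUID, value = document body
    private final Map<String, String> docTable = new ConcurrentHashMap<>();
    // "index table": rowkey = word, value = running count
    private final Map<String, AtomicLong> wordCounts = new ConcurrentHashMap<>();

    // Returns true if the doc was new and got indexed,
    // false if a doc with the same ID was already stored.
    boolean ingest(String docId, String body) {
        // Atomic "insert only if the row is absent"; a repeated ID is a no-op,
        // so the index never receives duplicate increments for it.
        if (docTable.putIfAbsent(docId, body) != null) {
            return false;
        }
        // Same content under a different ID is treated as a different doc,
        // so every word occurrence in a new doc increments its count.
        for (String word : body.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                wordCounts.computeIfAbsent(word, w -> new AtomicLong())
                          .incrementAndGet();
            }
        }
        return true;
    }

    // Current count for a word, or 0 if it has never been indexed.
    long count(String word) {
        AtomicLong c = wordCounts.get(word);
        return c == null ? 0L : c.get();
    }

    public static void main(String[] args) {
        DedupIndexSketch index = new DedupIndexSketch();
        String poem = "Mary had a little lamb little lamb";
        System.out.println(index.ingest("uuid-1", poem)); // first arrival
        System.out.println(index.ingest("uuid-1", poem)); // same ID again
        System.out.println(index.count("lamb"));
    }
}
```

In this sketch the doc write and the index increments are still two separate steps, so the window Michel describes (the doc row exists but its index counts are not yet written) is not closed; the sketch only shows where the duplicate check sits relative to the increments.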