Sorry for all the unclear queries.

I turned off the WAL on both the doc and index tables.

In my system every document has a UUID (assigned before it comes into the 
system), and I just use this UUID as the rowkey. So "duplicates" basically 
means documents with the same ID, not documents that merely have the same 
contents.
For a poem like Mary had a little lamb, the whole poem would probably be 
counted as a single document. If such a document comes in, the counts of 
the words in the poem are incremented by their counts in the poem.
If multiple docs have the same content but different IDs, I just treat them as 
different docs and do the increments.
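
To make the rule above concrete, here is a minimal sketch of it as a plain 
in-memory Python model. It is not HBase client code: the class and method 
names (WordCountStore, put_if_absent, ingest) are made up for illustration, 
and put_if_absent only imitates the "write the row only if it is not there 
yet" behaviour that checkAndPut is used for in this thread.

```python
from collections import Counter


class WordCountStore:
    """Toy stand-in for the doc table plus the word-count index table."""

    def __init__(self):
        self.docs = {}                # rowkey (UUID string) -> document text
        self.word_counts = Counter()  # word -> running total across all docs

    def put_if_absent(self, doc_id, text):
        """Store the doc only if the rowkey is new; report whether it was stored."""
        if doc_id in self.docs:
            return False
        self.docs[doc_id] = text
        return True

    def ingest(self, doc_id, text):
        """Increment word counts only for docs whose ID has not been seen."""
        if not self.put_if_absent(doc_id, text):
            return False  # same ID seen before: skip the increments
        self.word_counts.update(text.lower().split())
        return True


poem = "mary had a little lamb little lamb"
store = WordCountStore()
store.ingest("uuid-1", poem)  # new ID: stored and counted
store.ingest("uuid-1", poem)  # same ID again: treated as a duplicate, ignored
store.ingest("uuid-2", poem)  # same content, different ID: counted again
```

With these three calls the store holds two docs, and each word of the poem 
has been counted twice over (once per distinct ID).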


Sincerely,
Prakash Kadel

On Feb 20, 2013, at 11:14 PM, Michel Segel <michael_se...@hotmail.com> wrote:

> 
> What happens when you have a poem like Mary had a little lamb?
> 
> Did you turn off the WAL on both table inserts, or just the index?
> 
> If you want to avoid processing duplicate docs... You could do this a couple 
> of ways. The simplest way is to record the doc ID and a checksum for the 
> doc. If the doc you are processing matches... You can simply do a NOOP for the 
> lines in the doc. (This isn't the fastest, but it's easy.)
> The other is to run a preprocess which removes duplicate docs from your 
> directory and you then process the docs...
> 
> Third thing... Do a code review. Sloppy code will kill performance...
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 20, 2013, at 5:26 AM, Prakash Kadel <prakash.ka...@gmail.com> wrote:
> 
>> Michael,
>> In fact I don't care about latency between the doc write and the index write.
>> Today I did some tests.
>> Turns out turning off the WAL does speed up the writes by about a factor of 2.
>> Interestingly, enabling the bloom filter did little to improve the checkAndPut.
>> 
>> earlier you mentioned
>>>>>> The OP doesn't really get in to the use case, so we don't know why the
>>>>> Check and Put in the M/R job.
>>>>>> He should just be using put() and then a postPut().
>> 
>> 
>> The main reason I use checkAndPut is to make sure the word count index 
>> doesn't get duplicate increments when duplicate documents come in. 
>> Additionally, I also need to dump dup-free docs to HDFS for a legacy system 
>> that we have in place.
>> Is there some way to avoid checkAndPut?
>> 
>> 
>> Sincerely,
>> Prakash 
>> 
>> On Feb 20, 2013, at 10:00 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>> 
>>> I was suggesting removing the write to WAL on your write to the index table 
>>> only.
>>> 
>>> The thing you have to realize is that true low latency systems use databases 
>>> as a sink. It's the end of the line so to speak.
>>> 
>>> So if you're worried about a small latency between the writing to your doc 
>>> table, and then the write of your index.. You are designing the wrong 
>>> system.
>>> 
>>> Consider that it takes some time t to write the base record and then to 
>>> write the indexes.
>>> For that period, you have a Schrödinger's cat problem as to whether the row 
>>> exists or not. Since HBase lacks transactions and ACID, trying to write a 
>>> solution where you require the low latency... You are using the wrong tool.
>> 
