Thanks, I am going to run some tests and let you know.
On Mon, Feb 18, 2013 at 11:13 PM, Michel Segel <michael_se...@hotmail.com> wrote:
> Why are you using an HTablePool?
> Why are you closing the table after each iteration?
>
> Try using one HTable object. Turn off the WAL.
> Initialize it in start().
> Close it in stop().
> Surround the use in a try/catch.
> If an exception is caught, re-instantiate a new HTable connection.
>
> You may also want to flush the connection after the puts.
>
> Again, I am not sure why you are using checkAndPut on the base table. Your count
> could be off.
>
> As an example, look at the poem/rhyme 'Mary had a little lamb',
> then check your word count.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 18, 2013, at 7:21 AM, prakash kadel <prakash.ka...@gmail.com> wrote:
>
>> Thank you guys for your replies.
>> Michael,
>> I think I didn't make it clear. Here is my use case:
>>
>> I have text documents to insert into HBase (with possible duplicates).
>> Suppose I have a document such as: "I am working. He is not working"
>>
>> I want to insert this document into a table in HBase, say table "doc":
>>
>> =doc table=
>> -----
>> rowKey: doc_id
>> cf: doc_content
>> value: "I am working. He is not working"
>>
>> Now, I want to create another table that stores the word counts, say "doc_idx":
>>
>> =doc_idx table=
>> ---
>> rowKey: I,       cf: count, value: 1
>> rowKey: am,      cf: count, value: 1
>> rowKey: working, cf: count, value: 2
>> rowKey: He,      cf: count, value: 1
>> rowKey: is,      cf: count, value: 1
>> rowKey: not,     cf: count, value: 1
>>
>> My MR job code:
>> ==============
>>
>> if (doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>     for (String word : doc_content.split("\\s+")) {
>>         Increment inc = new Increment(Bytes.toBytes(word));
>>         inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
>>     }
>> }
>>
>> Now, I wanted to do some experiments with coprocessors, so I modified
>> the code as follows.
>>
>> My MR job code:
>> ===============
>>
>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>
>> Coprocessor code:
>> ===============
>>
>> public void start(CoprocessorEnvironment env) {
>>     pool = new HTablePool(conf, 100);
>> }
>>
>> public boolean postCheckAndPut(c, row, family, byte[] qualifier,
>>         compareOp, comparator, put, result) {
>>
>>     if (!result) return true; // the checkAndPut failed, so skip indexing
>>
>>     HTableInterface table_idx = pool.getTable("doc_idx");
>>     try {
>>         for (KeyValue contentKV : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
>>             for (String word : Bytes.toString(contentKV.getValue()).split("\\s+")) {
>>                 Increment inc = new Increment(Bytes.toBytes(word));
>>                 inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
>>                 table_idx.increment(inc);
>>             }
>>         }
>>     } finally {
>>         table_idx.close();
>>     }
>>     return true;
>> }
>>
>> public void stop(env) {
>>     pool.close();
>> }
>>
>> I am a newbie to HBase. I am not sure this is the right way to do it.
>> Given that, why is the coprocessor-enabled version much slower than
>> the one without?
>>
>> Sincerely,
>> Prakash Kadel
>>
>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>> <michael_se...@hotmail.com> wrote:
>>>
>>> The issue I was talking about was the use of a checkAndPut.
>>> The OP wrote:
>>>>>> each map inserts into the doc table (checkAndPut);
>>>>>> a RegionObserver coprocessor does a postCheckAndPut and inserts some rows
>>>>>> into an index table.
>>>
>>> My question is why does the OP use a checkAndPut, and the RegionObserver's
>>> postCheckAndPut?
>>>
>>> Here's a good example:
>>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>>
>>> The OP doesn't really get into the use case, so we don't know why there is a
>>> checkAndPut in the M/R job.
>>> He should just be using put() and then a postPut().
>>>
>>> Another issue... since he's writing to a different HTable... how? Does he
>>> create an HTable instance in the start() method of his RO object and then
>>> reference it later?
>>> Or does he create the instance of the HTable on the fly
>>> in each postCheckAndPut()?
>>> Without seeing his code, we don't know.
>>>
>>> Note that this is a synchronous set of writes. Your overall return from the
>>> M/R call to put will wait until the second row is inserted.
>>>
>>> Interestingly enough, you may want to consider disabling the WAL on the
>>> write to the index. You can always run a M/R job that rebuilds the index
>>> should something occur to the system where you might lose the data.
>>> Indexes *ARE* expendable. ;-)
>>>
>>> Does that explain it?
>>>
>>> -Mike
>>>
>>> On Feb 18, 2013, at 4:57 AM, yonghu <yongyong...@gmail.com> wrote:
>>>
>>>> Hi, Michael
>>>>
>>>> I don't quite understand what you mean by "round trip back to the
>>>> client". In my understanding, since the RegionServer and TaskTracker can
>>>> be the same node, MR doesn't have to pull data to the client and then
>>>> process it. You also mention "unnecessary overhead"; can you
>>>> explain a little what operations or data processing can be seen as
>>>> "unnecessary overhead"?
>>>>
>>>> Thanks
>>>>
>>>> yong
>>>>
>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>> <michael_se...@hotmail.com> wrote:
>>>>> Why?
>>>>>
>>>>> This seems like unnecessary overhead.
>>>>>
>>>>> You are writing code within the coprocessor on the server. Pessimistic
>>>>> code really isn't recommended if you are worried about performance.
>>>>>
>>>>> I have to ask... by the time you have executed the code in your
>>>>> coprocessor, what would cause the initial write to fail?
>>>>>
>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <prakash.ka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> It's a local read. I just check the last param of postCheckAndPut,
>>>>>> indicating whether the Put succeeded.
>>>>>> In case the put succeeds, I insert a
>>>>>> row in another table.
>>>>>>
>>>>>> Sincerely,
>>>>>> Prakash Kadel
>>>>>>
>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <w...@us.ibm.com> wrote:
>>>>>>
>>>>>>> Does your checkAndPut involve a local or a remote READ? Due to the nature of
>>>>>>> LSM trees, a read is much slower than a write...
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Wei
>>>>>>>
>>>>>>> From: Prakash Kadel <prakash.ka...@gmail.com>
>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>> Date: 02/17/2013 07:49 PM
>>>>>>> Subject: coprocessor enabled put very slow, help please~~~
>>>>>>>
>>>>>>> Hi,
>>>>>>> I am trying to insert a few million documents into HBase with MapReduce. To
>>>>>>> enable quick search of the docs I want to have some indexes, so I tried to use
>>>>>>> coprocessors, but they are slowing down my inserts. Aren't
>>>>>>> coprocessors supposed to avoid adding latency?
>>>>>>> My settings:
>>>>>>> 3 region servers
>>>>>>> 60 maps
>>>>>>> Each map inserts into the doc table (checkAndPut).
>>>>>>> A RegionObserver coprocessor does a postCheckAndPut and inserts some rows into
>>>>>>> an index table.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Prakash
>>>>>
>>>>> Michael Segel | (m) 312.755.9623
>>>>>
>>>>> Segel and Associates
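
As a side note on Mike's word-count caveat ('Mary had a little lamb'): here is a minimal, HBase-free Java sketch of the tokenize-and-count step discussed in the thread. The punctuation-stripping regex is an assumption on my part; the code in the thread only does split("\\s+"), which would treat "working." and "working" as two different row keys and so would not produce the doc_idx counts shown above.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Counts words the way the doc_idx table in the thread expects.
    // Assumption: punctuation is stripped before splitting on whitespace;
    // without this, "working." and "working" land under different row keys.
    static Map<String, Long> countWords(String docContent) {
        Map<String, Long> counts = new HashMap<>();
        String cleaned = docContent.replaceAll("[^\\p{Alnum}\\s]", "");
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countWords("I am working. He is not working");
        System.out.println(counts.get("working")); // 2, matching the doc_idx example
        System.out.println(counts.size());         // 6 distinct words
    }
}
```

Aggregating counts into a local map like this before issuing Increments would also cut the coprocessor's RPC volume: one increment per distinct word instead of one per token.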