I agree with the concern, and there isn't a ton of guidance in this area yet.
On 10/18/12 2:01 PM, "Michael Segel" <[email protected]> wrote:

>Doug,
>
>One thing that concerns me is that a lot of folks are gravitating to
>coprocessors and may be using them for the wrong thing.
>Has anyone done any sort of research into some of the limitations and
>negative impacts of using coprocessors?
>
>While I haven't really toyed with the idea of bulk deletes, periodic
>deletes are probably not a good use of coprocessors... however, using
>them to synchronize tables would be a valid use case.
>
>Thx
>
>-Mike
>
>On Oct 18, 2012, at 7:36 AM, Doug Meil <[email protected]>
>wrote:
>
>> To echo what Mike said about KISS: would you use triggers for a large,
>> time-sensitive batch job in an RDBMS? It's possible, but probably not.
>> Then you might want to think twice about using coprocessors for such a
>> purpose with HBase.
>>
>> On 10/17/12 9:50 PM, "Michael Segel" <[email protected]> wrote:
>>
>>> Run your weekly job in a low-priority fair scheduler/capacity
>>> scheduler queue.
>>>
>>> Maybe it's just me, but I look at coprocessors as a structure similar
>>> to RDBMS triggers and stored procedures.
>>> You need to restrain yourself and use them sparingly, otherwise you
>>> end up creating performance issues.
>>>
>>> Just IMHO.
>>>
>>> -Mike
>>>
>>> On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
>>> <[email protected]> wrote:
>>>
>>>> I don't have any concern about the time it's taking. It's more about
>>>> the load it's putting on the cluster. I have other jobs that I need
>>>> to run (secondary index, data processing, etc.). So the more time
>>>> this new job takes, the less CPU the others will have.
>>>>
>>>> I tried the M/R and I really liked the way it's done. So my only
>>>> concern will really be the performance of the delete part.
>>>>
>>>> That's why I'm wondering what the best practice is to move a row to
>>>> another table.
>>>>
>>>> 2012/10/17, Michael Segel <[email protected]>:
>>>>> If you're going to be running this weekly, I would suggest that you
>>>>> stick with the M/R job.
>>>>>
>>>>> Is there any reason why you need to be worried about the time it
>>>>> takes to do the deletes?
>>>>>
>>>>> On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> I'm expecting to run the job weekly. I initially thought about
>>>>>> using endpoints because I found HBASE-6942, which was a good
>>>>>> example for my needs.
>>>>>>
>>>>>> I'm fine with the Put part of the Map/Reduce, but I'm not sure
>>>>>> about the delete. That's why I looked at coprocessors. Then I
>>>>>> figured that I could also do the Put on the coprocessor side.
>>>>>>
>>>>>> In a M/R, can I delete the row I'm dealing with based on some
>>>>>> criteria like timestamp? If I do that, I will not be doing bulk
>>>>>> deletes; I will be deleting the rows one by one, right? Which
>>>>>> might be very slow.
>>>>>>
>>>>>> If in the future I want to run the job daily, might that be an
>>>>>> issue?
>>>>>>
>>>>>> Or should I go with the initial idea of doing the Put with the M/R
>>>>>> job and the delete with HBASE-6942?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/10/17, Michael Segel <[email protected]>:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm a firm believer in KISS (Keep It Simple, Stupid).
>>>>>>>
>>>>>>> The Map/Reduce (map job only) is the simplest and least prone to
>>>>>>> failure.
>>>>>>>
>>>>>>> Not sure why you would want to do this using coprocessors.
>>>>>>>
>>>>>>> How often are you running this job? It sounds like it's going to
>>>>>>> be sporadic.
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can someone please help me understand the pros and cons of these
>>>>>>>> two options for the following use case?
>>>>>>>>
>>>>>>>> I need to transfer all the rows between two timestamps to
>>>>>>>> another table.
>>>>>>>>
>>>>>>>> My first idea was to run a MapReduce to map the rows and store
>>>>>>>> them in another table, and then delete them using an endpoint
>>>>>>>> coprocessor. But the more I look into it, the more I think the
>>>>>>>> MapReduce is not a good idea and I should use a coprocessor
>>>>>>>> instead.
>>>>>>>>
>>>>>>>> BUT... The MapReduce framework guarantees me that it will run
>>>>>>>> against all the regions. I tried to stop a regionserver while
>>>>>>>> the job was running. The region moved, and the MapReduce
>>>>>>>> restarted the task from the new location. Will the coprocessor
>>>>>>>> do the same thing?
>>>>>>>>
>>>>>>>> Also, I found the web console for the MapReduce with the number
>>>>>>>> of jobs, the status, etc. Is there the same thing for
>>>>>>>> coprocessors?
>>>>>>>>
>>>>>>>> Are all coprocessors running at the same time on all regions,
>>>>>>>> which means we can have 100 of them running on a regionserver
>>>>>>>> at a time? Or are they running like the MapReduce jobs, based
>>>>>>>> on some configured values?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
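For reference, a minimal sketch of the map-only approach Mike recommends, written against the 0.94-era HBase client API that was current for this thread. It scans the source table for cells between the two timestamps, copies each row to a second table, and batches the deletes to address JM's row-by-row concern. The table names, pending-delete buffer, and batch size are illustrative placeholders, not anything specified in the thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MoveRowsBetweenTimestamps {

  static final int DELETE_BATCH = 1000;  // placeholder; tune for your cluster

  static class MoveMapper extends TableMapper<NullWritable, NullWritable> {
    private HTable source;
    private HTable target;
    private List<Delete> pendingDeletes = new ArrayList<Delete>();

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      source = new HTable(conf, "source_table");   // placeholder name
      target = new HTable(conf, "archive_table");  // placeholder name
      target.setAutoFlush(false);                  // buffer puts client-side
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException {
      Put put = new Put(result.getRow());
      Delete delete = new Delete(result.getRow());
      for (KeyValue kv : result.raw()) {
        put.add(kv);  // keeps family, qualifier and original timestamp
        // Delete only the cells we copied; cells outside the scan's
        // time range stay in the source table.
        delete.deleteColumn(kv.getFamily(), kv.getQualifier(), kv.getTimestamp());
      }
      target.put(put);
      pendingDeletes.add(delete);
      if (pendingDeletes.size() >= DELETE_BATCH) {
        flushDeletes();  // batched RPCs instead of one round trip per row
      }
    }

    private void flushDeletes() throws IOException {
      source.delete(pendingDeletes);
      pendingDeletes.clear();
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      flushDeletes();
      target.close();  // flushes the buffered puts
      source.close();
    }
  }

  public static void main(String[] args) throws Exception {
    long minStamp = Long.parseLong(args[0]);
    long maxStamp = Long.parseLong(args[1]);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "move-rows-" + minStamp + "-" + maxStamp);
    job.setJarByClass(MoveRowsBetweenTimestamps.class);

    Scan scan = new Scan();
    scan.setTimeRange(minStamp, maxStamp);  // only cells in [minStamp, maxStamp)
    scan.setCaching(500);
    scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

    TableMapReduceUtil.initTableMapperJob("source_table", scan,
        MoveMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);  // writes go through HTable
    job.setNumReduceTasks(0);  // map-only, per Mike's suggestion
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the batched client-side deletes still prove too slow, the BulkDeleteEndpoint from HBASE-6942 pushes the scan-and-delete down to each regionserver and avoids the client round trips entirely; the trade-offs are exactly what JM asks about above, since monitoring and failure handling are then on you rather than on the M/R framework.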

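On Mike's scheduler point: with the MRv1-era schedulers, routing the job to a low-priority queue is a property set on the job Configuration before the Job is built. A sketch for the driver above, assuming a queue/pool named "low-priority" has already been defined in the scheduler configuration (the names here are placeholders):

// In main() above, before constructing the Job.
// Capacity scheduler: submit to a named low-priority queue.
conf.set("mapred.job.queue.name", "low-priority");
// Fair scheduler equivalent: assign the job to a named pool.
conf.set("mapred.fairscheduler.pool", "low-priority");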