Re: Coprocessor end point vs MapReduce?

Michael Segel Thu, 18 Oct 2012 11:04:04 -0700

Doug, 

One thing that concerns me is that a lot of folks are gravitating to 
Coprocessors and may be using them for the wrong thing. 
Has anyone done any research into the limitations and negative impacts of 
using coprocessors?


While I haven't really toyed with the idea of bulk deletes, periodic deletes are 
probably not a good use of coprocessors. However, using them to synchronize 
tables would be a valid use case.

Thx

-Mike

On Oct 18, 2012, at 7:36 AM, Doug Meil <doug.m...@explorysmedical.com> wrote:

> 
> To echo what Mike said about KISS, would you use triggers for a large
> time-sensitive batch job in an RDBMS?  It's possible, but probably not.
> Then you might want to think twice about using co-processors for such a
> purpose with HBase.
> 
> 
> 
> 
> 
> On 10/17/12 9:50 PM, "Michael Segel" <michael_se...@hotmail.com> wrote:
> 
>> Run your weekly job in a low priority fair scheduler/capacity scheduler
>> queue. 
>> 
>> Maybe it's just me, but I look at Coprocessors as a similar structure to
>> RDBMS triggers and stored procedures.
>> You need to show restraint and use them sparingly; otherwise you end up
>> creating performance issues.
>> 
>> Just IMHO.
>> 
>> -Mike
>> 
>> On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
>> <jean-m...@spaggiari.org> wrote:
>> 
>>> I don't have any concern about the time it's taking. It's more about
>>> the load it's putting on the cluster. I have other jobs that I need to
>>> run (secondary index, data processing, etc.). So the more time this
>>> new job is taking, the less CPU the others will have.
>>> 
>>> I tried the M/R and I really liked the way it's done. So my only
>>> concern will really be the performance of the delete part.
>>> 
>>> That's why I'm wondering what's the best practice to move a row to
>>> another table.
>>> 
>>> 2012/10/17, Michael Segel <michael_se...@hotmail.com>:
>>>> If you're going to be running this weekly, I would suggest that you
>>>> stick with the M/R job.
>>>> 
>>>> Is there any reason why you need to be worried about the time it takes
>>>> to do the deletes?
>>>> 
>>>> 
>>>> On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
>>>> <jean-m...@spaggiari.org>
>>>> wrote:
>>>> 
>>>>> Hi Mike,
>>>>> 
>>>>> I'm expecting to run the job weekly. I initially thought about using
>>>>> end points because I found HBASE-6942 which was a good example for my
>>>>> needs.
>>>>> 
>>>>> I'm fine with the Put part for the Map/Reduce, but I'm not sure about
>>>>> the delete. That's why I looked at coprocessors. Then I figured that I
>>>>> could also do the Put on the coprocessor side.
>>>>> 
>>>>> In an M/R job, can I delete the row I'm dealing with based on some
>>>>> criteria like timestamp? If I do that, I will not be doing bulk deletes;
>>>>> I will be deleting the rows one by one, right? That might be very slow.
>>>>> 
>>>>> If in the future I want to run the job daily, might that be an issue?
>>>>> 
>>>>> Or should I go with the initial idea of doing the Put with the M/R job
>>>>> and the delete with HBASE-6942?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> JM
>>>>> 
>>>>> 
>>>>> 2012/10/17, Michael Segel <michael_se...@hotmail.com>:
>>>>>> Hi,
>>>>>> 
>>>>>> I'm a firm believer in KISS (Keep It Simple, Stupid)
>>>>>> 
>>>>>> The Map/Reduce (map job only) is the simplest and least prone to
>>>>>> failure.
>>>>>> 
>>>>>> Not sure why you would want to do this using coprocessors.
>>>>>> 
>>>>>> How often are you running this job? It sounds like it's going to be
>>>>>> sporadic.
>>>>>> 
>>>>>> -Mike
>>>>>> 
>>>>>> On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
>>>>>> <jean-m...@spaggiari.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Can someone please help me to understand the pros and cons of
>>>>>>> these 2 options for the following use case?
>>>>>>> 
>>>>>>> I need to transfer all the rows between 2 timestamps to another
>>>>>>> table.
>>>>>>> 
>>>>>>> My first idea was to run a MapReduce job to map the rows and store
>>>>>>> them in another table, and then delete them using an end point
>>>>>>> coprocessor. But the more I look into it, the more I think the
>>>>>>> MapReduce is not a good idea and I should use a coprocessor instead.
>>>>>>> 
>>>>>>> BUT... The MapReduce framework guarantees that it will run against
>>>>>>> all the regions. I tried stopping a regionserver while the job was
>>>>>>> running. The region moved, and the MapReduce restarted the job from
>>>>>>> the new location. Will the coprocessor do the same thing?
>>>>>>> 
>>>>>>> Also, I found the web console for MapReduce with the number of
>>>>>>> jobs, the status, etc. Is there an equivalent for
>>>>>>> coprocessors?
>>>>>>> 
>>>>>>> Are all coprocessors running at the same time on all regions, which
>>>>>>> means we can have 100 of them running on a regionserver at a time? Or
>>>>>>> are they scheduled like the MapReduce jobs, based on some configured
>>>>>>> values?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> JM
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
> 
>
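
For readers following the thread, here is a minimal sketch of the use case being discussed: copy every row whose timestamp falls between two bounds into a second table, then delete the copied rows from the source in batches rather than one at a time. It is not HBase client code and does not reflect any poster's implementation. The "tables" are in-memory sorted maps, and the names RowMover, Cell, moveRows, and batchSize are hypothetical, chosen only for illustration. It also assumes one timestamped value per row and a half-open range (start inclusive, end exclusive).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the thread's use case. This is NOT HBase API:
// each "table" is a sorted map from row key to a single timestamped cell.
final class RowMover {

    // Hypothetical cell type: one value and its timestamp per row.
    static final class Cell {
        final long timestamp;
        final String value;

        Cell(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Copies every row with startTs <= timestamp < endTs from source to dest,
    // then removes the copied rows from source in batches of batchSize keys.
    // Returns the number of rows moved.
    static int moveRows(TreeMap<String, Cell> source, TreeMap<String, Cell> dest,
                        long startTs, long endTs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }

        // Pass 1: copy matching rows and remember their keys.
        // The source is not modified while it is being scanned.
        List<String> movedKeys = new ArrayList<>();
        for (Map.Entry<String, Cell> entry : source.entrySet()) {
            long ts = entry.getValue().timestamp;
            if (ts >= startTs && ts < endTs) {
                dest.put(entry.getKey(), entry.getValue());
                movedKeys.add(entry.getKey());
            }
        }

        // Pass 2: delete only rows that were copied, one batch at a time.
        for (int from = 0; from < movedKeys.size(); from += batchSize) {
            int to = Math.min(from + batchSize, movedKeys.size());
            deleteBatch(source, movedKeys.subList(from, to));
        }
        return movedKeys.size();
    }

    // Stand-in for a single batched delete request against the source table.
    private static void deleteBatch(TreeMap<String, Cell> table, List<String> keys) {
        for (String key : keys) {
            table.remove(key);
        }
    }
}
```

Doing the copy before any delete means a failure partway through leaves rows duplicated in both tables rather than lost, and rerunning the move simply overwrites the copies already made. In a real cluster the copy pass would correspond to the map-only job recommended above; how deletes are batched there depends on the client API in use.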
