I agree with the concern, and there isn't a ton of guidance in this area yet.
On 10/18/12 2:01 PM, "Michael Segel" <[email protected]> wrote:

>Doug,
>
>One thing that concerns me is that a lot of folks are gravitating to
>coprocessors and may be using them for the wrong thing.
>Has anyone done any sort of research into some of the limitations and
>negative impacts of using coprocessors?
>
>While I haven't really toyed with the idea of bulk deletes, periodic
>deletes are probably not a good use of coprocessors... however, using
>them to synchronize tables would be a valid use case.
>
>Thx
>
>-Mike
>
>On Oct 18, 2012, at 7:36 AM, Doug Meil <[email protected]>
>wrote:
>
>> To echo what Mike said about KISS: would you use triggers for a large,
>> time-sensitive batch job in an RDBMS? It's possible, but probably not.
>> Then you might want to think twice about using coprocessors for such a
>> purpose with HBase.
>>
>> On 10/17/12 9:50 PM, "Michael Segel" <[email protected]> wrote:
>>
>>> Run your weekly job in a low-priority fair scheduler/capacity
>>> scheduler queue.
>>>
>>> Maybe it's just me, but I look at coprocessors as a structure similar
>>> to RDBMS triggers and stored procedures.
>>> You need to restrain yourself and use them sparingly, otherwise you
>>> end up creating performance issues.
>>>
>>> Just IMHO.
>>>
>>> -Mike
>>>
>>> On Oct 17, 2012, at 8:44 PM, Jean-Marc Spaggiari
>>> <[email protected]> wrote:
>>>
>>>> I don't have any concern about the time it's taking. It's more about
>>>> the load it's putting on the cluster. I have other jobs that I need
>>>> to run (secondary index, data processing, etc.). So the more time
>>>> this new job takes, the less CPU the others will have.
>>>>
>>>> I tried the M/R and I really liked the way it's done. So my only
>>>> concern will really be the performance of the delete part.
>>>>
>>>> That's why I'm wondering what the best practice is to move a row to
>>>> another table.
>>>>
>>>> 2012/10/17, Michael Segel <[email protected]>:
>>>>> If you're going to be running this weekly, I would suggest that you
>>>>> stick with the M/R job.
>>>>>
>>>>> Is there any reason why you need to be worried about the time it
>>>>> takes to do the deletes?
>>>>>
>>>>> On Oct 17, 2012, at 8:19 PM, Jean-Marc Spaggiari
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> I'm expecting to run the job weekly. I initially thought about
>>>>>> using endpoints because I found HBASE-6942, which was a good
>>>>>> example for my needs.
>>>>>>
>>>>>> I'm fine with the Put part of the Map/Reduce, but I'm not sure
>>>>>> about the delete. That's why I looked at coprocessors. Then I
>>>>>> figured that I could also do the Put on the coprocessor side.
>>>>>>
>>>>>> In a M/R, can I delete the row I'm dealing with based on some
>>>>>> criteria like timestamp? If I do that, I will not be doing bulk
>>>>>> deletes; I will be deleting the rows one by one, right? Which
>>>>>> might be very slow.
>>>>>>
>>>>>> If in the future I want to run the job daily, might that be an
>>>>>> issue?
>>>>>>
>>>>>> Or should I go with the initial idea of doing the Put with the M/R
>>>>>> job and the delete with HBASE-6942?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/10/17, Michael Segel <[email protected]>:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm a firm believer in KISS (Keep It Simple, Stupid).
>>>>>>>
>>>>>>> The Map/Reduce (map job only) is the simplest and least prone to
>>>>>>> failure.
>>>>>>>
>>>>>>> Not sure why you would want to do this using coprocessors.
>>>>>>>
>>>>>>> How often are you running this job? It sounds like it's going to
>>>>>>> be sporadic.
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Oct 17, 2012, at 7:11 PM, Jean-Marc Spaggiari
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can someone please help me understand the pros and cons of these
>>>>>>>> two options for the following use case?
>>>>>>>>
>>>>>>>> I need to transfer all the rows between two timestamps to
>>>>>>>> another table.
>>>>>>>>
>>>>>>>> My first idea was to run a MapReduce to map the rows and store
>>>>>>>> them in another table, and then delete them using an endpoint
>>>>>>>> coprocessor. But the more I look into it, the more I think the
>>>>>>>> MapReduce is not a good idea and I should use a coprocessor
>>>>>>>> instead.
>>>>>>>>
>>>>>>>> BUT... The MapReduce framework guarantees me that it will run
>>>>>>>> against all the regions. I tried to stop a regionserver while
>>>>>>>> the job was running. The region moved, and the MapReduce
>>>>>>>> restarted the task from the new location. Will the coprocessor
>>>>>>>> do the same thing?
>>>>>>>>
>>>>>>>> Also, I found the web console for the MapReduce with the number
>>>>>>>> of jobs, the status, etc. Is there the same thing for
>>>>>>>> coprocessors?
>>>>>>>>
>>>>>>>> Are all coprocessors running at the same time on all regions,
>>>>>>>> which means we can have 100 of them running on a regionserver
>>>>>>>> at a time? Or are they running like the MapReduce jobs, based
>>>>>>>> on some configured values?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
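For reference, a minimal sketch of the map-only approach Mike recommends, written against the 0.94-era HBase client API that was current for this thread. It scans the source table for cells between the two timestamps, copies each row to a second table, and batches the deletes to address JM's row-by-row concern. The table names, pending-delete buffer, and batch size are illustrative placeholders, not anything specified in the thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MoveRowsBetweenTimestamps {

  static final int DELETE_BATCH = 1000;  // placeholder; tune for your cluster

  static class MoveMapper extends TableMapper<NullWritable, NullWritable> {
    private HTable source;
    private HTable target;
    private List<Delete> pendingDeletes = new ArrayList<Delete>();

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      source = new HTable(conf, "source_table");   // placeholder name
      target = new HTable(conf, "archive_table");  // placeholder name
      target.setAutoFlush(false);                  // buffer puts client-side
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context context)
        throws IOException {
      Put put = new Put(result.getRow());
      Delete delete = new Delete(result.getRow());
      for (KeyValue kv : result.raw()) {
        put.add(kv);  // keeps family, qualifier and original timestamp
        // Delete only the cells we copied; cells outside the scan's
        // time range stay in the source table.
        delete.deleteColumn(kv.getFamily(), kv.getQualifier(), kv.getTimestamp());
      }
      target.put(put);
      pendingDeletes.add(delete);
      if (pendingDeletes.size() >= DELETE_BATCH) {
        flushDeletes();  // batched RPCs instead of one round trip per row
      }
    }

    private void flushDeletes() throws IOException {
      source.delete(pendingDeletes);
      pendingDeletes.clear();
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      flushDeletes();
      target.close();  // flushes the buffered puts
      source.close();
    }
  }

  public static void main(String[] args) throws Exception {
    long minStamp = Long.parseLong(args[0]);
    long maxStamp = Long.parseLong(args[1]);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "move-rows-" + minStamp + "-" + maxStamp);
    job.setJarByClass(MoveRowsBetweenTimestamps.class);

    Scan scan = new Scan();
    scan.setTimeRange(minStamp, maxStamp);  // only cells in [minStamp, maxStamp)
    scan.setCaching(500);
    scan.setCacheBlocks(false);  // don't churn the block cache on a full scan

    TableMapReduceUtil.initTableMapperJob("source_table", scan,
        MoveMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);  // writes go through HTable
    job.setNumReduceTasks(0);  // map-only, per Mike's suggestion
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the batched client-side deletes still prove too slow, the BulkDeleteEndpoint from HBASE-6942 pushes the scan-and-delete down to each regionserver and avoids the client round trips entirely; the trade-offs are exactly what JM asks about above, since monitoring and failure handling are then on you rather than on the M/R framework.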

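On Mike's scheduler point: with the MRv1-era schedulers, routing the job to a low-priority queue is a property set on the job Configuration before the Job is built. A sketch for the driver above, assuming a queue/pool named "low-priority" has already been defined in the scheduler configuration (the names here are placeholders):

// In main() above, before constructing the Job.
// Capacity scheduler: submit to a named low-priority queue.
conf.set("mapred.job.queue.name", "low-priority");
// Fair scheduler equivalent: assign the job to a named pool.
conf.set("mapred.fairscheduler.pool", "low-priority");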