But, if I may follow up on my own message: I'll definitely keep my eyes open for times when we really don't need a reducer. I can see what you are saying, and that people should think a bit more laterally and use HBase for different, and potentially more efficient, workflows.
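For readers following the bulk-import point in the quoted thread below, here is a minimal sketch of a bulk-load driver (the table name, output path, and mapper are hypothetical; the APIs are the HBase 0.90-era `org.apache.hadoop.hbase.mapreduce` classes). It shows why "bulk importing uses a reducer": `configureIncrementalLoad()` wires in a sort reducer and a partitioner matched to the table's regions so the HFiles come out sorted per region, even when your own logic is map-only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-temp-table");
    job.setJarByClass(BulkLoadJob.class);
    // job.setMapperClass(...): your analysis mapper, emitting
    // ImmutableBytesWritable row keys and KeyValue/Put values.

    HTable table = new HTable(conf, "temp_results"); // hypothetical table name
    // Adds the sort reducer and a TotalOrderPartitioner matched to the
    // table's current region boundaries -- this is the implicit reducer:
    HFileOutputFormat.configureIncrementalLoad(job, table);

    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
    // Afterwards, move the HFiles into the table with the
    // completebulkload tool, e.g.:
    //   hadoop jar hbase-<version>.jar completebulkload <outdir> temp_results
  }
}
```

This requires a running HBase cluster and the HBase/Hadoop jars on the classpath, so it is a sketch rather than something you can run standalone.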
-chris

On Sep 16, 2011, at 2:54 PM, Chris Tarnas wrote:

> Hi Mike,
>
> It's analysis* and aggregation, not just aggregation, so it's a bit more
> complex. Each row in the input generates at least one new row of data when
> we are done.
>
> For our data sizes (~1 billion 2-3 KB rows per job now, and growing) we
> originally did normal inserts, but then we switched to bulk imports - it
> was much faster and put a lot less stress on the regionservers. Bulk
> importing uses a reducer, so even if we went through and changed our M/R
> pipelines to use a temp table for organized intermediate data, the most
> efficient way to populate the temp table would be via the bulk loader -
> using a reducer anyway.
>
> -chris
>
> * Sorry to be broad, but for business reasons I can't talk too much about
> the analysis details.
>
>
> On Sep 16, 2011, at 1:11 PM, Michael Segel wrote:
>
>> Chris,
>>
>> I don't know what sort of aggregation you are doing, but again, why not
>> write to a temp table instead of using a reducer?
>>
>>
>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>> From: [email protected]
>>> Date: Fri, 16 Sep 2011 11:58:05 -0700
>>> To: [email protected]
>>>
>>> If only I could make NY in Nov :)
>>>
>>> We extract large numbers of DNA sequence reads from HBase, run them
>>> through M/R pipelines to analyze and aggregate, and then we load the
>>> results back in. Definitely specialized usage, but I could see other
>>> perfectly valid uses for reducers with HBase.
>>>
>>> -chris
>>>
>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>
>>>> Sonal,
>>>>
>>>> You do realize that HBase is a "database", right? ;-)
>>>>
>>>> So again, why do you need a reducer? ;-)
>>>>
>>>> Using your example...
>>>> "Again, there will be many cases where one may want a reducer, say
>>>> trying to count the occurrence of words in a particular column."
>>>>
>>>> You can do this one of two ways...
>>>> 1) Dynamic counters in Hadoop.
>>>> 2) Use a temp table and auto-increment the value in a column which
>>>> contains the word count. (Fat row where the rowkey is doc_id and the
>>>> column is word, or the rowkey is doc_id|word.)
>>>>
>>>> I'm sorry, but if you go through all of your examples of why you would
>>>> want to use a reducer, you end up finding that writing to an HBase
>>>> table would be faster than a reduce job.
>>>> (Again, we haven't done an exhaustive search, but in all of the HBase
>>>> jobs we've run... no reducers were necessary.)
>>>>
>>>> The point I'm trying to make is that you want to avoid using a reducer
>>>> whenever possible, and if you think about your problem... you can
>>>> probably come up with a solution that avoids the reducer...
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> PS. I haven't looked at *all* of the potential use cases of HBase,
>>>> which is why I don't want to say you'll never need a reducer. I will
>>>> say that, based on what we've done at my client's site, we try very
>>>> hard to avoid reducers. [Note: I'm sure I'm going to get hammered on
>>>> this when I head to NY in Nov. :-) ]
>>>>
>>>>
>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Yes, thanks, I understand that reducers can be expensive with all the
>>>>> shuffling and sorting, and that you may not always need them. At the
>>>>> same time, there are many cases where reducers are useful, like
>>>>> secondary sorting. In many cases, one can have multiple map phases
>>>>> and no reduce phase at all. Again, there will be many cases where one
>>>>> may want a reducer, say trying to count the occurrence of words in a
>>>>> particular column.
>>>>>
>>>>> With this thought chain, I do not feel ready to say that when dealing
>>>>> with HBase, I really don't want to use a reducer. Please correct me
>>>>> if I am wrong.
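Mike's option 2 above (a temp table with auto-incremented counts, no reducer) can be sketched roughly as follows. The table name, column family, and helper method are hypothetical; the API is the 0.90-era HBase client, where `incrementColumnValue` performs an atomic server-side increment.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountToTempTable {
  private HTable counts;

  public void setup() throws IOException {
    // Hypothetical temp table holding the running counts.
    counts = new HTable(HBaseConfiguration.create(), "word_counts");
  }

  // Called from the mapper for every word seen. HBase serializes the
  // increments server-side, so no shuffle/sort/reduce phase is needed.
  public void countWord(String docId, String word) throws IOException {
    counts.incrementColumnValue(
        Bytes.toBytes(docId),   // fat row: rowkey = doc_id
        Bytes.toBytes("wc"),    // column family
        Bytes.toBytes(word),    // qualifier = the word itself
        1L);                    // amount to add
  }
}
```

The trade-off is one RPC per increment versus one sorted shuffle; batching or client-side pre-aggregation can reduce the RPC load.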
>>>>> Thanks again.
>>>>>
>>>>> Best Regards,
>>>>> Sonal
>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>
>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>
>>>>>
>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Sonal,
>>>>>>
>>>>>> Just because you have a M/R job doesn't mean that you need to reduce
>>>>>> anything. You can have a job that contains only a mapper.
>>>>>> Or your job runner can have a series of map jobs in serial.
>>>>>>
>>>>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>>>> HBase don't require a reducer.
>>>>>>
>>>>>> To give you a simple example... if I want to determine the table
>>>>>> schema where I am storing some sort of structured data, I just write
>>>>>> a M/R job which opens the table and scans it, counting the
>>>>>> occurrence of each column name via dynamic counters.
>>>>>>
>>>>>> There is no need for a reducer.
>>>>>>
>>>>>> Does that help?
>>>>>>
>>>>>>
>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Michel,
>>>>>>>
>>>>>>> Sorry, can you please help me understand what you mean when you say
>>>>>>> that when dealing with HBase, you really don't want to use a
>>>>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Sonal
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think you need to get a little bit more information.
>>>>>>>> Reducers are expensive.
>>>>>>>> When Thomas says that he is aggregating data, what exactly does he
>>>>>>>> mean?
>>>>>>>> When dealing with HBase, you really don't want to use a reducer.
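The map-only "schema discovery" job Mike describes above could look roughly like this (a sketch, assuming the 0.90-era `TableMapper` API; the class name is made up). Each distinct column name becomes a dynamic Hadoop counter, so the framework does the aggregation and no reducer is configured.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class ColumnCountMapper
    extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
                    + Bytes.toString(kv.getQualifier());
      // Dynamic counter: one counter per distinct column name seen.
      context.getCounter("columns", column).increment(1);
    }
    // Nothing is emitted; set job.setNumReduceTasks(0) in the driver and
    // read the counters from the job after it completes.
  }
}
```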
>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>
>>>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>>>> providing enough information, so the recommendation could be
>>>>>>>> wrong...
>>>>>>>>
>>>>>>>>
>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>
>>>>>>>> Mike Segel
>>>>>>>>
>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use that.
>>>>>>>>> Or you could write to HDFS and then use something like HIHO[1] to
>>>>>>>>> export to the db. I have been working extensively in this area;
>>>>>>>>> you can write to me directly if you need any help.
>>>>>>>>>
>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Sonal
>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>
>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We are writing a MR job to process HBase data and store
>>>>>>>>>> aggregated data in Oracle. How would you do that in a MR job?
>>>>>>>>>>
>>>>>>>>>> Currently, for test purposes, we write the result into an HBase
>>>>>>>>>> table again by using a TableReducer. Is there something like an
>>>>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
>>>>>>>>>> should one simply use plain JDBC code in the reduce step?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Thomas
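Tying the thread together: Sonal's DBOutputFormat suggestion, applied to Thomas's original question, could be sketched like this. The table name, columns, JDBC URL, and credentials are hypothetical, and the Oracle JDBC driver jar would need to be on the task classpath; the Hadoop classes are the real `org.apache.hadoop.mapreduce.lib.db` API.

```java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleExport {

  // One output row: (word, total). DBOutputFormat writes the reducer's
  // key via DBWritable.write(); the value is ignored.
  public static class CountRecord implements DBWritable {
    String word;
    long total;
    CountRecord() {}
    CountRecord(String word, long total) { this.word = word; this.total = total; }
    public void write(PreparedStatement st) throws SQLException {
      st.setString(1, word);
      st.setLong(2, total);
    }
    public void readFields(ResultSet rs) throws SQLException {
      // Not used when the table is output-only.
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, CountRecord, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      ctx.write(new CountRecord(key.toString(), sum), NullWritable.get());
    }
  }

  public static void configure(Job job) throws IOException {
    DBConfiguration.configureDB(job.getConfiguration(),
        "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:orcl",  // hypothetical connection string
        "scott", "tiger");                     // hypothetical credentials
    job.setOutputFormatClass(DBOutputFormat.class);
    // Generates: INSERT INTO word_counts (word, total) VALUES (?, ?)
    DBOutputFormat.setOutput(job, "word_counts", "word", "total");
    job.setReducerClass(SumReducer.class);
  }
}
```

The alternative Thomas mentions, plain JDBC in the reduce step, also works; DBOutputFormat mainly saves the connection and batching boilerplate.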
