Chris, I don't know what sort of aggregation you are doing, but again, why not write to a temp table instead of using a reducer?
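
A minimal sketch of that temp-table approach, assuming the 0.90-era HBase client API — the temp table "word_counts_tmp" and the source/target family and qualifier names ("d"/"text", "c"/"count") are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Map-only word count: each mapper folds its counts straight into a temp
// HBase table with atomic increments, so there is no shuffle, sort or reducer.
public class WordCountToTempTable
    extends TableMapper<ImmutableBytesWritable, Result> {

  // All table/family/qualifier names below are made up for the example.
  private static final byte[] SRC_CF  = Bytes.toBytes("d");
  private static final byte[] SRC_COL = Bytes.toBytes("text");
  private static final byte[] CNT_CF  = Bytes.toBytes("c");
  private static final byte[] CNT_COL = Bytes.toBytes("count");

  private HTable counts;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    counts = new HTable(conf, "word_counts_tmp");  // pre-created temp table
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    byte[] cell = value.getValue(SRC_CF, SRC_COL);
    if (cell == null) {
      return;
    }
    for (String word : Bytes.toString(cell).split("\\s+")) {
      if (word.isEmpty()) {
        continue;
      }
      // Atomic server-side increment: concurrent mappers can't clobber each other.
      counts.incrementColumnValue(Bytes.toBytes(word), CNT_CF, CNT_COL, 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    counts.close();
  }
}

The driver just wires the scan with TableMapReduceUtil.initTableMapperJob(...), sets job.setNumReduceTasks(0), and can use NullOutputFormat since the mapper writes its own output; when the job finishes, each row of word_counts_tmp holds one word's total in c:count.
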
> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> From: [email protected]
> Date: Fri, 16 Sep 2011 11:58:05 -0700
> To: [email protected]
>
> If only I could make NY in Nov :)
>
> We extract large numbers of DNA sequence reads from HBase, run them
> through M/R pipelines to analyze and aggregate, and then we load the
> results back in. Definitely specialized usage, but I could see other
> perfectly valid uses for reducers with HBase.
>
> -chris
>
> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>
> > Sonal,
> >
> > You do realize that HBase is a "database", right? ;-)
> >
> > So again, why do you need a reducer? ;-)
> >
> > Using your example...
> > "Again, there will be many cases where one may want a reducer, say
> > trying to count the occurrence of words in a particular column."
> >
> > You can do this one of two ways...
> > 1) Dynamic counters in Hadoop.
> > 2) Use a temp table and auto-increment the value in a column which
> > contains the word count. (Fat row where the rowkey is doc_id and the
> > column is the word, or the rowkey is doc_id|word.)
> >
> > I'm sorry, but if you go through all of your examples of why you would
> > want to use a reducer, you end up finding that writing to an HBase
> > table would be faster than a reduce job.
> > (Again, we haven't done an exhaustive search, but in all of the HBase
> > jobs we've run... no reducers were necessary.)
> >
> > The point I'm trying to make is that you want to avoid using a reducer
> > whenever possible, and if you think about your problem... you can
> > probably come up with a solution that avoids the reducer...
> >
> > HTH
> >
> > -Mike
> >
> > PS. I haven't looked at *all* of the potential use cases of HBase,
> > which is why I don't want to say you'll never need a reducer. I will
> > say that, based on what we've done at my client's site, we try very
> > hard to avoid reducers.
> > [Note, I'm sure I'm going to get hammered on this when I head to NY
> > in Nov. :-) ]
> >
> >> Date: Fri, 16 Sep 2011 23:00:49 +0530
> >> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> Hi Michael,
> >>
> >> Yes, thanks, I understand that reducers can be expensive with all the
> >> shuffling and the sorting, and you may not always need them. At the
> >> same time, there are many cases where reducers are useful, like
> >> secondary sorting. In many cases, one can have multiple map phases and
> >> not have a reduce phase at all. Again, there will be many cases where
> >> one may want a reducer, say trying to count the occurrence of words in
> >> a particular column.
> >>
> >> With this thought chain, I do not feel ready to say that when dealing
> >> with HBase, I really don't want to use a reducer. Please correct me if
> >> I am wrong.
> >>
> >> Thanks again.
> >>
> >> Best Regards,
> >> Sonal
> >> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >> Nube Technologies <http://www.nubetech.co>
> >>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >>
> >> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
> >> <[email protected]> wrote:
> >>
> >>> Sonal,
> >>>
> >>> Just because you have an M/R job doesn't mean that you need to reduce
> >>> anything. You can have a job that contains only a mapper.
> >>> Or your job runner can have a series of map jobs in serial.
> >>>
> >>> Most, if not all, of the map/reduce jobs where we pull data from HBase
> >>> don't require a reducer.
> >>>
> >>> To give you a simple example... if I want to determine the table
> >>> schema where I am storing some sort of structured data...
> >>> I just write an M/R job which opens a table and scans it, counting
> >>> the occurrence of each column name via dynamic counters.
> >>>
> >>> There is no need for a reducer.
> >>>
> >>> Does that help?
> >>>
> >>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> >>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>>> From: [email protected]
> >>>> To: [email protected]
> >>>>
> >>>> Michel,
> >>>>
> >>>> Sorry, can you please help me understand what you mean when you say
> >>>> that when dealing with HBase, you really don't want to use a reducer?
> >>>> Here, HBase is being used as the input to the MR job.
> >>>>
> >>>> Thanks
> >>>> Sonal
> >>>>
> >>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> I think you need to get a little bit more information.
> >>>>> Reducers are expensive.
> >>>>> When Thomas says that he is aggregating data, what exactly does he
> >>>>> mean?
> >>>>> When dealing with HBase, you really don't want to use a reducer.
> >>>>>
> >>>>> You may want to run two map jobs, and it could be that just dumping
> >>>>> the output via JDBC makes the most sense.
> >>>>>
> >>>>> We are starting to see a lot of questions where the OP isn't
> >>>>> providing enough information, so the recommendation could be wrong...
> >>>>>
> >>>>> Sent from a remote device. Please excuse any typos...
> >>>>>
> >>>>> Mike Segel
> >>>>>
> >>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
> >>>>>
> >>>>>> There is a DBOutputFormat class in the
> >>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use that. Or
> >>>>>> you could write to HDFS and then use something like HIHO [1] to
> >>>>>> export to the db. I have been working extensively in this area;
> >>>>>> you can write to me directly if you need any help.
> >>>>>>
> >>>>>> 1. https://github.com/sonalgoyal/hiho
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Sonal
> >>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >>>>>> Nube Technologies <http://www.nubetech.co>
> >>>>>>
> >>>>>> <http://in.linkedin.com/in/sonalgoyal>
> >>>>>>
> >>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We are writing an MR job to process HBase data and store the
> >>>>>>> aggregated data in Oracle. How would you do that in an MR job?
> >>>>>>>
> >>>>>>> Currently, for test purposes, we write the result into an HBase
> >>>>>>> table again by using a TableReducer. Is there something like an
> >>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
> >>>>>>> should one simply use plain JDBC code in the reduce step?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Thomas
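
To make the DBOutputFormat route concrete for the original question, a minimal sketch — the target Oracle table WORD_COUNTS(WORD, CNT), the JDBC URL, and the credentials are placeholders, and the reducer assumes the mappers emit (word, partial count) pairs:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class JdbcExport {

  // One row of the (placeholder) target Oracle table; DBOutputFormat binds
  // each emitted record to an INSERT statement.
  public static class AggRecord implements Writable, DBWritable {
    private String word;
    private long count;

    public AggRecord() { }

    public AggRecord(String word, long count) {
      this.word = word;
      this.count = count;
    }

    // DBWritable: set the INSERT parameters in the same order as the
    // field names passed to DBOutputFormat.setOutput(...).
    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, word);
      stmt.setLong(2, count);
    }

    public void readFields(ResultSet rs) throws SQLException {
      word = rs.getString(1);
      count = rs.getLong(2);
    }

    // Writable: only needed so the framework can move the object around.
    public void write(DataOutput out) throws IOException {
      out.writeUTF(word);
      out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
      word = in.readUTF();
      count = in.readLong();
    }
  }

  // Sums the per-mapper partial counts; DBOutputFormat turns each emitted
  // key into a single INSERT against the configured table.
  public static class SumToJdbcReducer
      extends Reducer<Text, LongWritable, AggRecord, NullWritable> {
    @Override
    protected void reduce(Text word, Iterable<LongWritable> partials, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable p : partials) {
        total += p.get();
      }
      context.write(new AggRecord(word.toString(), total), NullWritable.get());
    }
  }
}

The driver would point it at the database with DBConfiguration.configureDB(job.getConfiguration(), "oracle.jdbc.OracleDriver", jdbcUrl, user, password), then DBOutputFormat.setOutput(job, "WORD_COUNTS", "WORD", "CNT"), job.setOutputFormatClass(DBOutputFormat.class) and job.setOutputKeyClass(AggRecord.class); all of the connection details and table/column names are placeholders. The alternative the original question mentions — opening a plain JDBC connection in the reduce step — works too, it just means managing the connection and batching yourself.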

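And a minimal sketch of the dynamic-counter pattern described above for schema discovery — the counter group and names are just strings invented on the fly from whatever family:qualifier pairs the scan encounters, so no reducer (or output table) is needed. Keep in mind that Hadoop caps the number of distinct counters per job, so this fits bounded name sets like column names rather than an open-ended vocabulary:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Map-only schema census: every mapper bumps a dynamic counter named after
// each column it sees, so the job's counter summary ends up showing how often
// each family:qualifier occurs across the table.
public class ColumnCensusMapper
    extends TableMapper<ImmutableBytesWritable, Result> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // Dynamic counter: the group/name pair is created on demand.
      context.getCounter("ColumnNames", column).increment(1L);
    }
  }
}

After the job completes, job.getCounters() (or the job tracker's web UI) has the per-column totals; the same trick covers the word-count case as long as the set of distinct words stays small.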