I am trying to do something similar with HBase Map/Reduce. I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key, and amount as the value. I want to aggregate the amounts by event id type, and for this I am using a reducer. In the map phase I emit the event_id_type from the incoming row as the key, and the reducer then aggregates the amounts for each event type. Finally I write the results back into HBase.
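To make the comparison concrete, here is a toy sketch of both aggregation strategies. This is plain Python with made-up row keys, standing in for the real Java MapReduce job against HBase; the function names, the `mapper0` id, and the in-memory dicts (simulating the shuffle and the temp HBase table) are all illustrative assumptions, not actual Hadoop/HBase API:

```python
from collections import defaultdict

def event_type(row_key):
    # Row key layout from the post: prefix-event_id_type-timestamp-event_id
    prefix, ev_type, timestamp, event_id = row_key.split("-")
    return ev_type

def aggregate_with_reducer(rows):
    """Reducer approach: map emits (event_id_type, amount); the
    sort/shuffle groups by key and the reducer sums each group."""
    grouped = defaultdict(list)              # stands in for the shuffle
    for row_key, amount in rows:             # map phase
        grouped[event_type(row_key)].append(amount)
    return {k: sum(v) for k, v in grouped.items()}   # reduce phase

def aggregate_with_temp_table(rows):
    """Map-side alternative: for every input row, do a get on a
    mapperId_event_type row in a temp table, add the amount, and put
    it back. A single mapper ("mapper0") is simulated here."""
    temp_table = {}                          # stands in for the temp HBase table
    for row_key, amount in rows:
        key = "mapper0_" + event_type(row_key)
        current = temp_table.get(key, 0)     # get
        temp_table[key] = current + amount   # add, then put
    return temp_table

rows = [("p-click-1-a", 5), ("p-view-2-b", 3), ("p-click-3-c", 2)]
print(aggregate_with_reducer(rows))      # {'click': 7, 'view': 3}
print(aggregate_with_temp_table(rows))   # {'mapper0_click': 7, 'mapper0_view': 3}
```

Both produce the same per-type totals; the open question below is purely about cost: one get/add/put round trip per input row versus the sort/partition overhead of the shuffle.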
I hadn't thought about writing values directly into a temp HBase table in the map phase, as Mike suggested. In that case, each mapper can maintain its own mapperId_event_type row holding a totalAmount and, for each row it receives, do a get, add the current amount, and then a put. We are then doing a get/add/put for every row that a mapper receives. Is this any more efficient than the overhead of sorting/partitioning for a reducer? At the end of the map phase, aggregating the output of all the mappers should be trivial.

On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <[email protected]> wrote:

> Doug and company...
>
> Look, I'm not saying that there aren't m/r jobs where you might need reducers when working w HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you created a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that it's true of other jobs.
>
> Remember that HBase is much more than just an HFile format to persist stuff.
>
> Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.
>
> Does that make sense?
>
> -Mike
>
> > From: [email protected]
> > To: [email protected]
> > Date: Fri, 16 Sep 2011 15:41:44 -0400
> > Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >
> > Chris, agreed... There are times when reducers aren't required, and situations where they are useful. We have both kinds of jobs.
> >
> > For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
> >
> > http://hbase.apache.org/book.html#mapreduce.example
> >
> > As to the question that started this thread...
> >
> > re: "Store aggregated data in Oracle."
> >
> > To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step.
> >
> >
> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
> >
> > >If only I could make NY in Nov :)
> > >
> > >We extract large numbers of DNA sequence reads from HBase, run them through M/R pipelines to analyze and aggregate, and then we load the results back in. Definitely specialized usage, but I could see other perfectly valid uses for reducers with HBase.
> > >
> > >-chris
> > >
> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
> > >
> > >> Sonal,
> > >>
> > >> You do realize that HBase is a "database", right? ;-)
> > >>
> > >> So again, why do you need a reducer? ;-)
> > >>
> > >> Using your example...
> > >> "Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column."
> > >>
> > >> You can do this one of two ways...
> > >> 1) Dynamic Counters in Hadoop.
> > >> 2) Use a temp table and auto-increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word)
> > >>
> > >> I'm sorry, but if you go through all of your examples of why you would want to use a reducer, you end up finding out that writing to an HBase table would be faster than a reduce job.
> > >> (Again, we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.)
> > >>
> > >> The point I'm trying to make is that you want to avoid using a reducer whenever possible, and if you think about your problem... you can probably come up with a solution that avoids the reducer...
> > >>
> > >> HTH
> > >>
> > >> -Mike
> > >> PS. I haven't looked at *all* of the potential use cases of HBase, which is why I don't want to say you'll never need a reducer. I will say that based on what we've done at my client's site, we try very hard to avoid reducers.
> > >> [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ]
> > >>
> > >>
> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> > >>> From: [email protected]
> > >>> To: [email protected]
> > >>>
> > >>> Hi Michael,
> > >>>
> > >>> Yes, thanks, I understand the fact that reducers can be expensive with all the shuffling and the sorting, and you may not need them always. At the same time, there are many cases where reducers are useful, like secondary sorting. In many cases, one can have multiple map phases and not have a reduce phase at all. Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column.
> > >>>
> > >>> With this thought chain, I do not feel ready to say that when dealing with HBase, I really don't want to use a reducer. Please correct me if I am wrong.
> > >>>
> > >>> Thanks again.
> > >>>
> > >>> Best Regards,
> > >>> Sonal
> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> > >>> Nube Technologies <http://www.nubetech.co>
> > >>>
> > >>> <http://in.linkedin.com/in/sonalgoyal>
> > >>>
> > >>>
> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <[email protected]> wrote:
> > >>>
> > >>>>
> > >>>> Sonal,
> > >>>>
> > >>>> Just because you have a m/r job doesn't mean that you need to reduce anything. You can have a job that contains only a mapper.
> > >>>> Or your job runner can have a series of map jobs in serial.
> > >>>>
> > >>>> Most if not all of the map/reduce jobs where we pull data from HBase don't require a reducer.
> > >>>>
> > >>>> To give you a simple example... if I want to determine the table schema where I am storing some sort of structured data... I just write a m/r job which opens a table and scans it, counting the occurrence of each column name via dynamic counters.
> > >>>>
> > >>>> There is no need for a reducer.
> > >>>>
> > >>>> Does that help?
> > >>>>
> > >>>>
> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> > >>>>> From: [email protected]
> > >>>>> To: [email protected]
> > >>>>>
> > >>>>> Michel,
> > >>>>>
> > >>>>> Sorry, can you please help me understand what you mean when you say that when dealing with HBase, you really don't want to use a reducer? Here, HBase is being used as the input to the MR job.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Sonal
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]> wrote:
> > >>>>>
> > >>>>>> I think you need to get a little bit more information.
> > >>>>>> Reducers are expensive.
> > >>>>>> When Thomas says that he is aggregating data, what exactly does he mean?
> > >>>>>> When dealing w HBase, you really don't want to use a reducer.
> > >>>>>>
> > >>>>>> You may want to run two map jobs, and it could be that just dumping the output via jdbc makes the most sense.
> > >>>>>>
> > >>>>>> We are starting to see a lot of questions where the OP isn't providing enough information, so the recommendation could be wrong...
> > >>>>>>
> > >>>>>>
> > >>>>>> Sent from a remote device. Please excuse any typos...
> > >>>>>>
> > >>>>>> Mike Segel
> > >>>>>>
> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db package; you could use that. Or you could write to the hdfs and then use something like HIHO[1] to export to the db. I have been working extensively in this area, you can write to me directly if you need any help.
> > >>>>>>>
> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
> > >>>>>>>
> > >>>>>>> Best Regards,
> > >>>>>>> Sonal
> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> > >>>>>>> Nube Technologies <http://www.nubetech.co>
> > >>>>>>>
> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Hello,
> > >>>>>>>>
> > >>>>>>>> We are writing a MR-Job to process HBase data and store aggregated data in Oracle. How would you do that in a MR-job?
> > >>>>>>>>
> > >>>>>>>> Currently, for test purposes, we write the result into a HBase table again by using a TableReducer. Is there something like an OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should one simply use plain JDBC code in the reduce step?
> > >>>>>>>>
> > >>>>>>>> Thanks!
> > >>>>>>>>
> > >>>>>>>> Thomas
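The two word-count alternatives Mike lists in the quoted thread (dynamic counters in Hadoop, or a temp table whose cells are incremented) can be sketched as follows. This is a toy Python simulation, not real Hadoop/HBase code: the `Counter` stands in for Hadoop's dynamic counters (`context.getCounter(...)`) and the dict stands in for a temp HBase table updated with atomic increments; all names here are illustrative:

```python
from collections import Counter

def word_count_via_counters(column_values):
    """Approach 1: a mapper-only job that bumps one dynamic counter per
    word seen in the column; Hadoop would merge the counters from all
    mappers at the end of the job."""
    counters = Counter()
    for text in column_values:
        for word in text.split():
            counters[word] += 1   # ~ context.getCounter("words", word).increment(1)
    return counters

def word_count_via_temp_table(docs):
    """Approach 2: a fat row in a temp table, rowkey doc_id and one
    column per word (or rowkey doc_id|word); the real job would use an
    atomic increment so concurrent mappers don't clobber each other."""
    table = {}  # {(doc_id, word): count} stands in for the temp table
    for doc_id, text in docs:
        for word in text.split():
            cell = (doc_id, word)
            table[cell] = table.get(cell, 0) + 1   # ~ atomic increment
    return table

values = ["the quick fox", "the lazy dog"]
print(word_count_via_counters(values)["the"])        # 2

docs = [("d1", "a b a"), ("d2", "a")]
print(word_count_via_temp_table(docs)[("d1", "a")])  # 2
```

Either way the shuffle/sort of a reduce phase is avoided entirely, which is the crux of Mike's argument in the thread above.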
