However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and the mapper made a single pass over this map in the cleanup method to do the checkAndPuts, the writes would happen only once per map task rather than once per row (which would be really expensive).
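For illustration, a minimal sketch of that pattern against the 0.90-era client API: a TableMapper that accumulates counts in a HashMap and flushes them once from cleanup(). The temp table name ("temp"), column family ("f"), qualifier ("count"), and the extractEventType() helper are all hypothetical.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("count");

  // In-memory aggregation: one entry per aggregate key, not per row.
  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) {
    String aggregate = extractEventType(row.get()); // hypothetical key-parsing helper
    Long current = counts.get(aggregate);
    counts.put(aggregate, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Single pass over the map: one write per aggregate per map task.
    HTable tempTable = new HTable(context.getConfiguration(), "temp");
    try {
      String taskId = context.getTaskAttemptID().getTaskID().toString();
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        byte[] rowKey = Bytes.toBytes(taskId + "_" + e.getKey());
        Put put = new Put(rowKey);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(e.getValue().longValue()));
        // checkAndPut with a null expected value succeeds only if the cell
        // does not exist yet, so a re-run task attempt cannot double-write.
        tempTable.checkAndPut(rowKey, FAMILY, QUALIFIER, null, put);
      }
    } finally {
      tempTable.close();
    }
  }

  private String extractEventType(byte[] rowKey) {
    // Hypothetical: pull the aggregate key out of a
    // prefix-event_id_type-timestamp-event_id row key.
    return Bytes.toString(rowKey).split("-")[1];
  }
}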
A single region on a single RS could handle that no problem.


On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:

>I see what you are saying about the temp table being hosted at a
>single region server - especially for a limited set of rows that
>just care about the aggregations, but receive a lot of traffic. I
>wonder if this will also be the case if I were to use the source table
>to maintain these temporary records, and not create a temp table on
>the fly ...
>
>On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil
><[email protected]> wrote:
>>
>> I'll add this to the book in the MR section.
>>
>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>
>>>I was in the middle of responding to Mike's email when yours arrived, so
>>>I'll respond to both.
>>>
>>>I think the temp-table idea is interesting. The caution is that a
>>>default temp-table creation will be hosted on a single RS and thus be a
>>>bottleneck for aggregation. So I would imagine that you would need to
>>>tune the temp-table for the job and pre-create regions.
>>>
>>>Doug
>>>
>>>On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>
>>>>I am trying to do something similar with HBase Map/Reduce.
>>>>
>>>>I have event ids and amounts stored in HBase in the following format:
>>>>prefix-event_id_type-timestamp-event_id as the row key, and amount as
>>>>the value.
>>>>I want to be able to aggregate the amounts based on the event id type,
>>>>and for this I am using a reducer. I basically key on the
>>>>event_id_type from the incoming row in the map phase, and perform the
>>>>aggregation in the reducer on the amounts for each event type. Then I
>>>>write the results back into HBase.
>>>>
>>>>I hadn't thought about writing values directly into a temp HBase table
>>>>in the map phase, as suggested by Mike.
>>>>
>>>>For this case, each mapper can declare its own mapperId_event_type row
>>>>with totalAmount, and for each row it receives, do a get, add the
>>>>current amount, and then a put. We are basically then doing a
>>>>get/add/put for every row that a mapper receives. Is this any more
>>>>efficient when compared to the overhead of sorting/partitioning for a
>>>>reducer?
>>>>
>>>>At the end of the map phase, aggregating the output of all the
>>>>mappers should be trivial.
>>>>
>>>>On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>>>><[email protected]> wrote:
>>>>>
>>>>> Doug and company...
>>>>>
>>>>> Look, I'm not saying that there aren't M/R jobs where you might need
>>>>>reducers when working with HBase. What I am saying is that if we look
>>>>>at what you're attempting to do, you may end up getting better
>>>>>performance if you created a temp table in HBase and let HBase do some
>>>>>of the heavy lifting where you are currently using a reducer. From the
>>>>>jobs that we run, when we looked at what we were doing, there wasn't
>>>>>any need for a reducer. I suspect that it's true of other jobs.
>>>>>
>>>>> Remember that HBase is much more than just an HFile format to persist
>>>>>stuff.
>>>>>
>>>>> Even looking at Sonal's example... you have other ways of doing the
>>>>>record counts, like dynamic counters or using a temp table in HBase,
>>>>>which I believe will give you better performance numbers, although I
>>>>>haven't benchmarked either against a reducer.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> -Mike
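Doug's caution about a default temp table being hosted on a single RS can be addressed at creation time by pre-splitting. A minimal sketch, assuming a hypothetical table "temp" with family "f" and single-letter split points; in practice the split keys should match the expected key distribution of the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTempTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("temp"); // hypothetical name
    desc.addFamily(new HColumnDescriptor("f"));

    // Pre-create regions by supplying split keys, so the temp table is
    // spread across region servers from the start instead of funneling
    // all aggregation writes through a single RS.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("d"), Bytes.toBytes("h"),
        Bytes.toBytes("m"), Bytes.toBytes("r")
    };
    admin.createTable(desc, splits);
  }
}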
>>>>>
>>>>> > From: [email protected]
>>>>> > To: [email protected]
>>>>> > Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>> > Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> >JDBCReducer ...
>>>>> >
>>>>> > Chris, agreed... There are situations where reducers aren't
>>>>> > required, and then situations where they are useful. We have both
>>>>> > kinds of jobs.
>>>>> >
>>>>> > For others following the thread, I updated the book recently with
>>>>> > more MR examples (read-only, read-write, read-summary):
>>>>> >
>>>>> > http://hbase.apache.org/book.html#mapreduce.example
>>>>> >
>>>>> > As to the question that started this thread...
>>>>> >
>>>>> > re: "Store aggregated data in Oracle."
>>>>> >
>>>>> > To me, that sounds like the "read-summary" example with JDBC-Oracle
>>>>> > in the reduce step.
>>>>> >
>>>>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>> >
>>>>> > >If only I could make NY in Nov :)
>>>>> > >
>>>>> > >We extract large numbers of DNA sequence reads from HBase, run
>>>>> > >them through M/R pipelines to analyze and aggregate, and then we
>>>>> > >load the results back in. Definitely specialized usage, but I
>>>>> > >could see other perfectly valid uses for reducers with HBase.
>>>>> > >
>>>>> > >-chris
>>>>> > >
>>>>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>> > >
>>>>> > >> Sonal,
>>>>> > >>
>>>>> > >> You do realize that HBase is a "database", right? ;-)
>>>>> > >>
>>>>> > >> So again, why do you need a reducer? ;-)
>>>>> > >>
>>>>> > >> Using your example...
>>>>> > >> "Again, there will be many cases where one may want a reducer,
>>>>> > >>say trying to count the occurrence of words in a particular
>>>>> > >>column."
>>>>> > >>
>>>>> > >> You can do this one of two ways...
>>>>> > >> 1) Dynamic counters in Hadoop.
>>>>> > >> 2) Use a temp table and auto-increment the value in a column
>>>>> > >>which contains the word count. (Fat row where the rowkey is
>>>>> > >>doc_id and the column is the word, or the rowkey is doc_id|word.)
>>>>> > >>
>>>>> > >> I'm sorry, but if you go through all of your examples of why you
>>>>> > >>would want to use a reducer, you end up finding that writing to
>>>>> > >>an HBase table would be faster than a reduce job.
>>>>> > >> (Again, we haven't done an exhaustive search, but in all of the
>>>>> > >>HBase jobs we've run... no reducers were necessary.)
>>>>> > >>
>>>>> > >> The point I'm trying to make is that you want to avoid using a
>>>>> > >>reducer whenever possible, and if you think about your problem...
>>>>> > >>you can probably come up with a solution that avoids the
>>>>> > >>reducer...
>>>>> > >>
>>>>> > >> HTH
>>>>> > >>
>>>>> > >> -Mike
>>>>> > >> PS. I haven't looked at *all* of the potential use cases of
>>>>> > >>HBase, which is why I don't want to say you'll never need a
>>>>> > >>reducer. I will say that, based on what we've done at my client's
>>>>> > >>site, we try very hard to avoid reducers.
>>>>> > >> [Note, I'm sure I'm going to get hammered on this when I head to
>>>>> > >>NY in Nov. :-) ]
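Mike's second option above might look like the following sketch, using HBase's atomic incrementColumnValue() so each mapper bumps the count server-side with no get/add/put round trip and no reducer. The table name ("wordcounts"), family ("f"), and qualifier ("count") are hypothetical; the doc_id|word rowkey variant is shown.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountIncrementer {

  private final HTable wordCounts;

  public WordCountIncrementer(Configuration conf) throws IOException {
    wordCounts = new HTable(conf, "wordcounts"); // hypothetical temp table
  }

  // Called from the mapper for each word seen in the column being counted.
  public void count(String docId, String word) throws IOException {
    byte[] rowKey = Bytes.toBytes(docId + "|" + word);
    // Atomic server-side increment: no read-modify-write cycle in the client.
    wordCounts.incrementColumnValue(rowKey, Bytes.toBytes("f"),
        Bytes.toBytes("count"), 1L);
  }
}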
>>>>> > >>>
>>>>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> > >>>JDBCReducer ...
>>>>> > >>> From: [email protected]
>>>>> > >>> To: [email protected]
>>>>> > >>>
>>>>> > >>> Hi Michael,
>>>>> > >>>
>>>>> > >>> Yes, thanks, I understand the fact that reducers can be
>>>>> > >>>expensive with all the shuffling and the sorting, and you may
>>>>> > >>>not always need them. At the same time, there are many cases
>>>>> > >>>where reducers are useful, like secondary sorting. In many
>>>>> > >>>cases, one can have multiple map phases and not have a reduce
>>>>> > >>>phase at all. Again, there will be many cases where one may want
>>>>> > >>>a reducer, say trying to count the occurrence of words in a
>>>>> > >>>particular column.
>>>>> > >>>
>>>>> > >>> With this thought chain, I do not feel ready to say that when
>>>>> > >>>dealing with HBase, I really don't want to use a reducer. Please
>>>>> > >>>correct me if I am wrong.
>>>>> > >>>
>>>>> > >>> Thanks again.
>>>>> > >>>
>>>>> > >>> Best Regards,
>>>>> > >>> Sonal
>>>>> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> > >>> Nube Technologies <http://www.nubetech.co>
>>>>> > >>>
>>>>> > >>> <http://in.linkedin.com/in/sonalgoyal>
>>>>> > >>>
>>>>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>> > >>> <[email protected]> wrote:
>>>>> > >>>>
>>>>> > >>>> Sonal,
>>>>> > >>>>
>>>>> > >>>> Just because you have an M/R job doesn't mean that you need to
>>>>> > >>>>reduce anything. You can have a job that contains only a mapper.
>>>>> > >>>> Or your job runner can have a series of map jobs in serial.
>>>>> > >>>>
>>>>> > >>>> Most if not all of the map/reduce jobs where we pull data from
>>>>> > >>>>HBase don't require a reducer.
>>>>> > >>>>
>>>>> > >>>> To give you a simple example... if I want to determine the
>>>>> > >>>>table schema where I am storing some sort of structured data...
>>>>> > >>>> I just write an M/R job which opens a table and scans it,
>>>>> > >>>>counting the occurrence of each column name via dynamic
>>>>> > >>>>counters.
>>>>> > >>>>
>>>>> > >>>> There is no need for a reducer.
>>>>> > >>>>
>>>>> > >>>> Does that help?
>>>>> > >>>>
>>>>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> > >>>>>JDBCReducer ...
>>>>> > >>>>> From: [email protected]
>>>>> > >>>>> To: [email protected]
>>>>> > >>>>>
>>>>> > >>>>> Michel,
>>>>> > >>>>>
>>>>> > >>>>> Sorry, can you please help me understand what you mean when
>>>>> > >>>>>you say that when dealing with HBase, you really don't want to
>>>>> > >>>>>use a reducer? Here, HBase is being used as the input to the
>>>>> > >>>>>MR job.
>>>>> > >>>>>
>>>>> > >>>>> Thanks
>>>>> > >>>>> Sonal
>>>>> > >>>>>
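Michael's schema-discovery example above could be sketched as a map-only job whose TableMapper bumps one dynamic counter per column name; the framework aggregates counters across tasks, so no reducer is needed. The counter group name ("columns") is arbitrary.

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class SchemaCounterMapper extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // Dynamic counter: created on first use, one per distinct column name.
      context.getCounter("columns", column).increment(1);
    }
  }
}

After the job completes, the per-column totals can be read from job.getCounters() in the driver.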
>>>>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>> > >>>>><[email protected]> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> I think you need to get a little bit more information.
>>>>> > >>>>>> Reducers are expensive.
>>>>> > >>>>>> When Thomas says that he is aggregating data, what exactly
>>>>> > >>>>>>does he mean?
>>>>> > >>>>>> When dealing with HBase, you really don't want to use a
>>>>> > >>>>>>reducer.
>>>>> > >>>>>>
>>>>> > >>>>>> You may want to run two map jobs, and it could be that just
>>>>> > >>>>>>dumping the output via JDBC makes the most sense.
>>>>> > >>>>>>
>>>>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>>>>> > >>>>>>providing enough information, so the recommendation could be
>>>>> > >>>>>>wrong...
>>>>> > >>>>>>
>>>>> > >>>>>> Sent from a remote device. Please excuse any typos...
>>>>> > >>>>>>
>>>>> > >>>>>> Mike Segel
>>>>> > >>>>>>
>>>>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal
>>>>> > >>>>>><[email protected]> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> There is a DBOutputFormat class in the
>>>>> > >>>>>>>org.apache.hadoop.mapreduce.lib.db package; you could use
>>>>> > >>>>>>>that. Or you could write to HDFS and then use something like
>>>>> > >>>>>>>HIHO[1] to export to the db. I have been working extensively
>>>>> > >>>>>>>in this area; you can write to me directly if you need any
>>>>> > >>>>>>>help.
>>>>> > >>>>>>>
>>>>> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>> > >>>>>>>
>>>>> > >>>>>>> Best Regards,
>>>>> > >>>>>>> Sonal
>>>>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>> > >>>>>>>
>>>>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>> > >>>>>>> [email protected]> wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>> Hello,
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> We are writing an MR job to process HBase data and store
>>>>> > >>>>>>>>aggregated data in Oracle. How would you do that in an MR
>>>>> > >>>>>>>>job?
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Currently, for test purposes, we write the result into an
>>>>> > >>>>>>>>HBase table again by using a TableReducer. Is there
>>>>> > >>>>>>>>something like an OracleReducer, RelationalReducer,
>>>>> > >>>>>>>>JDBCReducer, or whatever? Or should one simply use plain
>>>>> > >>>>>>>>JDBC code in the reduce step?
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thanks!
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thomas
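To close the loop on Thomas's original question, the "plain JDBC in the reduce step" approach Doug points at (the read-summary pattern) might look roughly like this sketch. The Oracle connection string, credentials, and the event_totals table are hypothetical, and error handling is kept minimal.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Hypothetical connection settings; in practice read them from the job conf.
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "pass");
      conn.setAutoCommit(false);
      insert = conn.prepareStatement(
          "INSERT INTO event_totals (event_type, total) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException {
    // Sum the per-mapper partial amounts for this key, then write one row.
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, total);
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      conn.commit();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}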
