I'll add this to the book in the MR section.
On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:

> I was in the middle of responding to Mike's email when yours arrived, so
> I'll respond to both.
>
> I think the temp-table idea is interesting. The caution is that a default
> temp-table creation will be hosted on a single RS and thus be a bottleneck
> for aggregation. So I would imagine that you would need to tune the
> temp-table for the job and pre-create regions.
>
> Doug
>
> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>
>> I am trying to do something similar with HBase Map/Reduce.
>>
>> I have event ids and amounts stored in HBase in the following format:
>> prefix-event_id_type-timestamp-event_id as the row key and amount as
>> the value. I want to be able to aggregate the amounts based on the
>> event id type, and for this I am using a reducer. I basically reduce on
>> the event id type from the incoming row in the map phase, and perform
>> the aggregation in the reducer on the amounts for the event types. Then
>> I write the results back into HBase.
>>
>> I hadn't thought about writing values directly into a temp HBase table
>> in the map phase, as suggested by Mike.
>>
>> For this case, each mapper can declare its own mapperId_event_type row
>> with totalAmount and, for each row it receives, do a get, add the
>> current amount, and then a put. We are basically then doing a
>> get/add/put for every row that a mapper receives. Is this any more
>> efficient when compared to the overhead of sorting/partitioning for a
>> reducer?
>>
>> At the end of the map phase, aggregating the output of all the
>> mappers should be trivial.
>>
>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>> <[email protected]> wrote:
>>>
>>> Doug and company...
>>>
>>> Look, I'm not saying that there aren't M/R jobs where you might need
>>> reducers when working with HBase.
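Sam's per-mapper get/add/put idea is essentially in-mapper combining: hold running totals per event type in memory and emit one record per distinct key when the mapper finishes, instead of one record per input row. A minimal Python sketch of the idea (toy data, no HBase; the names are illustrative, not from the thread):

```python
# In-mapper combining: aggregate per event type inside the mapper,
# emitting one record per distinct key instead of one per input row.
from collections import defaultdict

rows = [("clickA", 10), ("clickB", 5), ("clickA", 7), ("clickB", 3), ("clickA", 1)]

def naive_map(rows):
    # One intermediate record per input row -> all of it gets shuffled/sorted.
    return [(event_type, amount) for event_type, amount in rows]

def combining_map(rows):
    # Accumulate totals in memory; emit one record per distinct key at the end.
    totals = defaultdict(int)
    for event_type, amount in rows:
        totals[event_type] += amount
    return list(totals.items())

print(len(naive_map(rows)))       # 5 intermediate records
print(dict(combining_map(rows)))  # {'clickA': 18, 'clickB': 8}
```

Whether this beats the reducer's shuffle depends on key cardinality: it wins when there are few distinct event types per mapper, since the per-mapper state stays small and the intermediate output shrinks.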
>>> What I am saying is that if we look at what you're attempting to do,
>>> you may end up getting better performance if you created a temp table
>>> in HBase and let HBase do some of the heavy lifting where you are
>>> currently using a reducer. From the jobs that we run, when we looked
>>> at what we were doing, there wasn't any need for a reducer. I suspect
>>> that's true of other jobs.
>>>
>>> Remember that HBase is much more than just an HFile format to persist
>>> stuff.
>>>
>>> Even looking at Sonal's example... you have other ways of doing the
>>> record counts, like dynamic counters or using a temp table in HBase,
>>> which I believe will give you better performance numbers, although I
>>> haven't benchmarked either against a reducer.
>>>
>>> Does that make sense?
>>>
>>> -Mike
>>>
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>
>>>> Chris, agreed... There are situations where reducers aren't required,
>>>> and situations where they are useful. We have both kinds of jobs.
>>>>
>>>> For others following the thread, I updated the book recently with
>>>> more MR examples (read-only, read-write, read-summary):
>>>>
>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>
>>>> As to the question that started this thread...
>>>>
>>>> re: "Store aggregated data in Oracle."
>>>>
>>>> To me, that sounds like the "read-summary" example with JDBC-Oracle
>>>> in the reduce step.
>>>>
>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>
>>>>> If only I could make NY in Nov :)
>>>>>
>>>>> We extract large numbers of DNA sequence reads from HBase, run them
>>>>> through M/R pipelines to analyze and aggregate, and then we load the
>>>>> results back in.
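Doug's caution about the temp-table approach is region hotspotting: a freshly created table is a single region on a single region server, so every mapper's writes funnel through one node. Pre-creating regions with evenly spaced split keys spreads that load. A rough sketch of computing such splits, assuming (hypothetically) that the temp table's row keys begin with an 8-hex-digit hash prefix:

```python
# Hypothetical helper: evenly spaced split keys over an 8-hex-digit
# row-key prefix, for pre-creating num_regions regions on a temp table
# so aggregation writes don't all land on one region server.
def split_keys(num_regions, key_space=0x100000000):
    step = key_space // num_regions
    # num_regions regions need num_regions - 1 split points.
    return ["%08x" % (i * step) for i in range(1, num_regions)]

print(split_keys(4))  # ['40000000', '80000000', 'c0000000']
```

In HBase itself these keys would be passed (as bytes) to the table-creation call that accepts split keys, so the regions exist before the job's first write.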
>>>>> Definitely specialized usage, but I could see other perfectly valid
>>>>> uses for reducers with HBase.
>>>>>
>>>>> -chris
>>>>>
>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>
>>>>>> Sonal,
>>>>>>
>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>
>>>>>> So again, why do you need a reducer? ;-)
>>>>>>
>>>>>> Using your example...
>>>>>> "Again, there will be many cases where one may want a reducer, say
>>>>>> trying to count the occurrence of words in a particular column."
>>>>>>
>>>>>> You can do this one of two ways...
>>>>>> 1) Dynamic counters in Hadoop.
>>>>>> 2) Use a temp table and auto-increment the value in a column which
>>>>>> contains the word count. (Fat row where rowkey is doc_id and column
>>>>>> is word, or rowkey is doc_id|word.)
>>>>>>
>>>>>> I'm sorry, but if you go through all of your examples of why you
>>>>>> would want to use a reducer, you end up finding out that writing to
>>>>>> an HBase table would be faster than a reduce job. (Again, we haven't
>>>>>> done an exhaustive search, but in all of the HBase jobs we've run...
>>>>>> no reducers were necessary.)
>>>>>>
>>>>>> The point I'm trying to make is that you want to avoid using a
>>>>>> reducer whenever possible, and if you think about your problem...
>>>>>> you can probably come up with a solution that avoids the reducer.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>> PS. I haven't looked at *all* of the potential use cases of HBase,
>>>>>> which is why I don't want to say you'll never need a reducer. I will
>>>>>> say that based on what we've done at my client's site, we try very
>>>>>> hard to avoid reducers.
>>>>>> [Note, I'm sure I'm going to get hammered on this when I head to
>>>>>> NY in Nov.
>>>>>> :-) ]
>>>>>>
>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Yes, thanks, I understand the fact that reducers can be expensive,
>>>>>>> with all the shuffling and the sorting, and you may not need them
>>>>>>> always. At the same time, there are many cases where reducers are
>>>>>>> useful, like secondary sorting. In many cases, one can have
>>>>>>> multiple map phases and not have a reduce phase at all. Again,
>>>>>>> there will be many cases where one may want a reducer, say trying
>>>>>>> to count the occurrence of words in a particular column.
>>>>>>>
>>>>>>> With this thought chain, I do not feel ready to say that when
>>>>>>> dealing with HBase, I really don't want to use a reducer. Please
>>>>>>> correct me if I am wrong.
>>>>>>>
>>>>>>> Thanks again.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sonal
>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>
>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Sonal,
>>>>>>>>
>>>>>>>> Just because you have an M/R job doesn't mean that you need to
>>>>>>>> reduce anything. You can have a job that contains only a mapper.
>>>>>>>> Or your job runner can have a series of map jobs in serial.
>>>>>>>>
>>>>>>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>>>>>> HBase don't require a reducer.
>>>>>>>>
>>>>>>>> To give you a simple example...
>>>>>>>> if I want to determine the table schema where I am storing some
>>>>>>>> sort of structured data... I just write an M/R job which opens a
>>>>>>>> table and scans it, counting the occurrence of each column name
>>>>>>>> via dynamic counters.
>>>>>>>>
>>>>>>>> There is no need for a reducer.
>>>>>>>>
>>>>>>>> Does that help?
>>>>>>>>
>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>> From: [email protected]
>>>>>>>>> To: [email protected]
>>>>>>>>>
>>>>>>>>> Michel,
>>>>>>>>>
>>>>>>>>> Sorry, can you please help me understand what you mean when you
>>>>>>>>> say that when dealing with HBase, you really don't want to use a
>>>>>>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Sonal
>>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>> Reducers are expensive.
>>>>>>>>>> When Thomas says that he is aggregating data, what exactly does
>>>>>>>>>> he mean? When dealing with HBase, you really don't want to use
>>>>>>>>>> a reducer.
>>>>>>>>>>
>>>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>>>
>>>>>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>>>>>> providing enough information, so the recommendation could be
>>>>>>>>>> wrong...
>>>>>>>>>>
>>>>>>>>>> Sent from a remote device. Please excuse any typos...
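Mike's schema-discovery job is map-only: each mapper scans its slice of the table and bumps a counter per column qualifier it sees, and the framework aggregates the counters across all tasks, so no reducer is needed. A toy Python simulation of just the counting logic (the dicts stand in for HBase rows; in the real job these would be Hadoop dynamic counters incremented from the mapper):

```python
# Simulate the map-only schema-discovery scan: bump one "dynamic
# counter" per column qualifier seen in each row; no reduce phase.
from collections import Counter

table = [
    {"name": "a", "amount": 1},
    {"name": "b", "city": "NYC"},
    {"name": "c", "amount": 2, "city": "SFO"},
]

counters = Counter()
for row in table:            # the mapper sees one row at a time
    for column in row:       # one counter increment per column name
        counters[column] += 1

print(dict(counters))  # {'name': 3, 'amount': 2, 'city': 2}
```

The key property is that counter increments are commutative, so per-task tallies can be summed by the framework without any shuffle or sort.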
>>>>>>>>>>
>>>>>>>>>> Mike Segel
>>>>>>>>>>
>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use
>>>>>>>>>>> that. Or you could write to HDFS and then use something like
>>>>>>>>>>> HIHO [1] to export to the db. I have been working extensively
>>>>>>>>>>> in this area; you can write to me directly if you need any
>>>>>>>>>>> help.
>>>>>>>>>>>
>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sonal
>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>
>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> We are writing an MR job to process HBase data and store
>>>>>>>>>>>> aggregated data in Oracle. How would you do that in an MR job?
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, for test purposes, we write the result into an
>>>>>>>>>>>> HBase table again by using a TableReducer. Is there something
>>>>>>>>>>>> like an OracleReducer, RelationalReducer, JDBCReducer or
>>>>>>>>>>>> whatever? Or should one simply use plain JDBC code in the
>>>>>>>>>>>> reduce step?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Thomas
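For Thomas's original question, the thread's answer is either DBOutputFormat or plain JDBC code in the reduce step; in both cases the practical pattern is the same: buffer the aggregated rows and write them to the database in batches rather than one round trip per record. A Python sketch of that batching pattern, using the standard library's sqlite3 purely as a stand-in for an Oracle/JDBC connection:

```python
# Stand-in for "plain JDBC code in the reduce step": collect the
# aggregated (key, total) pairs and flush them in one batched insert.
# sqlite3 substitutes for the Oracle connection here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (event_type TEXT PRIMARY KEY, amount INTEGER)")

# What the reduce (or map-side aggregation) step produced:
aggregates = [("clickA", 18), ("clickB", 8)]

# One batched statement instead of one round trip per record.
conn.executemany("INSERT INTO totals VALUES (?, ?)", aggregates)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM totals").fetchone()[0])  # 2
```

With many reduce tasks writing concurrently, the batch size and commit frequency also bound how hard the job hammers the target database, which is usually the real constraint in a read-summary job.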
