Map-task heap size would definitely be a concern, but since the HashMap would only contain aggregations, it should hold far fewer entries than the number of rows passed into the mapper.
At least that's how I'd use it.
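A minimal sketch of the idea (all table, family, and qualifier names are hypothetical, and it uses incrementColumnValue in cleanup rather than the checkAndPut loop described above, so that concurrent map tasks can safely add into the same summary cells):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  // One entry per distinct aggregate key -- far fewer than the input rows.
  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) {
    String aggKey = extractEventType(row.get());   // hypothetical key parsing
    Long current = counts.get(aggKey);
    counts.put(aggKey, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Single pass over the map at the end of the task: one write per
    // aggregate, not one per input row.
    HTable summary = new HTable(context.getConfiguration(), "summary");
    try {
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        summary.incrementColumnValue(Bytes.toBytes(e.getKey()),
            Bytes.toBytes("agg"), Bytes.toBytes("count"), e.getValue());
      }
    } finally {
      summary.close();
    }
  }

  private String extractEventType(byte[] rowKey) {
    // Hypothetical: prefix-event_id_type-timestamp-event_id row keys.
    return Bytes.toString(rowKey).split("-")[1];
  }
}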
On 9/16/11 9:39 PM, "Sam Seigal" <[email protected]> wrote:

> Aren't there memory considerations with this approach? I would assume the HashMap can get pretty big if it retains in memory every record that passes through. (Apologies if I am being ignorant with my limited knowledge of Hadoop's internal workings.)
>
> On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil <[email protected]> wrote:
>>
>> However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and the mapper then made a single pass over this map during the cleanup method and did the checkAndPuts, the writes would only happen once per map-task, not on a per-row basis (which would be really expensive).
>>
>> A single region on a single RS could handle that no problem.
>>
>> On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> I see what you are saying about the temp table being hosted at a single region server - especially for a limited set of rows that just care about the aggregations, but receive a lot of traffic. I wonder whether this would also be the case if I used the source table to maintain these temporary records, and did not create a temp table on the fly ...
>>>
>>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>>>>
>>>> I'll add this to the book in the MR section.
>>>>
>>>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>>>
>>>>> I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.
>>>>>
>>>>> I think the temp-table idea is interesting. The caution is that a default temp-table creation will be hosted on a single RS and thus be a bottleneck for aggregation. So I would imagine that you would need to tune the temp-table for the job and pre-create regions.
>>>>>
>>>>> Doug
>>>>>
>>>>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>>>
>>>>>> I am trying to do something similar with HBase Map/Reduce.
>>>>>>
>>>>>> I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key and amount as the value.
>>>>>> I want to be able to aggregate the amounts based on the event id type, and for this I am using a reducer. I basically key on the event_id_type from the incoming row in the map phase, and perform the aggregation in the reducer on the amounts for each event type. Then I write the results back into HBase.
>>>>>>
>>>>>> I hadn't thought about writing values directly into a temp HBase table in the map phase, as suggested by Mike.
>>>>>>
>>>>>> For this case, each mapper can declare its own mapperId_event_type row with a totalAmount, and for each row it receives, do a get, add the current amount, and then a put. We are then basically doing a get/add/put for every row that a mapper receives. Is this any more efficient when compared to the overhead of sorting/partitioning for a reducer?
>>>>>>
>>>>>> At the end of the mapping phase, aggregating the output of all the mappers should be trivial.
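For reference, a minimal sketch of the reducer-based approach Sam describes (family/qualifier names hypothetical; it assumes the map phase emits the event type as a Text key with the amount as a LongWritable):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class AmountSumReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text eventType, Iterable<LongWritable> amounts, Context context)
      throws IOException, InterruptedException {
    // All amounts for one event_id_type arrive here after the shuffle/sort.
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    byte[] row = Bytes.toBytes(eventType.toString());
    Put put = new Put(row);
    put.add(Bytes.toBytes("agg"), Bytes.toBytes("total"), Bytes.toBytes(total));
    context.write(new ImmutableBytesWritable(row), put);
  }
}

The shuffle/sort that feeds this reducer is exactly the overhead the mapper-side variants in this thread avoid.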
>>>>>>
>>>>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <[email protected]> wrote:
>>>>>>>
>>>>>>> Doug and company...
>>>>>>>
>>>>>>> Look, I'm not saying that there aren't M/R jobs where you might need reducers when working with HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you created a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that's true of other jobs.
>>>>>>>
>>>>>>> Remember that HBase is much more than just an HFile format to persist stuff.
>>>>>>>
>>>>>>> Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.
>>>>>>>
>>>>>>> Does that make sense?
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>>> From: [email protected]
>>>>>>>> To: [email protected]
>>>>>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>
>>>>>>>> Chris, agreed... There are situations where reducers aren't required, and situations where they are useful. We have both kinds of jobs.
>>>>>>>>
>>>>>>>> For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
>>>>>>>>
>>>>>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>>>>>
>>>>>>>> As to the question that started this thread...
>>>>>>>>
>>>>>>>> re: "Store aggregated data in Oracle."
>>>>>>>>
>>>>>>>> To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step.
>>>>>>>>
>>>>>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> If only I could make NY in Nov :)
>>>>>>>>>
>>>>>>>>> We extract large numbers of DNA sequence reads from HBase, run them through M/R pipelines to analyze and aggregate, and then load the results back in. Definitely specialized usage, but I could see other perfectly valid uses for reducers with HBase.
>>>>>>>>>
>>>>>>>>> -chris
>>>>>>>>>
>>>>>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>>>>>
>>>>>>>>>> Sonal,
>>>>>>>>>>
>>>>>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>>>>>
>>>>>>>>>> So again, why do you need a reducer? ;-)
>>>>>>>>>>
>>>>>>>>>> Using your example... "Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column."
>>>>>>>>>>
>>>>>>>>>> You can do this one of two ways...
>>>>>>>>>> 1) Dynamic counters in Hadoop.
>>>>>>>>>> 2) Use a temp table and auto-increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word.)
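A minimal sketch of option 2, assuming a hypothetical pre-created "wordcount" temp table and the doc_id|word rowkey variant (option 1, dynamic counters, appears in the schema-survey sketch further down):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class WordCountMapper extends TableMapper<NullWritable, NullWritable> {
  private HTable wordCounts;

  @Override
  protected void setup(Context context) throws IOException {
    wordCounts = new HTable(context.getConfiguration(), "wordcount");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) throws IOException {
    String docId = Bytes.toString(row.get());
    // Hypothetical column holding the document text.
    byte[] text = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
    if (text == null) return;
    for (String word : Bytes.toString(text).split("\\s+")) {
      // Atomic increment: no get/add/put round trip, no shuffle, no reducer.
      wordCounts.incrementColumnValue(Bytes.toBytes(docId + "|" + word),
          Bytes.toBytes("wc"), Bytes.toBytes("count"), 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    wordCounts.close();
  }
}

Buffering the counts in a HashMap per task, as Doug suggests at the top of the thread, would cut the number of increment RPCs further.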
>>>>>>>>>>
>>>>>>>>>> I'm sorry, but if you go through all of your examples of why you would want to use a reducer, you end up finding that writing to an HBase table would be faster than a reduce job. (Again, we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.)
>>>>>>>>>>
>>>>>>>>>> The point I'm trying to make is that you want to avoid using a reducer whenever possible, and if you think about your problem... you can probably come up with a solution that avoids the reducer...
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>> PS. I haven't looked at *all* of the potential use cases of HBase, which is why I don't want to say you'll never need a reducer. I will say that, based on what we've done at my client's site, we try very hard to avoid reducers.
>>>>>>>>>> [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ]
>>>>>>>>>>
>>>>>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>> From: [email protected]
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> Yes, thanks, I understand that reducers can be expensive with all the shuffling and the sorting, and that you may not always need them. At the same time, there are many cases where reducers are useful, like secondary sorting. In many cases, one can have multiple map phases and no reduce phase at all. Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column.
>>>>>>>>>>>
>>>>>>>>>>> With this thought chain, I do not feel ready to say that when dealing with HBase, I really don't want to use a reducer. Please correct me if I am wrong.
>>>>>>>>>>>
>>>>>>>>>>> Thanks again.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sonal
>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sonal,
>>>>>>>>>>>>
>>>>>>>>>>>> Just because you have an M/R job doesn't mean that you need to reduce anything. You can have a job that contains only a mapper. Or your job runner can have a series of map jobs in serial.
>>>>>>>>>>>>
>>>>>>>>>>>> Most if not all of the map/reduce jobs where we pull data from HBase don't require a reducer.
>>>>>>>>>>>>
>>>>>>>>>>>> To give you a simple example... if I want to determine the table schema where I am storing some sort of structured data... I just write an M/R job which opens a table and scans it, counting the occurrence of each column name via dynamic counters.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no need for a reducer.
>>>>>>>>>>>>
>>>>>>>>>>>> Does that help?
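A minimal sketch of that schema-survey job as a map-only job with dynamic counters (table and job names hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SchemaSurvey {

  static class SchemaMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      for (KeyValue kv : value.raw()) {
        // One dynamic counter per column name, created on first use.
        String column = Bytes.toString(kv.getFamily()) + ":" + Bytes.toString(kv.getQualifier());
        context.getCounter("schema", column).increment(1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "schema-survey");
    job.setJarByClass(SchemaSurvey.class);
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
        SchemaMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);   // map-only: no reducer, no shuffle
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After completion, the column names and their occurrence counts can be read straight from the job's counters.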
>>>>>>>>>>>>
>>>>>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>>> From: [email protected]
>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Michel,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, can you please help me understand what you mean when you say that when dealing with HBase, you really don't want to use a reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>>>>>> Reducers are expensive.
>>>>>>>>>>>>>> When Thomas says that he is aggregating data, what exactly does he mean?
>>>>>>>>>>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You may want to run two map jobs, and it could be that just dumping the output via JDBC makes the most sense.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are starting to see a lot of questions where the OP isn't providing enough information, so the recommendation could be wrong...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db package; you could use that. Or you could write to HDFS and then use something like HIHO [1] to export to the db. I have been working extensively in this area; you can write to me directly if you need any help.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
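A minimal sketch of the DBOutputFormat route Sonal mentions, assuming a hypothetical Oracle table AGG(EVENT_TYPE, TOTAL) and the Oracle thin driver on the task classpath; the reducer would then emit an AggRecord as its output key (DBOutputFormat writes its keys), with a NullWritable value:

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleSink {

  // One database row bound per record written by the reducer.
  public static class AggRecord implements DBWritable {
    private String eventType;
    private long total;

    public AggRecord(String eventType, long total) {
      this.eventType = eventType;
      this.total = total;
    }

    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, eventType);
      stmt.setLong(2, total);
    }

    public void readFields(ResultSet rs) throws SQLException {
      eventType = rs.getString(1);
      total = rs.getLong(2);
    }
  }

  public static void configure(Job job) throws IOException {
    // Hypothetical connection details.
    DBConfiguration.configureDB(job.getConfiguration(),
        "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
    DBOutputFormat.setOutput(job, "AGG", "EVENT_TYPE", "TOTAL");
    job.setOutputFormatClass(DBOutputFormat.class);
  }
}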
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are writing an MR job to process HBase data and store aggregated data in Oracle. How would you do that in an MR job?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently, for test purposes, we write the result into an HBase table again by using a TableReducer. Is there something like an OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should one simply use plain JDBC code in the reduce step?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thomas
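And a minimal sketch of the plain-JDBC alternative Thomas asks about: one connection per reduce task, opened in setup() and flushed once in cleanup() (connection details and table layout hypothetical):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer extends Reducer<Text, LongWritable, NullWritable, NullWritable> {
  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
      conn.setAutoCommit(false);
      insert = conn.prepareStatement("INSERT INTO AGG (EVENT_TYPE, TOTAL) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text eventType, Iterable<LongWritable> amounts, Context context)
      throws IOException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    try {
      insert.setString(1, eventType.toString());
      insert.setLong(2, total);
      insert.addBatch();   // batch the rows; flush once per task in cleanup
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      insert.executeBatch();
      conn.commit();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}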
