I am trying to do something similar with HBase Map/Reduce.

I have event ids and amounts stored in HBase in the following format:
prefix-event_id_type-timestamp-event_id as the row key and the amount as
the value.
I want to be able to aggregate the amounts based on the event id type,
and for this I am using a reducer. In the map phase I emit the
event_id_type parsed from the incoming row key, and the reducer then
aggregates the amounts for each event type. Then I write the results
back into HBase.
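
For reference, the current job looks roughly like the sketch below. The
"d", "amount" and "total" column names are placeholders, not the real
schema, and error handling is omitted:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Map phase: parse the event_id_type out of the row key and emit
// (event_id_type, amount).
public class EventTypeMapper extends TableMapper<Text, LongWritable> {
  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context ctx)
      throws IOException, InterruptedException {
    // row key: prefix-event_id_type-timestamp-event_id
    String eventType = Bytes.toString(row.get()).split("-")[1];
    long amount = Bytes.toLong(
        value.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount")));
    ctx.write(new Text(eventType), new LongWritable(amount));
  }
}

// Reduce phase: sum the amounts per event type and write the total back.
public class EventTypeReducer
    extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text eventType, Iterable<LongWritable> amounts,
      Context ctx) throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable a : amounts) {
      total += a.get();
    }
    Put put = new Put(Bytes.toBytes(eventType.toString()));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("total"),
        Bytes.toBytes(total));
    ctx.write(null, put);  // TableOutputFormat takes the Put, key unused
  }
}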

I hadn't thought about writing values directly into a temp HBase table
in the map phase, as Mike suggests.

For this case, each mapper can maintain its own mapperId_event_type row
with a totalAmount, and for each row it receives, do a get, add the
current amount, and then a put. We are then basically doing a
get/add/put for every row that a mapper receives. Is this any more
efficient than the sorting/partitioning overhead of a reducer?
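
If I went that route, the mapper might look like the sketch below; the
"event_totals" temp table and the "d" family/qualifier names are
placeholders. (It also looks like HTable.incrementColumnValue would
collapse the get/add/put into a single call.)

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only job: each mapper keeps its own per-event-type totals in a
// temp table; nothing is emitted to the framework.
public class MapSideTotalsMapper
    extends TableMapper<NullWritable, NullWritable> {
  private HTable tempTable;
  private String mapperId;

  @Override
  protected void setup(Context ctx) throws IOException {
    tempTable = new HTable(ctx.getConfiguration(), "event_totals");
    mapperId = ctx.getTaskAttemptID().getTaskID().toString();
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context ctx)
      throws IOException {
    String eventType = Bytes.toString(row.get()).split("-")[1];
    long amount = Bytes.toLong(
        value.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount")));
    // atomic increment: one RPC instead of a separate get, add and put
    tempTable.incrementColumnValue(
        Bytes.toBytes(mapperId + "_" + eventType),
        Bytes.toBytes("d"), Bytes.toBytes("total"), amount);
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    tempTable.close();
  }
}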

At the end of the mapping phase, aggregating the output of all the
mappers should be trivial.



On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
<[email protected]> wrote:
>
> Doug and company...
>
> Look, I'm not saying that there aren't m/r jobs where you might need reducers 
> when working w HBase. What I am saying is that if we look at what you're 
> attempting to do, you may end up getting better performance if you created a 
> temp table in HBase and let HBase do some of the heavy lifting where you are 
> currently using a reducer. From the jobs that we run, when we looked at what 
> we were doing, there wasn't any need for a reducer. I suspect that it's true 
> of other jobs.
>
> Remember that HBase is much more than just an HFile format to persist stuff.
>
> Even looking at Sonal's example... you have other ways of doing the record 
> counts like dynamic counters or using a temp table in HBase which I believe 
> will give you better performance numbers, although I haven't benchmarked 
> either against a reducer.
>
> Does that make sense?
>
> -Mike
>
>
> > From: [email protected]
> > To: [email protected]
> > Date: Fri, 16 Sep 2011 15:41:44 -0400
> > Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >
> >
> > Chris, agreed... There are times when reducers aren't required, and
> > then situations where they are useful.  We have both kinds of jobs.
> >
> > For others following the thread, I updated the book recently with more MR
> > examples (read-only, read-write, read-summary)
> >
> > http://hbase.apache.org/book.html#mapreduce.example
> >
> >
> > As to the question that started this thread...
> >
> >
> > re:  "Store aggregated data in Oracle. "
> >
> > To me, that sounds like the "read-summary" example with JDBC-Oracle in
> > the reduce step.
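> >
> > i.e., open the connection in setup() and write the summarized rows with
> > plain JDBC, roughly like the sketch below (the table name and the
> > "summary.jdbc.url" property are invented for illustration):
> >
> > import java.io.IOException;
> > import java.sql.Connection;
> > import java.sql.DriverManager;
> > import java.sql.PreparedStatement;
> > import java.sql.SQLException;
> >
> > import org.apache.hadoop.io.LongWritable;
> > import org.apache.hadoop.io.NullWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class OracleSummaryReducer
> >     extends Reducer<Text, LongWritable, NullWritable, NullWritable> {
> >   private Connection conn;
> >   private PreparedStatement insert;
> >
> >   @Override
> >   protected void setup(Context ctx) throws IOException {
> >     try {
> >       conn = DriverManager.getConnection(
> >           ctx.getConfiguration().get("summary.jdbc.url"));
> >       insert = conn.prepareStatement(
> >           "INSERT INTO event_summary (event_type, total) VALUES (?, ?)");
> >     } catch (SQLException e) { throw new IOException(e); }
> >   }
> >
> >   @Override
> >   protected void reduce(Text key, Iterable<LongWritable> values,
> >       Context ctx) throws IOException {
> >     long total = 0;
> >     for (LongWritable v : values) total += v.get();
> >     try {
> >       // one summarized row per key; batching would cut round trips
> >       insert.setString(1, key.toString());
> >       insert.setLong(2, total);
> >       insert.executeUpdate();
> >     } catch (SQLException e) { throw new IOException(e); }
> >   }
> >
> >   @Override
> >   protected void cleanup(Context ctx) throws IOException {
> >     try { insert.close(); conn.close(); }
> >     catch (SQLException e) { throw new IOException(e); }
> >   }
> > }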
> >
> >
> >
> >
> >
> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
> >
> > >If only I could make NY in Nov :)
> > >
> > >We extract out large numbers of DNA sequence reads from HBase, run them
> > >through M/R pipelines to analyze and aggregate and then we load the
> > >results back in. Definitely specialized usage, but I could see other
> > >perfectly valid uses for reducers with HBase.
> > >
> > >-chris
> > >
> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
> > >
> > >>
> > >> Sonal,
> > >>
> > >> You do realize that HBase is a "database", right? ;-)
> > >>
> > >> So again, why do you need a reducer?  ;-)
> > >>
> > >> Using your example...
> > >> "Again, there will be many cases where one may want a reducer, say
> > >>trying to count the occurrence of words in a particular column."
> > >>
> > >> You can do this one of two ways...
> > >> 1) Dynamic Counters in Hadoop.
> > >> 2) Use a temp table and auto increment the value in a column which
> > >>contains the word count.  (Fat row where rowkey is doc_id and column is
> > >>word or rowkey is doc_id|word)
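> > >>
> > >> For 1), the mapper just bumps a dynamic counter per word as it scans
> > >>(sketch):
> > >>
> > >>   // inside map(), for each word parsed out of the column value:
> > >>   context.getCounter("WordCounts", word).increment(1);
> > >>   // the framework sums counters across all mappers, so the totals
> > >>   // are on the Job object when it finishes -- no reducer needed.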
> > >>
> > >> I'm sorry but if you go through all of your examples of why you would
> > >>want to use a reducer, you end up finding out that writing to an HBase
> > >>table would be faster than a reduce job.
> > >> (Again we haven't done an exhaustive search, but in all of the HBase
> > >>jobs we've run... no reducers were necessary.)
> > >>
> > >> The point I'm trying to make is that you want to avoid using a reducer
> > >>whenever possible and if you think about your problem... you can
> > >>probably come up with a solution that avoids the reducer...
> > >>
> > >>
> > >> HTH
> > >>
> > >> -Mike
> > >> PS. I haven't looked at *all* of the potential use cases of HBase which
> > >>is why I don't want to say you'll never need a reducer. I will say that
> > >>based on what we've done at my client's site, we try very hard to avoid
> > >>reducers.
> > >> [Note, I'm sure I'm going to get hammered on this when I head to NY in
> > >>Nov. :-)   ]
> > >>
> > >>
> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer
> > >>>...
> > >>> From: [email protected]
> > >>> To: [email protected]
> > >>>
> > >>> Hi Michael,
> > >>>
> > >>> Yes, thanks, I understand the fact that reducers can be expensive with
> > >>>all
> > >>> the shuffling and the sorting, and you may not need them always. At
> > >>>the same
> > >>> time, there are many cases where reducers are useful, like secondary
> > >>> sorting. In many cases, one can have multiple map phases and not have a
> > >>> reduce phase at all. Again, there will be many cases where one may
> > >>>want a
> > >>> reducer, say trying to count the occurrence of words in a particular
> > >>>column.
> > >>>
> > >>>
> > >>> With this thought chain, I do not feel ready to say that when dealing
> > >>>with
> > >>> HBase, I really don't want to use a reducer. Please correct me if I am
> > >>> wrong.
> > >>>
> > >>> Thanks again.
> > >>>
> > >>> Best Regards,
> > >>> Sonal
> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> > >>> Nube Technologies <http://www.nubetech.co>
> > >>>
> > >>> <http://in.linkedin.com/in/sonalgoyal>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
> > >>> <[email protected]>wrote:
> > >>>
> > >>>>
> > >>>> Sonal,
> > >>>>
> > >>>> Just because you have a m/r job doesn't mean that you need to reduce
> > >>>> anything. You can have a job that contains only a mapper.
> > >>>> Or your job runner can have a series of map jobs in serial.
> > >>>>
> > >>>> Most if not all of the map/reduce jobs where we pull data from HBase,
> > >>>>don't
> > >>>> require a reducer.
> > >>>>
> > >>>> To give you a simple example... if I want to determine the table
> > >>>>schema
> > >>>> where I am storing some sort of structured data...
> > >>>> I just write a m/r job which opens a table, scans the table counting
> > >>>>the
> > >>>> occurrence of each column name via dynamic counters.
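> > >>>>
> > >>>> In code, the map() body is only a few lines (sketch; value is the
> > >>>>Result handed to the mapper):
> > >>>>
> > >>>>   for (KeyValue kv : value.raw()) {
> > >>>>     String col = Bytes.toString(kv.getFamily()) + ":"
> > >>>>         + Bytes.toString(kv.getQualifier());
> > >>>>     context.getCounter("ColumnNames", col).increment(1);
> > >>>>   }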
> > >>>>
> > >>>> There is no need for a reducer.
> > >>>>
> > >>>> Does that help?
> > >>>>
> > >>>>
> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
> > >>>>>JDBCReducer
> > >>>> ...
> > >>>>> From: [email protected]
> > >>>>> To: [email protected]
> > >>>>>
> > >>>>> Michel,
> > >>>>>
> > >>>>> Sorry, can you please help me understand what you mean when you say
> > >>>>>that
> > >>>> when
> > >>>>> dealing with HBase, you really don't want to use a reducer? Here,
> > >>>>>HBase is
> > >>>>> being used as the input to the MR job.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Sonal
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
> > >>>>><[email protected]
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I think you need to get a little bit more information.
> > >>>>>> Reducers are expensive.
> > >>>>>> When Thomas says that he is aggregating data, what exactly does he
> > >>>> mean?
> > >>>>>> When dealing w HBase, you really don't want to use a reducer.
> > >>>>>>
> > >>>>>> You may want to run two map jobs, and it could be that just dumping
> > >>>>>>the
> > >>>>>> output via JDBC makes the most sense.
> > >>>>>>
> > >>>>>> We are starting to see a lot of questions where the OP isn't
> > >>>>>>providing
> > >>>>>> enough information, so the recommendation could be wrong...
> > >>>>>>
> > >>>>>>
> > >>>>>> Sent from a remote device. Please excuse any typos...
> > >>>>>>
> > >>>>>> Mike Segel
> > >>>>>>
> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> There is a DBOutputFormat class in the
> > >>>> org.apache.hadoop.mapreduce.lib.db
> > >>>>>>> package, you could use that. Or you could write to HDFS and
> > >>>>>>>then
> > >>>> use
> > >>>>>>> something like HIHO[1] to export to the db. I have been working
> > >>>>>> extensively
> > >>>>>>> in this area, you can write to me directly if you need any help.
> > >>>>>>>
> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
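> > >>>>>>>
> > >>>>>>> Wiring up DBOutputFormat takes just a few calls; something like
> > >>>>>>> the sketch below (the driver, URL and table/field names are
> > >>>>>>> placeholders, and the reducer's output key class must implement
> > >>>>>>> DBWritable):
> > >>>>>>>
> > >>>>>>>   // in the job driver:
> > >>>>>>>   DBConfiguration.configureDB(job.getConfiguration(),
> > >>>>>>>       "oracle.jdbc.driver.OracleDriver",
> > >>>>>>>       "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
> > >>>>>>>   DBOutputFormat.setOutput(job, "event_summary",
> > >>>>>>>       "event_type", "total");
> > >>>>>>>   job.setOutputFormatClass(DBOutputFormat.class);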
> > >>>>>>>
> > >>>>>>> Best Regards,
> > >>>>>>> Sonal
> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> > >>>>>>> Nube Technologies <http://www.nubetech.co>
> > >>>>>>>
> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Hello,
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> We are writing a MR-job to process HBase data and store aggregated
> > >>>>>>>> data in Oracle. How would you do that in a MR-job?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Currently, for test purposes, we write the result into an HBase
> > >>>>>>>> table again by using a TableReducer. Is there something like an
> > >>>> OracleReducer,
> > >>>>>>>> RelationalReducer, JDBCReducer or whatever? Or should one simply use
> > >>>>>>>> plain JDBC code in the reduce step?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thanks!
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thomas
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >
> >
