However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and the mapper made a single pass over this map in the cleanup method to do the checkAndPuts, the writes would happen only once per map task rather than once per row (which would be really expensive).
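For illustration, a minimal sketch of that pattern against the 0.90-era client API: a TableMapper that accumulates counts in a HashMap and flushes them once from cleanup(). The temp table name ("temp"), column family ("f"), qualifier ("count"), and the extractEventType() helper are all hypothetical.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("count");

  // In-memory aggregation: one entry per aggregate key, not per row.
  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) {
    String aggregate = extractEventType(row.get()); // hypothetical key-parsing helper
    Long current = counts.get(aggregate);
    counts.put(aggregate, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Single pass over the map: one write per aggregate per map task.
    HTable tempTable = new HTable(context.getConfiguration(), "temp");
    try {
      String taskId = context.getTaskAttemptID().getTaskID().toString();
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        byte[] rowKey = Bytes.toBytes(taskId + "_" + e.getKey());
        Put put = new Put(rowKey);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(e.getValue().longValue()));
        // checkAndPut with a null expected value succeeds only if the cell
        // does not exist yet, so a re-run task attempt cannot double-write.
        tempTable.checkAndPut(rowKey, FAMILY, QUALIFIER, null, put);
      }
    } finally {
      tempTable.close();
    }
  }

  private String extractEventType(byte[] rowKey) {
    // Hypothetical: pull the aggregate key out of a
    // prefix-event_id_type-timestamp-event_id row key.
    return Bytes.toString(rowKey).split("-")[1];
  }
}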
A single region on a single RS could handle that no problem.


On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:

>I see what you are saying about the temp table being hosted at a
>single region server - especially for a limited set of rows that
>just care about the aggregations, but receive a lot of traffic. I
>wonder if this will also be the case if I were to use the source table
>to maintain these temporary records, and not create a temp table on
>the fly ...
>
>On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil
><[email protected]> wrote:
>>
>> I'll add this to the book in the MR section.
>>
>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>
>>>I was in the middle of responding to Mike's email when yours arrived, so
>>>I'll respond to both.
>>>
>>>I think the temp-table idea is interesting. The caution is that a
>>>default temp-table creation will be hosted on a single RS and thus be a
>>>bottleneck for aggregation. So I would imagine that you would need to
>>>tune the temp-table for the job and pre-create regions.
>>>
>>>Doug
>>>
>>>On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>
>>>>I am trying to do something similar with HBase Map/Reduce.
>>>>
>>>>I have event ids and amounts stored in HBase in the following format:
>>>>prefix-event_id_type-timestamp-event_id as the row key, and amount as
>>>>the value.
>>>>I want to be able to aggregate the amounts based on the event id type,
>>>>and for this I am using a reducer. I basically key on the
>>>>event_id_type from the incoming row in the map phase, and perform the
>>>>aggregation in the reducer on the amounts for each event type. Then I
>>>>write the results back into HBase.
>>>>
>>>>I hadn't thought about writing values directly into a temp HBase table
>>>>in the map phase, as suggested by Mike.
>>>>
>>>>For this case, each mapper can declare its own mapperId_event_type row
>>>>with totalAmount, and for each row it receives, do a get, add the
>>>>current amount, and then a put. We are basically then doing a
>>>>get/add/put for every row that a mapper receives. Is this any more
>>>>efficient when compared to the overhead of sorting/partitioning for a
>>>>reducer?
>>>>
>>>>At the end of the map phase, aggregating the output of all the
>>>>mappers should be trivial.
>>>>
>>>>On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>>>><[email protected]> wrote:
>>>>>
>>>>> Doug and company...
>>>>>
>>>>> Look, I'm not saying that there aren't M/R jobs where you might need
>>>>>reducers when working with HBase. What I am saying is that if we look
>>>>>at what you're attempting to do, you may end up getting better
>>>>>performance if you created a temp table in HBase and let HBase do some
>>>>>of the heavy lifting where you are currently using a reducer. From the
>>>>>jobs that we run, when we looked at what we were doing, there wasn't
>>>>>any need for a reducer. I suspect that it's true of other jobs.
>>>>>
>>>>> Remember that HBase is much more than just an HFile format to persist
>>>>>stuff.
>>>>>
>>>>> Even looking at Sonal's example... you have other ways of doing the
>>>>>record counts, like dynamic counters or using a temp table in HBase,
>>>>>which I believe will give you better performance numbers, although I
>>>>>haven't benchmarked either against a reducer.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> -Mike
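Doug's caution about a default temp table being hosted on a single RS can be addressed at creation time by pre-splitting. A minimal sketch, assuming a hypothetical table "temp" with family "f" and single-letter split points; in practice the split keys should match the expected key distribution of the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTempTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("temp"); // hypothetical name
    desc.addFamily(new HColumnDescriptor("f"));

    // Pre-create regions by supplying split keys, so the temp table is
    // spread across region servers from the start instead of funneling
    // all aggregation writes through a single RS.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("d"), Bytes.toBytes("h"),
        Bytes.toBytes("m"), Bytes.toBytes("r")
    };
    admin.createTable(desc, splits);
  }
}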
>>>>>
>>>>> > From: [email protected]
>>>>> > To: [email protected]
>>>>> > Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>> > Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> >JDBCReducer ...
>>>>> >
>>>>> > Chris, agreed... There are situations where reducers aren't
>>>>> > required, and then situations where they are useful. We have both
>>>>> > kinds of jobs.
>>>>> >
>>>>> > For others following the thread, I updated the book recently with
>>>>> > more MR examples (read-only, read-write, read-summary):
>>>>> >
>>>>> > http://hbase.apache.org/book.html#mapreduce.example
>>>>> >
>>>>> > As to the question that started this thread...
>>>>> >
>>>>> > re: "Store aggregated data in Oracle."
>>>>> >
>>>>> > To me, that sounds like the "read-summary" example with JDBC-Oracle
>>>>> > in the reduce step.
>>>>> >
>>>>> > On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>> >
>>>>> > >If only I could make NY in Nov :)
>>>>> > >
>>>>> > >We extract large numbers of DNA sequence reads from HBase, run
>>>>> > >them through M/R pipelines to analyze and aggregate, and then we
>>>>> > >load the results back in. Definitely specialized usage, but I
>>>>> > >could see other perfectly valid uses for reducers with HBase.
>>>>> > >
>>>>> > >-chris
>>>>> > >
>>>>> > >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>> > >
>>>>> > >> Sonal,
>>>>> > >>
>>>>> > >> You do realize that HBase is a "database", right? ;-)
>>>>> > >>
>>>>> > >> So again, why do you need a reducer? ;-)
>>>>> > >>
>>>>> > >> Using your example...
>>>>> > >> "Again, there will be many cases where one may want a reducer,
>>>>> > >>say trying to count the occurrence of words in a particular
>>>>> > >>column."
>>>>> > >>
>>>>> > >> You can do this one of two ways...
>>>>> > >> 1) Dynamic counters in Hadoop.
>>>>> > >> 2) Use a temp table and auto-increment the value in a column
>>>>> > >>which contains the word count. (Fat row where the rowkey is
>>>>> > >>doc_id and the column is the word, or the rowkey is doc_id|word.)
>>>>> > >>
>>>>> > >> I'm sorry, but if you go through all of your examples of why you
>>>>> > >>would want to use a reducer, you end up finding that writing to
>>>>> > >>an HBase table would be faster than a reduce job.
>>>>> > >> (Again, we haven't done an exhaustive search, but in all of the
>>>>> > >>HBase jobs we've run... no reducers were necessary.)
>>>>> > >>
>>>>> > >> The point I'm trying to make is that you want to avoid using a
>>>>> > >>reducer whenever possible, and if you think about your problem...
>>>>> > >>you can probably come up with a solution that avoids the
>>>>> > >>reducer...
>>>>> > >>
>>>>> > >> HTH
>>>>> > >>
>>>>> > >> -Mike
>>>>> > >> PS. I haven't looked at *all* of the potential use cases of
>>>>> > >>HBase, which is why I don't want to say you'll never need a
>>>>> > >>reducer. I will say that, based on what we've done at my client's
>>>>> > >>site, we try very hard to avoid reducers.
>>>>> > >> [Note, I'm sure I'm going to get hammered on this when I head to
>>>>> > >>NY in Nov. :-) ]
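Mike's second option above might look like the following sketch, using HBase's atomic incrementColumnValue() so each mapper bumps the count server-side with no get/add/put round trip and no reducer. The table name ("wordcounts"), family ("f"), and qualifier ("count") are hypothetical; the doc_id|word rowkey variant is shown.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountIncrementer {

  private final HTable wordCounts;

  public WordCountIncrementer(Configuration conf) throws IOException {
    wordCounts = new HTable(conf, "wordcounts"); // hypothetical temp table
  }

  // Called from the mapper for each word seen in the column being counted.
  public void count(String docId, String word) throws IOException {
    byte[] rowKey = Bytes.toBytes(docId + "|" + word);
    // Atomic server-side increment: no read-modify-write cycle in the client.
    wordCounts.incrementColumnValue(rowKey, Bytes.toBytes("f"),
        Bytes.toBytes("count"), 1L);
  }
}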
>>>>> > >>>
>>>>> > >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>> > >>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> > >>>JDBCReducer ...
>>>>> > >>> From: [email protected]
>>>>> > >>> To: [email protected]
>>>>> > >>>
>>>>> > >>> Hi Michael,
>>>>> > >>>
>>>>> > >>> Yes, thanks, I understand the fact that reducers can be
>>>>> > >>>expensive with all the shuffling and the sorting, and you may
>>>>> > >>>not always need them. At the same time, there are many cases
>>>>> > >>>where reducers are useful, like secondary sorting. In many
>>>>> > >>>cases, one can have multiple map phases and not have a reduce
>>>>> > >>>phase at all. Again, there will be many cases where one may want
>>>>> > >>>a reducer, say trying to count the occurrence of words in a
>>>>> > >>>particular column.
>>>>> > >>>
>>>>> > >>> With this thought chain, I do not feel ready to say that when
>>>>> > >>>dealing with HBase, I really don't want to use a reducer. Please
>>>>> > >>>correct me if I am wrong.
>>>>> > >>>
>>>>> > >>> Thanks again.
>>>>> > >>>
>>>>> > >>> Best Regards,
>>>>> > >>> Sonal
>>>>> > >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> > >>> Nube Technologies <http://www.nubetech.co>
>>>>> > >>>
>>>>> > >>> <http://in.linkedin.com/in/sonalgoyal>
>>>>> > >>>
>>>>> > >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>> > >>> <[email protected]> wrote:
>>>>> > >>>>
>>>>> > >>>> Sonal,
>>>>> > >>>>
>>>>> > >>>> Just because you have an M/R job doesn't mean that you need to
>>>>> > >>>>reduce anything. You can have a job that contains only a mapper.
>>>>> > >>>> Or your job runner can have a series of map jobs in serial.
>>>>> > >>>>
>>>>> > >>>> Most if not all of the map/reduce jobs where we pull data from
>>>>> > >>>>HBase don't require a reducer.
>>>>> > >>>>
>>>>> > >>>> To give you a simple example... if I want to determine the
>>>>> > >>>>table schema where I am storing some sort of structured data...
>>>>> > >>>> I just write an M/R job which opens a table and scans it,
>>>>> > >>>>counting the occurrence of each column name via dynamic
>>>>> > >>>>counters.
>>>>> > >>>>
>>>>> > >>>> There is no need for a reducer.
>>>>> > >>>>
>>>>> > >>>> Does that help?
>>>>> > >>>>
>>>>> > >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>> > >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
>>>>> > >>>>>JDBCReducer ...
>>>>> > >>>>> From: [email protected]
>>>>> > >>>>> To: [email protected]
>>>>> > >>>>>
>>>>> > >>>>> Michel,
>>>>> > >>>>>
>>>>> > >>>>> Sorry, can you please help me understand what you mean when
>>>>> > >>>>>you say that when dealing with HBase, you really don't want to
>>>>> > >>>>>use a reducer? Here, HBase is being used as the input to the
>>>>> > >>>>>MR job.
>>>>> > >>>>>
>>>>> > >>>>> Thanks
>>>>> > >>>>> Sonal
>>>>> > >>>>>
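Michael's schema-discovery example above could be sketched as a map-only job whose TableMapper bumps one dynamic counter per column name; the framework aggregates counters across tasks, so no reducer is needed. The counter group name ("columns") is arbitrary.

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class SchemaCounterMapper extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // Dynamic counter: created on first use, one per distinct column name.
      context.getCounter("columns", column).increment(1);
    }
  }
}

After the job completes, the per-column totals can be read from job.getCounters() in the driver.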
>>>>> > >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>> > >>>>><[email protected]> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> I think you need to get a little bit more information.
>>>>> > >>>>>> Reducers are expensive.
>>>>> > >>>>>> When Thomas says that he is aggregating data, what exactly
>>>>> > >>>>>>does he mean?
>>>>> > >>>>>> When dealing with HBase, you really don't want to use a
>>>>> > >>>>>>reducer.
>>>>> > >>>>>>
>>>>> > >>>>>> You may want to run two map jobs, and it could be that just
>>>>> > >>>>>>dumping the output via JDBC makes the most sense.
>>>>> > >>>>>>
>>>>> > >>>>>> We are starting to see a lot of questions where the OP isn't
>>>>> > >>>>>>providing enough information, so the recommendation could be
>>>>> > >>>>>>wrong...
>>>>> > >>>>>>
>>>>> > >>>>>> Sent from a remote device. Please excuse any typos...
>>>>> > >>>>>>
>>>>> > >>>>>> Mike Segel
>>>>> > >>>>>>
>>>>> > >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal
>>>>> > >>>>>><[email protected]> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> There is a DBOutputFormat class in the
>>>>> > >>>>>>>org.apache.hadoop.mapreduce.lib.db package; you could use
>>>>> > >>>>>>>that. Or you could write to HDFS and then use something like
>>>>> > >>>>>>>HIHO[1] to export to the db. I have been working extensively
>>>>> > >>>>>>>in this area; you can write to me directly if you need any
>>>>> > >>>>>>>help.
>>>>> > >>>>>>>
>>>>> > >>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>> > >>>>>>>
>>>>> > >>>>>>> Best Regards,
>>>>> > >>>>>>> Sonal
>>>>> > >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> > >>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>> > >>>>>>>
>>>>> > >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>> > >>>>>>> [email protected]> wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>> Hello,
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> We are writing an MR job to process HBase data and store
>>>>> > >>>>>>>>aggregated data in Oracle. How would you do that in an MR
>>>>> > >>>>>>>>job?
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Currently, for test purposes, we write the result into an
>>>>> > >>>>>>>>HBase table again by using a TableReducer. Is there
>>>>> > >>>>>>>>something like an OracleReducer, RelationalReducer,
>>>>> > >>>>>>>>JDBCReducer, or whatever? Or should one simply use plain
>>>>> > >>>>>>>>JDBC code in the reduce step?
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thanks!
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Thomas
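To close the loop on Thomas's original question, the "plain JDBC in the reduce step" approach Doug points at (the read-summary pattern) might look roughly like this sketch. The Oracle connection string, credentials, and the event_totals table are hypothetical, and error handling is kept minimal.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Hypothetical connection settings; in practice read them from the job conf.
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "pass");
      conn.setAutoCommit(false);
      insert = conn.prepareStatement(
          "INSERT INTO event_totals (event_type, total) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException {
    // Sum the per-mapper partial amounts for this key, then write one row.
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, total);
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      conn.commit();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}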
