Map-task heap size would definitely be a concern, but since the HashMap would only contain aggregations, it should hold far fewer entries than the number of rows passed into the mapper.
At least that's how I'd use it.
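A minimal sketch of the idea (all table, family, and qualifier names are hypothetical, and it uses incrementColumnValue in cleanup rather than the checkAndPut loop described above, so that concurrent map tasks can safely add into the same summary cells):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class AggregatingMapper extends TableMapper<NullWritable, NullWritable> {

  // One entry per distinct aggregate key -- far fewer than the input rows.
  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) {
    String aggKey = extractEventType(row.get());   // hypothetical key parsing
    Long current = counts.get(aggKey);
    counts.put(aggKey, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Single pass over the map at the end of the task: one write per
    // aggregate, not one per input row.
    HTable summary = new HTable(context.getConfiguration(), "summary");
    try {
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        summary.incrementColumnValue(Bytes.toBytes(e.getKey()),
            Bytes.toBytes("agg"), Bytes.toBytes("count"), e.getValue());
      }
    } finally {
      summary.close();
    }
  }

  private String extractEventType(byte[] rowKey) {
    // Hypothetical: prefix-event_id_type-timestamp-event_id row keys.
    return Bytes.toString(rowKey).split("-")[1];
  }
}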
On 9/16/11 9:39 PM, "Sam Seigal" <[email protected]> wrote:

> Aren't there memory considerations with this approach? I would assume the HashMap can get pretty big if it retains in memory every record that passes through. (Apologies if I am being ignorant with my limited knowledge of Hadoop's internal workings.)
>
> On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil <[email protected]> wrote:
>>
>> However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and the mapper then made a single pass over this map during the cleanup method and did the checkAndPuts, the writes would only happen once per map-task, not on a per-row basis (which would be really expensive).
>>
>> A single region on a single RS could handle that no problem.
>>
>> On 9/16/11 9:00 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> I see what you are saying about the temp table being hosted at a single region server - especially for a limited set of rows that just care about the aggregations, but receive a lot of traffic. I wonder whether this would also be the case if I used the source table to maintain these temporary records, and did not create a temp table on the fly ...
>>>
>>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <[email protected]> wrote:
>>>>
>>>> I'll add this to the book in the MR section.
>>>>
>>>> On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:
>>>>
>>>>> I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.
>>>>>
>>>>> I think the temp-table idea is interesting. The caution is that a default temp-table creation will be hosted on a single RS and thus be a bottleneck for aggregation. So I would imagine that you would need to tune the temp-table for the job and pre-create regions.
>>>>>
>>>>> Doug
>>>>>
>>>>> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>>>>>
>>>>>> I am trying to do something similar with HBase Map/Reduce.
>>>>>>
>>>>>> I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key and amount as the value.
>>>>>> I want to be able to aggregate the amounts based on the event id type, and for this I am using a reducer. I basically key on the event_id_type from the incoming row in the map phase, and perform the aggregation in the reducer on the amounts for each event type. Then I write the results back into HBase.
>>>>>>
>>>>>> I hadn't thought about writing values directly into a temp HBase table in the map phase, as suggested by Mike.
>>>>>>
>>>>>> For this case, each mapper can declare its own mapperId_event_type row with a totalAmount, and for each row it receives, do a get, add the current amount, and then a put. We are then basically doing a get/add/put for every row that a mapper receives. Is this any more efficient when compared to the overhead of sorting/partitioning for a reducer?
>>>>>>
>>>>>> At the end of the mapping phase, aggregating the output of all the mappers should be trivial.
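For reference, a minimal sketch of the reducer-based approach Sam describes (family/qualifier names hypothetical; it assumes the map phase emits the event type as a Text key with the amount as a LongWritable):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class AmountSumReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text eventType, Iterable<LongWritable> amounts, Context context)
      throws IOException, InterruptedException {
    // All amounts for one event_id_type arrive here after the shuffle/sort.
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    byte[] row = Bytes.toBytes(eventType.toString());
    Put put = new Put(row);
    put.add(Bytes.toBytes("agg"), Bytes.toBytes("total"), Bytes.toBytes(total));
    context.write(new ImmutableBytesWritable(row), put);
  }
}

The shuffle/sort that feeds this reducer is exactly the overhead the mapper-side variants in this thread avoid.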
>>>>>>
>>>>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <[email protected]> wrote:
>>>>>>>
>>>>>>> Doug and company...
>>>>>>>
>>>>>>> Look, I'm not saying that there aren't M/R jobs where you might need reducers when working with HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you created a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that's true of other jobs.
>>>>>>>
>>>>>>> Remember that HBase is much more than just an HFile format to persist stuff.
>>>>>>>
>>>>>>> Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.
>>>>>>>
>>>>>>> Does that make sense?
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>>> From: [email protected]
>>>>>>>> To: [email protected]
>>>>>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>
>>>>>>>> Chris, agreed... There are situations where reducers aren't required, and situations where they are useful. We have both kinds of jobs.
>>>>>>>>
>>>>>>>> For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
>>>>>>>>
>>>>>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>>>>>
>>>>>>>> As to the question that started this thread...
>>>>>>>>
>>>>>>>> re: "Store aggregated data in Oracle."
>>>>>>>>
>>>>>>>> To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step.
>>>>>>>>
>>>>>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> If only I could make NY in Nov :)
>>>>>>>>>
>>>>>>>>> We extract large numbers of DNA sequence reads from HBase, run them through M/R pipelines to analyze and aggregate, and then load the results back in. Definitely specialized usage, but I could see other perfectly valid uses for reducers with HBase.
>>>>>>>>>
>>>>>>>>> -chris
>>>>>>>>>
>>>>>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>>>>>
>>>>>>>>>> Sonal,
>>>>>>>>>>
>>>>>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>>>>>
>>>>>>>>>> So again, why do you need a reducer? ;-)
>>>>>>>>>>
>>>>>>>>>> Using your example... "Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column."
>>>>>>>>>>
>>>>>>>>>> You can do this one of two ways...
>>>>>>>>>> 1) Dynamic counters in Hadoop.
>>>>>>>>>> 2) Use a temp table and auto-increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word.)
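A minimal sketch of option 2, assuming a hypothetical pre-created "wordcount" temp table and the doc_id|word rowkey variant (option 1, dynamic counters, appears in the schema-survey sketch further down):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class WordCountMapper extends TableMapper<NullWritable, NullWritable> {
  private HTable wordCounts;

  @Override
  protected void setup(Context context) throws IOException {
    wordCounts = new HTable(context.getConfiguration(), "wordcount");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context) throws IOException {
    String docId = Bytes.toString(row.get());
    // Hypothetical column holding the document text.
    byte[] text = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
    if (text == null) return;
    for (String word : Bytes.toString(text).split("\\s+")) {
      // Atomic increment: no get/add/put round trip, no shuffle, no reducer.
      wordCounts.incrementColumnValue(Bytes.toBytes(docId + "|" + word),
          Bytes.toBytes("wc"), Bytes.toBytes("count"), 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    wordCounts.close();
  }
}

Buffering the counts in a HashMap per task, as Doug suggests at the top of the thread, would cut the number of increment RPCs further.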
>>>>>>>>>>
>>>>>>>>>> I'm sorry, but if you go through all of your examples of why you would want to use a reducer, you end up finding that writing to an HBase table would be faster than a reduce job. (Again, we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.)
>>>>>>>>>>
>>>>>>>>>> The point I'm trying to make is that you want to avoid using a reducer whenever possible, and if you think about your problem... you can probably come up with a solution that avoids the reducer...
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>> PS. I haven't looked at *all* of the potential use cases of HBase, which is why I don't want to say you'll never need a reducer. I will say that, based on what we've done at my client's site, we try very hard to avoid reducers.
>>>>>>>>>> [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ]
>>>>>>>>>>
>>>>>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>> From: [email protected]
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> Yes, thanks, I understand that reducers can be expensive with all the shuffling and the sorting, and that you may not always need them. At the same time, there are many cases where reducers are useful, like secondary sorting. In many cases, one can have multiple map phases and no reduce phase at all. Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column.
>>>>>>>>>>>
>>>>>>>>>>> With this thought chain, I do not feel ready to say that when dealing with HBase, I really don't want to use a reducer. Please correct me if I am wrong.
>>>>>>>>>>>
>>>>>>>>>>> Thanks again.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sonal
>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Sonal,
>>>>>>>>>>>>
>>>>>>>>>>>> Just because you have an M/R job doesn't mean that you need to reduce anything. You can have a job that contains only a mapper. Or your job runner can have a series of map jobs in serial.
>>>>>>>>>>>>
>>>>>>>>>>>> Most if not all of the map/reduce jobs where we pull data from HBase don't require a reducer.
>>>>>>>>>>>>
>>>>>>>>>>>> To give you a simple example... if I want to determine the table schema where I am storing some sort of structured data... I just write an M/R job which opens a table and scans it, counting the occurrence of each column name via dynamic counters.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no need for a reducer.
>>>>>>>>>>>>
>>>>>>>>>>>> Does that help?
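A minimal sketch of that schema-survey job as a map-only job with dynamic counters (table and job names hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SchemaSurvey {

  static class SchemaMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      for (KeyValue kv : value.raw()) {
        // One dynamic counter per column name, created on first use.
        String column = Bytes.toString(kv.getFamily()) + ":" + Bytes.toString(kv.getQualifier());
        context.getCounter("schema", column).increment(1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "schema-survey");
    job.setJarByClass(SchemaSurvey.class);
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
        SchemaMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);   // map-only: no reducer, no shuffle
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After completion, the column names and their occurrence counts can be read straight from the job's counters.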
>>>>>>>>>>>>
>>>>>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>>> From: [email protected]
>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Michel,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, can you please help me understand what you mean when you say that when dealing with HBase, you really don't want to use a reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>>>>>> Reducers are expensive.
>>>>>>>>>>>>>> When Thomas says that he is aggregating data, what exactly does he mean?
>>>>>>>>>>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You may want to run two map jobs, and it could be that just dumping the output via JDBC makes the most sense.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are starting to see a lot of questions where the OP isn't providing enough information, so the recommendation could be wrong...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db package; you could use that. Or you could write to HDFS and then use something like HIHO [1] to export to the db. I have been working extensively in this area; you can write to me directly if you need any help.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
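A minimal sketch of the DBOutputFormat route Sonal mentions, assuming a hypothetical Oracle table AGG(EVENT_TYPE, TOTAL) and the Oracle thin driver on the task classpath; the reducer would then emit an AggRecord as its output key (DBOutputFormat writes its keys), with a NullWritable value:

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleSink {

  // One database row bound per record written by the reducer.
  public static class AggRecord implements DBWritable {
    private String eventType;
    private long total;

    public AggRecord(String eventType, long total) {
      this.eventType = eventType;
      this.total = total;
    }

    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, eventType);
      stmt.setLong(2, total);
    }

    public void readFields(ResultSet rs) throws SQLException {
      eventType = rs.getString(1);
      total = rs.getLong(2);
    }
  }

  public static void configure(Job job) throws IOException {
    // Hypothetical connection details.
    DBConfiguration.configureDB(job.getConfiguration(),
        "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
    DBOutputFormat.setOutput(job, "AGG", "EVENT_TYPE", "TOTAL");
    job.setOutputFormatClass(DBOutputFormat.class);
  }
}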
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are writing an MR job to process HBase data and store aggregated data in Oracle. How would you do that in an MR job?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently, for test purposes, we write the result into an HBase table again by using a TableReducer. Is there something like an OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should one simply use plain JDBC code in the reduce step?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thomas
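And a minimal sketch of the plain-JDBC alternative Thomas asks about: one connection per reduce task, opened in setup() and flushed once in cleanup() (connection details and table layout hypothetical):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcSummaryReducer extends Reducer<Text, LongWritable, NullWritable, NullWritable> {
  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "secret");
      conn.setAutoCommit(false);
      insert = conn.prepareStatement("INSERT INTO AGG (EVENT_TYPE, TOTAL) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text eventType, Iterable<LongWritable> amounts, Context context)
      throws IOException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    try {
      insert.setString(1, eventType.toString());
      insert.setLong(2, total);
      insert.addBatch();   // batch the rows; flush once per task in cleanup
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      insert.executeBatch();
      conn.commit();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}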
