I'll add this to the book in the MR section.
On 9/16/11 8:22 PM, "Doug Meil" <[email protected]> wrote:

> I was in the middle of responding to Mike's email when yours arrived, so
> I'll respond to both.
>
> I think the temp-table idea is interesting. The caution is that a default
> temp-table creation will be hosted on a single RS and thus be a bottleneck
> for aggregation. So I would imagine that you would need to tune the
> temp-table for the job and pre-create regions.
>
> Doug
>
> On 9/16/11 8:16 PM, "Sam Seigal" <[email protected]> wrote:
>
>> I am trying to do something similar with HBase Map/Reduce.
>>
>> I have event ids and amounts stored in HBase in the following format:
>> prefix-event_id_type-timestamp-event_id as the row key and amount as
>> the value. I want to be able to aggregate the amounts based on the
>> event id type, and for this I am using a reducer. I basically reduce on
>> the event id type from the incoming row in the map phase, and perform
>> the aggregation in the reducer on the amounts for the event types. Then
>> I write the results back into HBase.
>>
>> I hadn't thought about writing values directly into a temp HBase table
>> in the map phase, as suggested by Mike.
>>
>> For this case, each mapper can declare its own mapperId_event_type row
>> with totalAmount and, for each row it receives, do a get, add the
>> current amount, and then a put. We are basically then doing a
>> get/add/put for every row that a mapper receives. Is this any more
>> efficient when compared to the overhead of sorting/partitioning for a
>> reducer?
>>
>> At the end of the map phase, aggregating the output of all the
>> mappers should be trivial.
>>
>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel
>> <[email protected]> wrote:
>>>
>>> Doug and company...
>>>
>>> Look, I'm not saying that there aren't M/R jobs where you might need
>>> reducers when working with HBase.
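Sam's per-mapper get/add/put idea is essentially in-mapper combining: hold running totals per event type in memory and emit one record per distinct key when the mapper finishes, instead of one record per input row. A minimal Python sketch of the idea (toy data, no HBase; the names are illustrative, not from the thread):

```python
# In-mapper combining: aggregate per event type inside the mapper,
# emitting one record per distinct key instead of one per input row.
from collections import defaultdict

rows = [("clickA", 10), ("clickB", 5), ("clickA", 7), ("clickB", 3), ("clickA", 1)]

def naive_map(rows):
    # One intermediate record per input row -> all of it gets shuffled/sorted.
    return [(event_type, amount) for event_type, amount in rows]

def combining_map(rows):
    # Accumulate totals in memory; emit one record per distinct key at the end.
    totals = defaultdict(int)
    for event_type, amount in rows:
        totals[event_type] += amount
    return list(totals.items())

print(len(naive_map(rows)))       # 5 intermediate records
print(dict(combining_map(rows)))  # {'clickA': 18, 'clickB': 8}
```

Whether this beats the reducer's shuffle depends on key cardinality: it wins when there are few distinct event types per mapper, since the per-mapper state stays small and the intermediate output shrinks.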
>>> What I am saying is that if we look at what you're attempting to do,
>>> you may end up getting better performance if you created a temp table
>>> in HBase and let HBase do some of the heavy lifting where you are
>>> currently using a reducer. From the jobs that we run, when we looked
>>> at what we were doing, there wasn't any need for a reducer. I suspect
>>> that's true of other jobs.
>>>
>>> Remember that HBase is much more than just an HFile format to persist
>>> stuff.
>>>
>>> Even looking at Sonal's example... you have other ways of doing the
>>> record counts, like dynamic counters or using a temp table in HBase,
>>> which I believe will give you better performance numbers, although I
>>> haven't benchmarked either against a reducer.
>>>
>>> Does that make sense?
>>>
>>> -Mike
>>>
>>>> From: [email protected]
>>>> To: [email protected]
>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>
>>>> Chris, agreed... There are situations where reducers aren't required,
>>>> and situations where they are useful. We have both kinds of jobs.
>>>>
>>>> For others following the thread, I updated the book recently with
>>>> more MR examples (read-only, read-write, read-summary):
>>>>
>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>
>>>> As to the question that started this thread...
>>>>
>>>> re: "Store aggregated data in Oracle."
>>>>
>>>> To me, that sounds like the "read-summary" example with JDBC-Oracle
>>>> in the reduce step.
>>>>
>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <[email protected]> wrote:
>>>>
>>>>> If only I could make NY in Nov :)
>>>>>
>>>>> We extract large numbers of DNA sequence reads from HBase, run them
>>>>> through M/R pipelines to analyze and aggregate, and then we load the
>>>>> results back in.
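Doug's caution about the temp-table approach is region hotspotting: a freshly created table is a single region on a single region server, so every mapper's writes funnel through one node. Pre-creating regions with evenly spaced split keys spreads that load. A rough sketch of computing such splits, assuming (hypothetically) that the temp table's row keys begin with an 8-hex-digit hash prefix:

```python
# Hypothetical helper: evenly spaced split keys over an 8-hex-digit
# row-key prefix, for pre-creating num_regions regions on a temp table
# so aggregation writes don't all land on one region server.
def split_keys(num_regions, key_space=0x100000000):
    step = key_space // num_regions
    # num_regions regions need num_regions - 1 split points.
    return ["%08x" % (i * step) for i in range(1, num_regions)]

print(split_keys(4))  # ['40000000', '80000000', 'c0000000']
```

In HBase itself these keys would be passed (as bytes) to the table-creation call that accepts split keys, so the regions exist before the job's first write.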
>>>>> Definitely specialized usage, but I could see other perfectly valid
>>>>> uses for reducers with HBase.
>>>>>
>>>>> -chris
>>>>>
>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>
>>>>>> Sonal,
>>>>>>
>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>
>>>>>> So again, why do you need a reducer? ;-)
>>>>>>
>>>>>> Using your example...
>>>>>> "Again, there will be many cases where one may want a reducer, say
>>>>>> trying to count the occurrence of words in a particular column."
>>>>>>
>>>>>> You can do this one of two ways...
>>>>>> 1) Dynamic counters in Hadoop.
>>>>>> 2) Use a temp table and auto-increment the value in a column which
>>>>>> contains the word count. (Fat row where rowkey is doc_id and column
>>>>>> is word, or rowkey is doc_id|word.)
>>>>>>
>>>>>> I'm sorry, but if you go through all of your examples of why you
>>>>>> would want to use a reducer, you end up finding out that writing to
>>>>>> an HBase table would be faster than a reduce job. (Again, we haven't
>>>>>> done an exhaustive search, but in all of the HBase jobs we've run...
>>>>>> no reducers were necessary.)
>>>>>>
>>>>>> The point I'm trying to make is that you want to avoid using a
>>>>>> reducer whenever possible, and if you think about your problem...
>>>>>> you can probably come up with a solution that avoids the reducer.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>> PS. I haven't looked at *all* of the potential use cases of HBase,
>>>>>> which is why I don't want to say you'll never need a reducer. I will
>>>>>> say that based on what we've done at my client's site, we try very
>>>>>> hard to avoid reducers.
>>>>>> [Note, I'm sure I'm going to get hammered on this when I head to
>>>>>> NY in Nov.
>>>>>> :-) ]
>>>>>>
>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Yes, thanks, I understand the fact that reducers can be expensive,
>>>>>>> with all the shuffling and the sorting, and you may not need them
>>>>>>> always. At the same time, there are many cases where reducers are
>>>>>>> useful, like secondary sorting. In many cases, one can have
>>>>>>> multiple map phases and not have a reduce phase at all. Again,
>>>>>>> there will be many cases where one may want a reducer, say trying
>>>>>>> to count the occurrence of words in a particular column.
>>>>>>>
>>>>>>> With this thought chain, I do not feel ready to say that when
>>>>>>> dealing with HBase, I really don't want to use a reducer. Please
>>>>>>> correct me if I am wrong.
>>>>>>>
>>>>>>> Thanks again.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Sonal
>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>
>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Sonal,
>>>>>>>>
>>>>>>>> Just because you have an M/R job doesn't mean that you need to
>>>>>>>> reduce anything. You can have a job that contains only a mapper.
>>>>>>>> Or your job runner can have a series of map jobs in serial.
>>>>>>>>
>>>>>>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>>>>>> HBase don't require a reducer.
>>>>>>>>
>>>>>>>> To give you a simple example...
>>>>>>>> if I want to determine the table schema where I am storing some
>>>>>>>> sort of structured data... I just write an M/R job which opens a
>>>>>>>> table and scans it, counting the occurrence of each column name
>>>>>>>> via dynamic counters.
>>>>>>>>
>>>>>>>> There is no need for a reducer.
>>>>>>>>
>>>>>>>> Does that help?
>>>>>>>>
>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>> From: [email protected]
>>>>>>>>> To: [email protected]
>>>>>>>>>
>>>>>>>>> Michel,
>>>>>>>>>
>>>>>>>>> Sorry, can you please help me understand what you mean when you
>>>>>>>>> say that when dealing with HBase, you really don't want to use a
>>>>>>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Sonal
>>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>> Reducers are expensive.
>>>>>>>>>> When Thomas says that he is aggregating data, what exactly does
>>>>>>>>>> he mean? When dealing with HBase, you really don't want to use
>>>>>>>>>> a reducer.
>>>>>>>>>>
>>>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>>>
>>>>>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>>>>>> providing enough information, so the recommendation could be
>>>>>>>>>> wrong...
>>>>>>>>>>
>>>>>>>>>> Sent from a remote device. Please excuse any typos...
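Mike's schema-discovery job is map-only: each mapper scans its slice of the table and bumps a counter per column qualifier it sees, and the framework aggregates the counters across all tasks, so no reducer is needed. A toy Python simulation of just the counting logic (the dicts stand in for HBase rows; in the real job these would be Hadoop dynamic counters incremented from the mapper):

```python
# Simulate the map-only schema-discovery scan: bump one "dynamic
# counter" per column qualifier seen in each row; no reduce phase.
from collections import Counter

table = [
    {"name": "a", "amount": 1},
    {"name": "b", "city": "NYC"},
    {"name": "c", "amount": 2, "city": "SFO"},
]

counters = Counter()
for row in table:            # the mapper sees one row at a time
    for column in row:       # one counter increment per column name
        counters[column] += 1

print(dict(counters))  # {'name': 3, 'amount': 2, 'city': 2}
```

The key property is that counter increments are commutative, so per-task tallies can be summed by the framework without any shuffle or sort.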
>>>>>>>>>>
>>>>>>>>>> Mike Segel
>>>>>>>>>>
>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use
>>>>>>>>>>> that. Or you could write to HDFS and then use something like
>>>>>>>>>>> HIHO [1] to export to the db. I have been working extensively
>>>>>>>>>>> in this area; you can write to me directly if you need any
>>>>>>>>>>> help.
>>>>>>>>>>>
>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Sonal
>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>
>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> We are writing an MR job to process HBase data and store
>>>>>>>>>>>> aggregated data in Oracle. How would you do that in an MR job?
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, for test purposes, we write the result into an
>>>>>>>>>>>> HBase table again by using a TableReducer. Is there something
>>>>>>>>>>>> like an OracleReducer, RelationalReducer, JDBCReducer or
>>>>>>>>>>>> whatever? Or should one simply use plain JDBC code in the
>>>>>>>>>>>> reduce step?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Thomas
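For Thomas's original question, the thread's answer is either DBOutputFormat or plain JDBC code in the reduce step; in both cases the practical pattern is the same: buffer the aggregated rows and write them to the database in batches rather than one round trip per record. A Python sketch of that batching pattern, using the standard library's sqlite3 purely as a stand-in for an Oracle/JDBC connection:

```python
# Stand-in for "plain JDBC code in the reduce step": collect the
# aggregated (key, total) pairs and flush them in one batched insert.
# sqlite3 substitutes for the Oracle connection here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (event_type TEXT PRIMARY KEY, amount INTEGER)")

# What the reduce (or map-side aggregation) step produced:
aggregates = [("clickA", 18), ("clickB", 8)]

# One batched statement instead of one round trip per record.
conn.executemany("INSERT INTO totals VALUES (?, ?)", aggregates)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM totals").fetchone()[0])  # 2
```

With many reduce tasks writing concurrently, the batch size and commit frequency also bound how hard the job hammers the target database, which is usually the real constraint in a read-summary job.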
