Chris, I don't know what sort of aggregation you are doing, but again, why not write to a temp table instead of using a reducer?
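
A minimal sketch of that temp-table approach, assuming the 0.90-era HBase client API — the temp table "word_counts_tmp" and the source/target family and qualifier names ("d"/"text", "c"/"count") are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Map-only word count: each mapper folds its counts straight into a temp
// HBase table with atomic increments, so there is no shuffle, sort or reducer.
public class WordCountToTempTable
    extends TableMapper<ImmutableBytesWritable, Result> {

  // All table/family/qualifier names below are made up for the example.
  private static final byte[] SRC_CF  = Bytes.toBytes("d");
  private static final byte[] SRC_COL = Bytes.toBytes("text");
  private static final byte[] CNT_CF  = Bytes.toBytes("c");
  private static final byte[] CNT_COL = Bytes.toBytes("count");

  private HTable counts;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    counts = new HTable(conf, "word_counts_tmp");  // pre-created temp table
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    byte[] cell = value.getValue(SRC_CF, SRC_COL);
    if (cell == null) {
      return;
    }
    for (String word : Bytes.toString(cell).split("\\s+")) {
      if (word.isEmpty()) {
        continue;
      }
      // Atomic server-side increment: concurrent mappers can't clobber each other.
      counts.incrementColumnValue(Bytes.toBytes(word), CNT_CF, CNT_COL, 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    counts.close();
  }
}

The driver just wires the scan with TableMapReduceUtil.initTableMapperJob(...), sets job.setNumReduceTasks(0), and can use NullOutputFormat since the mapper writes its own output; when the job finishes, each row of word_counts_tmp holds one word's total in c:count.
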
> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> From: [email protected]
> Date: Fri, 16 Sep 2011 11:58:05 -0700
> To: [email protected]
>
> If only I could make NY in Nov :)
>
> We extract large numbers of DNA sequence reads from HBase, run them
> through M/R pipelines to analyze and aggregate, and then we load the
> results back in. Definitely specialized usage, but I could see other
> perfectly valid uses for reducers with HBase.
>
> -chris
>
> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>
> > Sonal,
> >
> > You do realize that HBase is a "database", right? ;-)
> >
> > So again, why do you need a reducer? ;-)
> >
> > Using your example...
> > "Again, there will be many cases where one may want a reducer, say
> > trying to count the occurrence of words in a particular column."
> >
> > You can do this one of two ways...
> > 1) Dynamic counters in Hadoop.
> > 2) Use a temp table and auto-increment the value in a column which
> > contains the word count. (Fat row where the rowkey is doc_id and the
> > column is the word, or the rowkey is doc_id|word.)
> >
> > I'm sorry, but if you go through all of your examples of why you would
> > want to use a reducer, you end up finding that writing to an HBase
> > table would be faster than a reduce job.
> > (Again, we haven't done an exhaustive search, but in all of the HBase
> > jobs we've run... no reducers were necessary.)
> >
> > The point I'm trying to make is that you want to avoid using a reducer
> > whenever possible, and if you think about your problem... you can
> > probably come up with a solution that avoids the reducer...
> >
> > HTH
> >
> > -Mike
> >
> > PS. I haven't looked at *all* of the potential use cases of HBase,
> > which is why I don't want to say you'll never need a reducer. I will
> > say that, based on what we've done at my client's site, we try very
> > hard to avoid reducers.
> > [Note, I'm sure I'm going to get hammered on this when I head to NY
> > in Nov. :-) ]
> >
> >> Date: Fri, 16 Sep 2011 23:00:49 +0530
> >> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> Hi Michael,
> >>
> >> Yes, thanks, I understand that reducers can be expensive with all the
> >> shuffling and the sorting, and you may not always need them. At the
> >> same time, there are many cases where reducers are useful, like
> >> secondary sorting. In many cases, one can have multiple map phases and
> >> not have a reduce phase at all. Again, there will be many cases where
> >> one may want a reducer, say trying to count the occurrence of words in
> >> a particular column.
> >>
> >> With this thought chain, I do not feel ready to say that when dealing
> >> with HBase, I really don't want to use a reducer. Please correct me if
> >> I am wrong.
> >>
> >> Thanks again.
> >>
> >> Best Regards,
> >> Sonal
> >> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >> Nube Technologies <http://www.nubetech.co>
> >>
> >> <http://in.linkedin.com/in/sonalgoyal>
> >>
> >> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
> >> <[email protected]> wrote:
> >>
> >>> Sonal,
> >>>
> >>> Just because you have an M/R job doesn't mean that you need to reduce
> >>> anything. You can have a job that contains only a mapper.
> >>> Or your job runner can have a series of map jobs in serial.
> >>>
> >>> Most, if not all, of the map/reduce jobs where we pull data from HBase
> >>> don't require a reducer.
> >>>
> >>> To give you a simple example... if I want to determine the table
> >>> schema where I am storing some sort of structured data...
> >>> I just write an M/R job which opens a table and scans it, counting
> >>> the occurrence of each column name via dynamic counters.
> >>>
> >>> There is no need for a reducer.
> >>>
> >>> Does that help?
> >>>
> >>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> >>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>>> From: [email protected]
> >>>> To: [email protected]
> >>>>
> >>>> Michel,
> >>>>
> >>>> Sorry, can you please help me understand what you mean when you say
> >>>> that when dealing with HBase, you really don't want to use a reducer?
> >>>> Here, HBase is being used as the input to the MR job.
> >>>>
> >>>> Thanks
> >>>> Sonal
> >>>>
> >>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> I think you need to get a little bit more information.
> >>>>> Reducers are expensive.
> >>>>> When Thomas says that he is aggregating data, what exactly does he
> >>>>> mean?
> >>>>> When dealing with HBase, you really don't want to use a reducer.
> >>>>>
> >>>>> You may want to run two map jobs, and it could be that just dumping
> >>>>> the output via JDBC makes the most sense.
> >>>>>
> >>>>> We are starting to see a lot of questions where the OP isn't
> >>>>> providing enough information, so the recommendation could be wrong...
> >>>>>
> >>>>> Sent from a remote device. Please excuse any typos...
> >>>>>
> >>>>> Mike Segel
> >>>>>
> >>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
> >>>>>
> >>>>>> There is a DBOutputFormat class in the
> >>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use that. Or
> >>>>>> you could write to HDFS and then use something like HIHO [1] to
> >>>>>> export to the db. I have been working extensively in this area;
> >>>>>> you can write to me directly if you need any help.
> >>>>>>
> >>>>>> 1. https://github.com/sonalgoyal/hiho
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Sonal
> >>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >>>>>> Nube Technologies <http://www.nubetech.co>
> >>>>>>
> >>>>>> <http://in.linkedin.com/in/sonalgoyal>
> >>>>>>
> >>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We are writing an MR job to process HBase data and store the
> >>>>>>> aggregated data in Oracle. How would you do that in an MR job?
> >>>>>>>
> >>>>>>> Currently, for test purposes, we write the result into an HBase
> >>>>>>> table again by using a TableReducer. Is there something like an
> >>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
> >>>>>>> should one simply use plain JDBC code in the reduce step?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Thomas
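
To make the DBOutputFormat route concrete for the original question, a minimal sketch — the target Oracle table WORD_COUNTS(WORD, CNT), the JDBC URL, and the credentials are placeholders, and the reducer assumes the mappers emit (word, partial count) pairs:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class JdbcExport {

  // One row of the (placeholder) target Oracle table; DBOutputFormat binds
  // each emitted record to an INSERT statement.
  public static class AggRecord implements Writable, DBWritable {
    private String word;
    private long count;

    public AggRecord() { }

    public AggRecord(String word, long count) {
      this.word = word;
      this.count = count;
    }

    // DBWritable: set the INSERT parameters in the same order as the
    // field names passed to DBOutputFormat.setOutput(...).
    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, word);
      stmt.setLong(2, count);
    }

    public void readFields(ResultSet rs) throws SQLException {
      word = rs.getString(1);
      count = rs.getLong(2);
    }

    // Writable: only needed so the framework can move the object around.
    public void write(DataOutput out) throws IOException {
      out.writeUTF(word);
      out.writeLong(count);
    }

    public void readFields(DataInput in) throws IOException {
      word = in.readUTF();
      count = in.readLong();
    }
  }

  // Sums the per-mapper partial counts; DBOutputFormat turns each emitted
  // key into a single INSERT against the configured table.
  public static class SumToJdbcReducer
      extends Reducer<Text, LongWritable, AggRecord, NullWritable> {
    @Override
    protected void reduce(Text word, Iterable<LongWritable> partials, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable p : partials) {
        total += p.get();
      }
      context.write(new AggRecord(word.toString(), total), NullWritable.get());
    }
  }
}

The driver would point it at the database with DBConfiguration.configureDB(job.getConfiguration(), "oracle.jdbc.OracleDriver", jdbcUrl, user, password), then DBOutputFormat.setOutput(job, "WORD_COUNTS", "WORD", "CNT"), job.setOutputFormatClass(DBOutputFormat.class) and job.setOutputKeyClass(AggRecord.class); all of the connection details and table/column names are placeholders. The alternative the original question mentions — opening a plain JDBC connection in the reduce step — works too, it just means managing the connection and batching yourself.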

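And a minimal sketch of the dynamic-counter pattern described above for schema discovery — the counter group and names are just strings invented on the fly from whatever family:qualifier pairs the scan encounters, so no reducer (or output table) is needed. Keep in mind that Hadoop caps the number of distinct counters per job, so this fits bounded name sets like column names rather than an open-ended vocabulary:

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Map-only schema census: every mapper bumps a dynamic counter named after
// each column it sees, so the job's counter summary ends up showing how often
// each family:qualifier occurs across the table.
public class ColumnCensusMapper
    extends TableMapper<ImmutableBytesWritable, Result> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      // Dynamic counter: the group/name pair is created on demand.
      context.getCounter("ColumnNames", column).increment(1L);
    }
  }
}

After the job completes, job.getCounters() (or the job tracker's web UI) has the per-column totals; the same trick covers the word-count case as long as the set of distinct words stays small.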