But, if I may follow up on my own message: I'll definitely keep my eyes open for times when we really don't need a reducer. I can see what you are saying, and that people should think a bit more laterally and use HBase for different, and potentially more efficient, workflows.
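For readers following the bulk-import point in the quoted thread below, here is a minimal sketch of a bulk-load driver (the table name, output path, and mapper are hypothetical; the APIs are the HBase 0.90-era `org.apache.hadoop.hbase.mapreduce` classes). It shows why "bulk importing uses a reducer": `configureIncrementalLoad()` wires in a sort reducer and a partitioner matched to the table's regions so the HFiles come out sorted per region, even when your own logic is map-only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-temp-table");
    job.setJarByClass(BulkLoadJob.class);
    // job.setMapperClass(...): your analysis mapper, emitting
    // ImmutableBytesWritable row keys and KeyValue/Put values.

    HTable table = new HTable(conf, "temp_results"); // hypothetical table name
    // Adds the sort reducer and a TotalOrderPartitioner matched to the
    // table's current region boundaries -- this is the implicit reducer:
    HFileOutputFormat.configureIncrementalLoad(job, table);

    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
    // Afterwards, move the HFiles into the table with the
    // completebulkload tool, e.g.:
    //   hadoop jar hbase-<version>.jar completebulkload <outdir> temp_results
  }
}
```

This requires a running HBase cluster and the HBase/Hadoop jars on the classpath, so it is a sketch rather than something you can run standalone.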
-chris

On Sep 16, 2011, at 2:54 PM, Chris Tarnas wrote:

> Hi Mike,
>
> It's analysis* and aggregation, not just aggregation, so it's a bit more
> complex. Each row in the input generates at least one new row of data when
> we are done.
>
> For our data sizes (~1 billion 2-3 KB rows per job now, and growing) we
> originally did normal inserts, but then we switched to bulk imports - it
> was much faster and put a lot less stress on the regionservers. Bulk
> importing uses a reducer, so even if we went through and changed our M/R
> pipelines to use a temp table for organized intermediate data, the most
> efficient way to populate the temp table would be via the bulk loader -
> using a reducer anyway.
>
> -chris
>
> * Sorry to be broad, but for business reasons I can't talk too much about
> the analysis details.
>
>
> On Sep 16, 2011, at 1:11 PM, Michael Segel wrote:
>
>> Chris,
>>
>> I don't know what sort of aggregation you are doing, but again, why not
>> write to a temp table instead of using a reducer?
>>
>>
>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>> From: [email protected]
>>> Date: Fri, 16 Sep 2011 11:58:05 -0700
>>> To: [email protected]
>>>
>>> If only I could make NY in Nov :)
>>>
>>> We extract large numbers of DNA sequence reads from HBase, run them
>>> through M/R pipelines to analyze and aggregate, and then we load the
>>> results back in. Definitely specialized usage, but I could see other
>>> perfectly valid uses for reducers with HBase.
>>>
>>> -chris
>>>
>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>
>>>> Sonal,
>>>>
>>>> You do realize that HBase is a "database", right? ;-)
>>>>
>>>> So again, why do you need a reducer? ;-)
>>>>
>>>> Using your example...
>>>> "Again, there will be many cases where one may want a reducer, say
>>>> trying to count the occurrence of words in a particular column."
>>>>
>>>> You can do this one of two ways...
>>>> 1) Dynamic counters in Hadoop.
>>>> 2) Use a temp table and auto-increment the value in a column which
>>>> contains the word count. (Fat row where the rowkey is doc_id and the
>>>> column is word, or the rowkey is doc_id|word.)
>>>>
>>>> I'm sorry, but if you go through all of your examples of why you would
>>>> want to use a reducer, you end up finding that writing to an HBase
>>>> table would be faster than a reduce job.
>>>> (Again, we haven't done an exhaustive search, but in all of the HBase
>>>> jobs we've run... no reducers were necessary.)
>>>>
>>>> The point I'm trying to make is that you want to avoid using a reducer
>>>> whenever possible, and if you think about your problem... you can
>>>> probably come up with a solution that avoids the reducer...
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> PS. I haven't looked at *all* of the potential use cases of HBase,
>>>> which is why I don't want to say you'll never need a reducer. I will
>>>> say that, based on what we've done at my client's site, we try very
>>>> hard to avoid reducers. [Note: I'm sure I'm going to get hammered on
>>>> this when I head to NY in Nov. :-) ]
>>>>
>>>>
>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Yes, thanks, I understand that reducers can be expensive with all the
>>>>> shuffling and sorting, and that you may not always need them. At the
>>>>> same time, there are many cases where reducers are useful, like
>>>>> secondary sorting. In many cases, one can have multiple map phases
>>>>> and no reduce phase at all. Again, there will be many cases where one
>>>>> may want a reducer, say trying to count the occurrence of words in a
>>>>> particular column.
>>>>>
>>>>> With this thought chain, I do not feel ready to say that when dealing
>>>>> with HBase, I really don't want to use a reducer. Please correct me
>>>>> if I am wrong.
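Mike's option 2 above (a temp table with auto-incremented counts, no reducer) can be sketched roughly as follows. The table name, column family, and helper method are hypothetical; the API is the 0.90-era HBase client, where `incrementColumnValue` performs an atomic server-side increment.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountToTempTable {
  private HTable counts;

  public void setup() throws IOException {
    // Hypothetical temp table holding the running counts.
    counts = new HTable(HBaseConfiguration.create(), "word_counts");
  }

  // Called from the mapper for every word seen. HBase serializes the
  // increments server-side, so no shuffle/sort/reduce phase is needed.
  public void countWord(String docId, String word) throws IOException {
    counts.incrementColumnValue(
        Bytes.toBytes(docId),   // fat row: rowkey = doc_id
        Bytes.toBytes("wc"),    // column family
        Bytes.toBytes(word),    // qualifier = the word itself
        1L);                    // amount to add
  }
}
```

The trade-off is one RPC per increment versus one sorted shuffle; batching or client-side pre-aggregation can reduce the RPC load.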
>>>>> Thanks again.
>>>>>
>>>>> Best Regards,
>>>>> Sonal
>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>
>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>
>>>>>
>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Sonal,
>>>>>>
>>>>>> Just because you have a M/R job doesn't mean that you need to reduce
>>>>>> anything. You can have a job that contains only a mapper.
>>>>>> Or your job runner can have a series of map jobs in serial.
>>>>>>
>>>>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>>>> HBase don't require a reducer.
>>>>>>
>>>>>> To give you a simple example... if I want to determine the table
>>>>>> schema where I am storing some sort of structured data, I just write
>>>>>> a M/R job which opens the table and scans it, counting the
>>>>>> occurrence of each column name via dynamic counters.
>>>>>>
>>>>>> There is no need for a reducer.
>>>>>>
>>>>>> Does that help?
>>>>>>
>>>>>>
>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>> From: [email protected]
>>>>>>> To: [email protected]
>>>>>>>
>>>>>>> Michel,
>>>>>>>
>>>>>>> Sorry, can you please help me understand what you mean when you say
>>>>>>> that when dealing with HBase, you really don't want to use a
>>>>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Sonal
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think you need to get a little bit more information.
>>>>>>>> Reducers are expensive.
>>>>>>>> When Thomas says that he is aggregating data, what exactly does he
>>>>>>>> mean?
>>>>>>>> When dealing with HBase, you really don't want to use a reducer.
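The map-only "schema discovery" job Mike describes above could look roughly like this (a sketch, assuming the 0.90-era `TableMapper` API; the class name is made up). Each distinct column name becomes a dynamic Hadoop counter, so the framework does the aggregation and no reducer is configured.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class ColumnCountMapper
    extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
                    + Bytes.toString(kv.getQualifier());
      // Dynamic counter: one counter per distinct column name seen.
      context.getCounter("columns", column).increment(1);
    }
    // Nothing is emitted; set job.setNumReduceTasks(0) in the driver and
    // read the counters from the job after it completes.
  }
}
```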
>>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>>
>>>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>>>> providing enough information, so the recommendation could be
>>>>>>>> wrong...
>>>>>>>>
>>>>>>>>
>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>
>>>>>>>> Mike Segel
>>>>>>>>
>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use that.
>>>>>>>>> Or you could write to HDFS and then use something like HIHO[1] to
>>>>>>>>> export to the db. I have been working extensively in this area;
>>>>>>>>> you can write to me directly if you need any help.
>>>>>>>>>
>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Sonal
>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>
>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We are writing a MR job to process HBase data and store
>>>>>>>>>> aggregated data in Oracle. How would you do that in a MR job?
>>>>>>>>>>
>>>>>>>>>> Currently, for test purposes, we write the result into an HBase
>>>>>>>>>> table again by using a TableReducer. Is there something like an
>>>>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
>>>>>>>>>> should one simply use plain JDBC code in the reduce step?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Thomas
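Tying the thread together: Sonal's DBOutputFormat suggestion, applied to Thomas's original question, could be sketched like this. The table name, columns, JDBC URL, and credentials are hypothetical, and the Oracle JDBC driver jar would need to be on the task classpath; the Hadoop classes are the real `org.apache.hadoop.mapreduce.lib.db` API.

```java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleExport {

  // One output row: (word, total). DBOutputFormat writes the reducer's
  // key via DBWritable.write(); the value is ignored.
  public static class CountRecord implements DBWritable {
    String word;
    long total;
    CountRecord() {}
    CountRecord(String word, long total) { this.word = word; this.total = total; }
    public void write(PreparedStatement st) throws SQLException {
      st.setString(1, word);
      st.setLong(2, total);
    }
    public void readFields(ResultSet rs) throws SQLException {
      // Not used when the table is output-only.
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, CountRecord, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      ctx.write(new CountRecord(key.toString(), sum), NullWritable.get());
    }
  }

  public static void configure(Job job) throws IOException {
    DBConfiguration.configureDB(job.getConfiguration(),
        "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:orcl",  // hypothetical connection string
        "scott", "tiger");                     // hypothetical credentials
    job.setOutputFormatClass(DBOutputFormat.class);
    // Generates: INSERT INTO word_counts (word, total) VALUES (?, ?)
    DBOutputFormat.setOutput(job, "word_counts", "word", "total");
    job.setReducerClass(SumReducer.class);
  }
}
```

The alternative Thomas mentions, plain JDBC in the reduce step, also works; DBOutputFormat mainly saves the connection and batching boilerplate.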
