Guys,

Ok... You're putting a lot of thought into this, which is a good thing.

I really haven't looked at the bulk load, so I have some homework :-)

In response to your discussion...

1) How fast is fast enough? I mean, sure, if you create a temp table on the fly, you could end up w a single region becoming a hot spot. Is it more than just a bottleneck, or can you hurt your RS and HBase? If it's only a bottleneck, remember that this is only a temp table. You have control of setting the max file size and pre-splitting.

2) KISS. The first step is realizing that you have a database, so why not take advantage of it? :-) Your first iteration may not be the most efficient solution, but it should be faster than using a reducer and/or combiner/reducer. Sure, there's no free lunch, but using the HBase tables should be more efficient. I'm not suggesting that this is always going to be faster, or better, but that from the problem sets we have worked with... it made more sense. (Ok, I'm an old database guy... so my opinion is skewed...)

3) Keeping data till the end of the task may work for some jobs. In the cleanup() method you could write out the data, provided you have enough memory... I'm sure there are pros and cons to it... but it's a good design idea to think about.

It's really cool that people are now thinking about this...

Sent from a remote device. Please excuse any typos...

Mike Segel
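A minimal sketch of point 3 combined with Doug's HashMap idea below: aggregate into a map during map(), then flush once per task in cleanup(). It assumes the 0.90-era client API, Sam's prefix-event_id_type-timestamp-event_id row keys, and a made-up temp table "aggTemp" with family "f". It also uses incrementColumnValue rather than the checkAndPut loop Doug mentions, so concurrent map tasks merge their partial sums atomically.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;

    public class InMapAggregatingMapper extends TableMapper<NullWritable, NullWritable> {

      private static final byte[] FAMILY = Bytes.toBytes("f");

      private final Map<String, Long> sums = new HashMap<String, Long>();
      private HTable tempTable;

      @Override
      protected void setup(Context context) throws IOException {
        tempTable = new HTable(context.getConfiguration(), "aggTemp");
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context) {
        // Row key format from the thread: prefix-event_id_type-timestamp-event_id
        String eventType = Bytes.toString(row.get()).split("-")[1];
        long amount = Bytes.toLong(value.getValue(FAMILY, Bytes.toBytes("amount")));
        Long current = sums.get(eventType);
        sums.put(eventType, current == null ? amount : current + amount);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // One write per distinct event type per map task, not per input row.
        for (Map.Entry<String, Long> e : sums.entrySet()) {
          tempTable.incrementColumnValue(Bytes.toBytes(e.getKey()), FAMILY,
              Bytes.toBytes("total"), e.getValue());
        }
        tempTable.close();
      }
    }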
On Sep 16, 2011, at 8:47 PM, Doug Meil <doug.m...@explorysmedical.com> wrote:

> Map-task heap size would definitely be a concern, but since the hashmap would only contain aggregations, ostensibly this map would be holding a far smaller number of rows than were passed into the mapper.
>
> At least that's how I'd use it.
>
> On 9/16/11 9:39 PM, "Sam Seigal" <selek...@yahoo.com> wrote:
>
>> Aren't there memory considerations with this approach? I would assume the HashMap can get pretty big if it retains in memory every record that passes through... (Apologies if I am being ignorant with my limited knowledge of Hadoop's internal workings...)
>>
>> On Fri, Sep 16, 2011 at 6:14 PM, Doug Meil <doug.m...@explorysmedical.com> wrote:
>>>
>>> However, if the aggregations in the mapper were kept in a HashMap (key being the aggregate, value being the count), and then the mapper made a single pass over this map during the cleanup method and then did the checkAndPuts, it would mean that the writes would only happen once per map-task, and not on a per-row basis (which would be really expensive).
>>>
>>> A single region on a single RS could handle that no problem.
>>>
>>> On 9/16/11 9:00 PM, "Sam Seigal" <selek...@yahoo.com> wrote:
>>>
>>>> I see what you are saying about the temp table being hosted at a single region server - especially for a limited set of rows that just care about the aggregations, but receive a lot of traffic. I wonder if this will also be the case if I were to use the source table to maintain these temporary records, and not create a temp table on the fly...
>>>>
>>>> On Fri, Sep 16, 2011 at 5:24 PM, Doug Meil <doug.m...@explorysmedical.com> wrote:
>>>>>
>>>>> I'll add this to the book in the MR section.
>>>>>
>>>>> On 9/16/11 8:22 PM, "Doug Meil" <doug.m...@explorysmedical.com> wrote:
>>>>>>
>>>>>> I was in the middle of responding to Mike's email when yours arrived, so I'll respond to both.
>>>>>>
>>>>>> I think the temp-table idea is interesting. The caution is that a default temp-table creation will be hosted on a single RS and thus be a bottleneck for aggregation. So I would imagine that you would need to tune the temp-table for the job and pre-create regions.
>>>>>>
>>>>>> Doug
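A sketch of Doug's "pre-create regions" caution in practice; the table name, family, and split points are hypothetical and would need tuning to the job's actual key distribution.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePreSplitTempTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("aggTemp");
        desc.addFamily(new HColumnDescriptor("f"));

        // Pre-split across the expected key range so the temp table starts
        // with several regions instead of one hot region on a single RS.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("d"), Bytes.toBytes("h"), Bytes.toBytes("m"),
            Bytes.toBytes("r"), Bytes.toBytes("w") };
        admin.createTable(desc, splits);
      }
    }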
>>>>>>
>>>>>> On 9/16/11 8:16 PM, "Sam Seigal" <selek...@yahoo.com> wrote:
>>>>>>
>>>>>>> I am trying to do something similar with HBase Map/Reduce.
>>>>>>>
>>>>>>> I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key and amount as the value. I want to be able to aggregate the amounts based on the event id type, and for this I am using a reducer. I basically key on the eventidtype from the incoming row in the map phase, and perform the aggregation in the reducer on the amounts for the event types. Then I write back the results into HBase.
>>>>>>>
>>>>>>> I hadn't thought about writing values directly into a temp HBase table in the map phase, as suggested by Mike.
>>>>>>>
>>>>>>> For this case, each mapper can declare its own mapperId_event_type row with totalAmount and, for each row it receives, do a get, add the current amount, and then a put. We are basically then doing a get/add/put for every row that a mapper receives. Is this any more efficient when compared to the overhead of sorting/partitioning for a reducer?
>>>>>>>
>>>>>>> At the end of the mapping phase, aggregating the output of all the mappers should be trivial.
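On Sam's get/add/put question: HBase can do that read-modify-write server-side in one atomic RPC, so the per-row cost is a single call rather than three. A sketch with hypothetical names (in a real mapper the table would be opened once in setup() rather than per call):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MapperSideTotals {
      // Equivalent to get + add + put, but atomic and a single round trip.
      static long addToTotal(Configuration conf, String mapperId,
                             String eventType, long amount) throws IOException {
        HTable table = new HTable(conf, "aggTemp");  // hypothetical temp table
        try {
          return table.incrementColumnValue(
              Bytes.toBytes(mapperId + "_" + eventType),  // Sam's row scheme
              Bytes.toBytes("f"), Bytes.toBytes("totalAmount"), amount);
        } finally {
          table.close();
        }
      }
    }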
>>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 1:24 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>
>>>>>>>> Doug and company...
>>>>>>>>
>>>>>>>> Look, I'm not saying that there aren't m/r jobs where you might need reducers when working w HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you created a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that it's true of other jobs.
>>>>>>>>
>>>>>>>> Remember that HBase is much more than just an HFile format to persist stuff.
>>>>>>>>
>>>>>>>> Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.
>>>>>>>>
>>>>>>>> Does that make sense?
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>>> From: doug.m...@explorysmedical.com
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> Date: Fri, 16 Sep 2011 15:41:44 -0400
>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>
>>>>>>>>> Chris, agreed... There are times when reducers aren't required, and then situations where they are useful. We have both kinds of jobs.
>>>>>>>>>
>>>>>>>>> For others following the thread, I updated the book recently with more MR examples (read-only, read-write, read-summary):
>>>>>>>>>
>>>>>>>>> http://hbase.apache.org/book.html#mapreduce.example
>>>>>>>>>
>>>>>>>>> As to the question that started this thread...
>>>>>>>>>
>>>>>>>>> re: "Store aggregated data in Oracle."
>>>>>>>>>
>>>>>>>>> To me, that sounds like the "read-summary" example with JDBC-Oracle in the reduce step.
>>>>>>>>>
>>>>>>>>> On 9/16/11 2:58 PM, "Chris Tarnas" <c...@email.com> wrote:
>>>>>>>>>
>>>>>>>>>> If only I could make NY in Nov :)
>>>>>>>>>>
>>>>>>>>>> We extract large numbers of DNA sequence reads from HBase, run them through M/R pipelines to analyze and aggregate, and then we load the results back in. Definitely specialized usage, but I could see other perfectly valid uses for reducers with HBase.
>>>>>>>>>>
>>>>>>>>>> -chris
>>>>>>>>>>
>>>>>>>>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>>>>>>>>>
>>>>>>>>>>> Sonal,
>>>>>>>>>>>
>>>>>>>>>>> You do realize that HBase is a "database", right? ;-)
>>>>>>>>>>>
>>>>>>>>>>> So again, why do you need a reducer? ;-)
>>>>>>>>>>>
>>>>>>>>>>> Using your example... "Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column."
>>>>>>>>>>>
>>>>>>>>>>> You can do this one of two ways...
>>>>>>>>>>> 1) Dynamic counters in Hadoop.
>>>>>>>>>>> 2) Use a temp table and auto-increment the value in a column which contains the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word.)
>>>>>>>>>>>
>>>>>>>>>>> I'm sorry, but if you go through all of your examples of why you would want to use a reducer, you end up finding out that writing to an HBase table would be faster than a reduce job. (Again, we haven't done an exhaustive search, but in all of the HBase jobs we've run... no reducers were necessary.)
>>>>>>>>>>>
>>>>>>>>>>> The point I'm trying to make is that you want to avoid using a reducer whenever possible, and if you think about your problem... you can probably come up with a solution that avoids the reducer...
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> -Mike
>>>>>>>>>>>
>>>>>>>>>>> PS. I haven't looked at *all* of the potential use cases of HBase, which is why I don't want to say you'll never need a reducer. I will say that, based on what we've done at my client's site, we try very hard to avoid reducers. [Note, I'm sure I'm going to get hammered on this when I head to NY in Nov. :-) ]
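A sketch of Mike's option 1: counting word occurrences with dynamic counters in a map-only job. The family/qualifier names are invented. One caveat: each distinct word becomes its own counter, which the framework tracks centrally, so this suits a bounded vocabulary.

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;

    public class WordCountingMapper extends TableMapper<NullWritable, NullWritable> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context) {
        String text = Bytes.toString(
            value.getValue(Bytes.toBytes("f"), Bytes.toBytes("doc")));
        if (text == null) {
          return;  // row has no doc column; nothing to count
        }
        for (String word : text.split("\\s+")) {
          // Dynamic counter per distinct word; the framework sums counts
          // across all map tasks, so no reducer is needed.
          context.getCounter("WordCount", word).increment(1);
        }
      }
    }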
>>>>>>>>>>>
>>>>>>>>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>> From: sonalgoy...@gmail.com
>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, thanks, I understand the fact that reducers can be expensive with all the shuffling and the sorting, and you may not need them always. At the same time, there are many cases where reducers are useful, like secondary sorting. In many cases, one can have multiple map phases and not have a reduce phase at all. Again, there will be many cases where one may want a reducer, say trying to count the occurrence of words in a particular column.
>>>>>>>>>>>>
>>>>>>>>>>>> With this thought chain, I do not feel ready to say that when dealing with HBase, I really don't want to use a reducer. Please correct me if I am wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks again.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Sonal
>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sonal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just because you have a m/r job doesn't mean that you need to reduce anything. You can have a job that contains only a mapper. Or your job runner can have a series of map jobs in serial.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most, if not all, of the map/reduce jobs where we pull data from HBase don't require a reducer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To give you a simple example... if I want to determine the table schema where I am storing some sort of structured data... I just write a m/r job which opens a table and scans it, counting the occurrence of each column name via dynamic counters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no need for a reducer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does that help?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>>>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>>>>>>>>> From: sonalgoy...@gmail.com
>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry, can you please help me understand what you mean when you say that when dealing with HBase, you really don't want to use a reducer? Here, HBase is being used as the input to the MR job.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think you need to get a little bit more information.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reducers are expensive.
>>>>>>>>>>>>>>> When Thomas says that he is aggregating data, what exactly does he mean? When dealing w HBase, you really don't want to use a reducer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You may want to run two map jobs, and it could be that just dumping the output via JDBC makes the most sense.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are starting to see a lot of questions where the OP isn't providing enough information, so the recommendation could be wrong...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db package, you could use that. Or you could write to HDFS and then use something like HIHO[1] to export to the db. I have been working extensively in this area, you can write to me directly if you need any help.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Sonal
>>>>>>>>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <thomas.steinmau...@scch.at> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We are writing an MR-Job to process HBase data and store aggregated data in Oracle. How would you do that in an MR-job?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Currently, for test purposes, we write the result into an HBase table again by using a TableReducer. Is there something like an OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should one simply use plain JDBC code in the reduce step?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thomas
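To close the loop on Thomas's original question: there is no dedicated OracleReducer, but the DBOutputFormat Sonal points to covers it. A minimal sketch, assuming a hypothetical Oracle table AGG_RESULTS(EVENT_TYPE, TOTAL); DBOutputFormat writes its keys to the database, so the reducer would emit an AggRecord key with a NullWritable value. The JDBC URL and credentials below are placeholders.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Hypothetical record the reducer emits as its output key.
    public class AggRecord implements Writable, DBWritable {
      private String eventType;
      private long total;

      public AggRecord() {}

      public AggRecord(String eventType, long total) {
        this.eventType = eventType;
        this.total = total;
      }

      // DBWritable: bind fields to the generated INSERT statement.
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, eventType);
        stmt.setLong(2, total);
      }
      public void readFields(ResultSet rs) throws SQLException {
        eventType = rs.getString(1);
        total = rs.getLong(2);
      }

      // Writable: shuffle serialization.
      public void write(DataOutput out) throws IOException {
        out.writeUTF(eventType);
        out.writeLong(total);
      }
      public void readFields(DataInput in) throws IOException {
        eventType = in.readUTF();
        total = in.readLong();
      }

      // Job wiring; setOutput also installs DBOutputFormat as the
      // job's output format class.
      public static void configure(Job job) throws IOException {
        DBConfiguration.configureDB(job.getConfiguration(),
            "oracle.jdbc.driver.OracleDriver",
            "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");
        DBOutputFormat.setOutput(job, "AGG_RESULTS", "EVENT_TYPE", "TOTAL");
      }
    }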