I found that the code for the old mapred interfaces still exists in this code base:
src/main/java/org/apache/hadoop/hbase/mapred/TableInputFormatBase.java

I'll adapt it for my needs. Thanks!

Avery

On Apr 9, 2011, at 9:55 AM, Jean-Daniel Cryans wrote:

> It's weird, I thought we already did something like that, and it seems
> that the old TableInputFormatBase does it but not the new one. From
> its javadoc:
>
> * Splits are created in number equal to the smallest between numSplits and
> * the number of {@link HRegion}s in the table. If the number of splits is
> * smaller than the number of {@link HRegion}s then splits are spanned across
> * multiple {@link HRegion}s and are grouped the most evenly possible. In the
> * case splits are uneven the bigger splits are placed first in the
> * {@link InputSplit} array.
>
> J-D
>
> On Sat, Apr 9, 2011 at 9:48 AM, Stack <st...@duboce.net> wrote:
>> Yes, you could make a different Splitter. It would be nice if the
>> splitter could keep the locality, where we have the Map task
>> running on the TaskTracker that is adjacent to the hosting
>> RegionServer. That shouldn't be hard. Study the current splitter and
>> see how it juggles locations.
>>
>> Can you put us in contact w/ the person running the cluster (offline
>> if you prefer)? 150k sounds like the regions need to be bigger.
>>
>> Thanks,
>> St.Ack
>>
>> On Sat, Apr 9, 2011 at 9:33 AM, Avery Ching <ach...@yahoo-inc.com> wrote:
>>> The number of regions is pretty insane, but it's not under my control,
>>> unfortunately. The workaround I suggested is to write another InputFormat
>>> and InputSplit such that each InputSplit is responsible for a configurable
>>> number of regions. For example, if I have 100k regions and I configure
>>> each InputSplit to handle 1k regions, then I'll only have 100 map tasks.
>>> I was just wondering if anyone else has faced these issues.
>>>
>>> Thanks for your quick response on a Saturday morning =),
>>>
>>> Avery
>>>
>>> On Apr 9, 2011, at 9:26 AM, Jean-Daniel Cryans wrote:
>>>
>>>> You cannot have more mappers than you have regions, but you can have
>>>> fewer. Try going that way.
>>>>
>>>> Also, 149,624 regions is insane; is that really the case? I don't think
>>>> I've ever seen such a large deploy, and it's probably bound to hit some
>>>> issues...
>>>>
>>>> J-D
>>>>
>>>> On Sat, Apr 9, 2011 at 9:15 AM, Avery Ching <ach...@yahoo-inc.com> wrote:
>>>>> Hi,
>>>>>
>>>>> First off, I'd like to say thanks to the developers for HBase; it's been
>>>>> fun to work with.
>>>>>
>>>>> I've been using TableInputFormat to run a Map-Reduce job and ran into an
>>>>> issue:
>>>>>
>>>>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
>>>>> java.io.IOException: java.io.IOException: The number of tasks for this
>>>>> job 149624 exceeds the configured limit 100000
>>>>>
>>>>> The table I'm accessing has 149,624 regions, but my Hadoop instance
>>>>> won't allow me to start a job with that many map tasks. After briefly
>>>>> looking at the TableInputFormatBase code, it appears that since
>>>>> TableSplit only knows about a single region, my job will be forced into
>>>>> having mappers == # of regions. Since the Hadoop instance I'm using is
>>>>> shared, I'm concerned that even if the configured limit were raised,
>>>>> jobs with so many mappers would eventually wreak havoc on the job tracker.
>>>>>
>>>>> Given that I have no control over the number of regions in the table
>>>>> (it's maintained by someone else), is the only solution to implement
>>>>> another input format (i.e. MultiRegionTableFormat) that allows InputSplits
>>>>> to have more than one region? I don't mind doing it, but I didn't want to
>>>>> write it if another solution already exists.
>>>>>
>>>>> Apologies if this issue has been raised before, but a quick search didn't
>>>>> turn anything up for me.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Avery
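The workaround discussed in the thread, packing many per-region splits into a configurable number of composite splits, can be sketched without the HBase API. This is an illustrative sketch only, not the actual TableInputFormatBase code: the class and method names (SplitGrouper, groupSplits) are hypothetical, and regions are represented by plain integer indices rather than TableSplit objects. The grouping follows the contract quoted from the old mapred javadoc: splits are as even as possible, with the bigger ones first.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGrouper {

    // Group numRegions per-region splits (modeled here as region indices
    // 0..numRegions-1) into at most maxGroups composite splits. Each group
    // gets either base or base+1 regions, and the larger groups come first.
    static List<List<Integer>> groupSplits(int numRegions, int maxGroups) {
        int groups = Math.min(numRegions, maxGroups);
        int base = numRegions / groups;   // regions in every group
        int extra = numRegions % groups;  // first `extra` groups get one more
        List<List<Integer>> result = new ArrayList<>();
        int next = 0;
        for (int g = 0; g < groups; g++) {
            int size = base + (g < extra ? 1 : 0);
            List<Integer> group = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                group.add(next++);
            }
            result.add(group);
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 regions into at most 3 splits -> group sizes 4, 3, 3
        for (List<Integer> group : groupSplits(10, 3)) {
            System.out.println(group.size() + " regions: " + group);
        }
    }
}
```

With this scheme, 100k regions capped at 1k regions per split yields 100 composite splits, matching the example in the thread; a real implementation would additionally carry each region's start/stop row and, as Stack suggests, could pick the group's most common region location to preserve locality.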