It's weird, I thought we already did something like that and it seems
that the old TableInputFormatBase does it but not the new one. From
its javadoc:

   * Splits are created in number equal to the smallest between numSplits and
   * the number of {@link HRegion}s in the table. If the number of splits is
   * smaller than the number of {@link HRegion}s then splits are spanned across
   * multiple {@link HRegion}s and are grouped the most evenly possible. In the
   * case splits are uneven the bigger splits are placed first in the
   * {@link InputSplit} array.
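
To make the grouping rule in that javadoc concrete, here is a minimal sketch of the arithmetic only. It is not the code from the old TableInputFormatBase; the class and method names (RegionGrouping, groupSizes, group) are hypothetical, regions are stood in for by plain strings, and locality is ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: illustrates the javadoc's grouping rule.
class RegionGrouping {

    // Number of regions each split would cover. The number of splits is
    // min(numSplits, numRegions); when the division is uneven, the larger
    // groups come first.
    static int[] groupSizes(int numRegions, int numSplits) {
        if (numRegions < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("numRegions must be >= 0 and numSplits > 0");
        }
        int groups = Math.min(numSplits, numRegions);
        int[] sizes = new int[groups];
        if (groups == 0) {
            return sizes;
        }
        int base = numRegions / groups;   // every group gets at least this many
        int extra = numRegions % groups;  // the first 'extra' groups get one more
        for (int i = 0; i < groups; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    // Cuts an ordered list of regions into contiguous groups of those sizes.
    static <T> List<List<T>> group(List<T> regions, int numSplits) {
        List<List<T>> result = new ArrayList<>();
        int start = 0;
        for (int size : groupSizes(regions.size(), numSplits)) {
            result.add(new ArrayList<>(regions.subList(start, start + size)));
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] sizes = groupSizes(149624, 1000);
        System.out.println(sizes.length + " splits; first holds " + sizes[0]
            + " regions, last holds " + sizes[sizes.length - 1]);
        System.out.println(group(Arrays.asList("r0", "r1", "r2", "r3", "r4"), 2));
    }
}
```

With 149,624 regions and 1,000 requested splits, this gives 624 splits of 150 regions followed by 376 splits of 149. A real splitter would also have to pick a location for each grouped split, which this sketch does not attempt.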

J-D

On Sat, Apr 9, 2011 at 9:48 AM, Stack <st...@duboce.net> wrote:
> Yes, you could make a different Splitter.  Would be nice in the
> splitter if you could keep the locality where we have the Map task
> running on the TaskTracker that is adjacent to the hosting
> RegionServer.  That shouldn't be hard.  Study the current splitter and
> see how it juggles locations.
>
> Can you put us in contact w/ the person running the cluster (offline
> if you prefer)?  150k sounds like regions need to be bigger.
>
> Thanks,
> St.Ack
>
> On Sat, Apr 9, 2011 at 9:33 AM, Avery Ching <ach...@yahoo-inc.com> wrote:
>> The number of regions is pretty insane, but not under my control 
>> unfortunately.  The workaround I suggested is to write another InputFormat 
>> and InputSplit such that each InputSplit is responsible for a configurable 
>> number of regions.  For example, if I have 100k regions and I configure each 
>> InputSplit to handle 1k regions, then I'll only have 100 map tasks.  I was 
>> just wondering if anyone else had faced these issues.
>>
>> Thanks for your quick response on a Saturday morning =),
>>
>> Avery
>>
>> On Apr 9, 2011, at 9:26 AM, Jean-Daniel Cryans wrote:
>>
>>> You cannot have more mappers than you have regions, but you can have
>>> fewer. Try going that way.
>>>
>>> Also, 149,624 regions is insane; is that really the case? I don't think
>>> I've ever seen such a large deployment and it's probably bound to hit some
>>> issues...
>>>
>>> J-D
>>>
>>> On Sat, Apr 9, 2011 at 9:15 AM, Avery Ching <ach...@yahoo-inc.com> wrote:
>>>> Hi,
>>>>
>>>> First off, I'd like to say thanks to the developers for HBase, it's been 
>>>> fun to work with.
>>>>
>>>> I've been using TableInputFormat to run a Map-Reduce job and ran into an 
>>>> issue.
>>>>
>>>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: 
>>>> java.io.IOException: java.io.IOException: The number of tasks for this job 
>>>> 149624 exceeds the configured limit 100000
>>>>
>>>> The table I'm accessing has 149624 regions, however my Hadoop instance 
>>>> won't allow me to start a job with that many map tasks.  After briefly 
>>>> looking at the TableInputFormatBase code, it appears that since TableSplit 
>>>> only knows about a single region, my job will be forced into having 
>>>> mappers == # of regions.  Since the Hadoop instance I'm using is shared, 
>>>> I'm concerned that even if the configured limit were raised, having jobs with 
>>>> so many mappers would eventually wreak havoc on the job tracker.
>>>>
>>>> Given that I have no control over the number of regions in the table 
>>>> (maintained by someone else), is the only solution to implement another 
>>>> input format (i.e. MultiRegionTableFormat) that allows InputSplits to have 
>>>> more than one region?  I don't mind doing it, but didn't want to write it 
>>>> if another solution already exists.
>>>>
>>>> Apologies if this issue has been raised before, but a quick search didn't 
>>>> turn anything up for me.
>>>>
>>>> Thanks,
>>>>
>>>> Avery
>>>>
>>
>>
>
