One tricky thing is that if the region size (default max 256MB) is larger than the HDFS block size (default 64MB), reads may still have to go over the network, because not every block of the region's files will have a replica on the local datanode.

Victor
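For anyone who wants to check this on their own cluster, here is a minimal sketch (not from the thread) that asks HDFS for the block locations of a single store file and counts how many blocks have a replica on the current host. The path is passed as an argument; the /hbase layout mentioned in the comment is only illustrative.

import java.net.InetAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoreFileLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Path to one region store file, e.g. something under /hbase/<table>/<region>/<family>/
    // (hypothetical argument: pass whichever file you want to inspect).
    Path storeFile = new Path(args[0]);
    FileStatus status = fs.getFileStatus(storeFile);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Hostname matching is approximate (short name vs. FQDN); good enough for a sanity check.
    String localHost = InetAddress.getLocalHost().getHostName();
    int local = 0;
    for (BlockLocation block : blocks) {
      for (String host : block.getHosts()) {
        if (host.equals(localHost)) {
          local++;
          break;
        }
      }
    }
    // A 256MB region over 64MB blocks spans roughly 4 blocks; only some of them
    // may have a replica on this host, so part of the read still crosses the network.
    System.out.printf("%d of %d blocks have a local replica on %s%n",
        local, blocks.length, localHost);
  }
}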
On Thu, Feb 18, 2010 at 12:22 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Bryan,
>
> What you are describing is already implemented, and from my experience >90% of my tasks run on the region server that has the mapped region.
>
> See o.a.h.h.mapreduce.TableSplit.getLocations()
>
> J-D
>
> On Wed, Feb 17, 2010 at 12:10 AM, Bryan McCormick <br...@readpath.com> wrote:
>
>> Quick question about data-local vs. rack-local tasks when running MapReduce jobs against HBase. I've just run a job against a table that was split into 1,645 tasks. The job page reports that 1,445 of those tasks were rack local compared to 200 that were data local. I take these counters to mean that most of the tasks ran on a server other than the relevant region server. Is it possible, or are there plans, to add logic to the scheduler to prefer running tasks on the same server as the region server?
>>
>> With HBase, is there a similar way to tell whether a region server has a local copy (on its datanode) of the files it needs to serve a region, instead of having to cross the network to get them?
>>
>> I know that when you're writing new data into a table and it splits, the default is for the first datanode copy to be local. But after a fairly large table has been brought up and down several times, with all of the regions being reassigned, is there logic when assigning regions to place them on a data-local server?
>>
>> Thanks,
>> Bryan
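To see the locality hint that TableSplit.getLocations() provides, a small sketch along the following lines can be dropped into any table-scanning job: the mapper's setup() casts its input split to TableSplit and prints the preferred hosts, which is what the scheduler uses when trying to place the task. The table name "mytable" and the no-op mapper are made-up placeholders; the job wiring just follows the usual TableMapReduceUtil.initTableMapperJob() pattern.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LocalityLoggingJob {

  static class LocalityMapper extends TableMapper<NullWritable, NullWritable> {

    @Override
    protected void setup(Context context) {
      // With TableInputFormat, each map task's input split is a TableSplit whose
      // locations name the host(s) the scheduler should prefer for this task
      // (normally the region server hosting the mapped region).
      TableSplit split = (TableSplit) context.getInputSplit();
      System.out.println("Preferred hosts for this split: "
          + Arrays.toString(split.getLocations()));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // No-op: this job only reports split locality.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(HBaseConfiguration.create(), "locality-check");
    job.setJarByClass(LocalityLoggingJob.class);
    // "mytable" is a placeholder; a full-table Scan gives one split per region.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", new Scan(), LocalityMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}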