Quick question about data-local vs. rack-local tasks when running
MapReduce jobs against HBase. I've just run a job against a table that
was split into 1,645 tasks. The job page reports that 1,445 of those
tasks were rack local compared to 200 that were data local. I'm taking
these counters to mean that most of the tasks ran on a server other
than the one hosting the relevant regionserver. Is it possible, or are
there plans, to add some logic to the scheduler so that it prefers to
run a task on the same server as the regionserver when possible?
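
For reference, here is how I'm reading those counters, as a minimal Python sketch. The function name is made up for illustration, and it assumes every task is counted as either data local or rack local, which matches the totals above (200 + 1,445 = 1,645).

```python
def locality_share(data_local, rack_local):
    """Return the fraction of map tasks in each locality class.

    Assumes the two counters together account for every task in the job.
    """
    total = data_local + rack_local
    if total == 0:
        raise ValueError("no tasks counted")
    return {
        "data_local": data_local / total,
        "rack_local": rack_local / total,
    }


# Counter values taken from the job page described above.
share = locality_share(data_local=200, rack_local=1445)
print(f"data local: {share['data_local']:.1%}")  # data local: 12.2%
print(f"rack local: {share['rack_local']:.1%}")  # rack local: 87.8%
```

So, if I'm reading the counters right, only about one task in eight ran on the node that held its data.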
With HBase, is there a similar way to tell whether a region on a
regionserver has a copy of the files it needs to serve that region on
a local datanode, instead of having to cross the network to get them?
I know that when you're writing new data into a table and it splits,
the default is for the first replica to be written to the local
datanode. But after a fairly large table has been brought up and down
several times, with all of the regions being reassigned, is there any
logic during region assignment to put them on a data-local server?
Thanks,
Bryan