Geoffrey Jacoby created PHOENIX-5313:
----------------------------------------
Summary: All mappers grab all RegionLocations from .META
Key: PHOENIX-5313
URL: https://issues.apache.org/jira/browse/PHOENIX-5313
Project: Phoenix
Issue Type: Bug
Reporter: Geoffrey Jacoby
Phoenix's MapReduce integration lives in PhoenixInputFormat. It implements
getSplits by calculating a QueryPlan for the provided SELECT query, and each
split gets a mapper. As part of this QueryPlan generation, we grab all
RegionLocations from .META.
In PhoenixInputFormat.getQueryPlan():
{code:java}
// Initialize the query plan so it sets up the parallel scans
queryPlan.iterator(MapReduceParallelScanGrouper.getInstance());
{code}
In MapReduceParallelScanGrouper.getRegionBoundaries():
{code:java}
return context.getConnection().getQueryServices().getAllTableRegions(tableName);
{code}
This is fine for getSplits, which runs once per job on the client.
Unfortunately, each mapper Task spawned by the job goes through this _same_
exercise when creating its RecordReader. Since HBase 1.x and up removed
.META prefetching and caching from the HBase client, not only does each _Job_
make potentially thousands of calls to .META, but potentially thousands of
_Tasks_ do the same.
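To get a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes that reading all region locations costs one .META lookup per region and that the job gets one split (and so one mapper) per region. Both are simplifying assumptions, and the class and method names are made up for illustration.

```java
public class MetaLookupEstimate {
    // Assumption: fetching all region locations costs one .META lookup per region.
    static long lookupsForJob(long regions, long tasks) {
        long clientLookups = regions;        // getSplits on the client, once per job
        long taskLookups = tasks * regions;  // every createRecordReader repeats the scan
        return clientLookups + taskLookups;
    }

    public static void main(String[] args) {
        long regions = 2_000;
        long tasks = 2_000; // assumption: one split, hence one mapper, per region
        // prints 4002000
        System.out.println(lookupsForJob(regions, tasks));
    }
}
```

Under these assumptions the per-task lookups grow quadratically with the region count, while the client-side cost stays linear.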
createRecordReader should obtain a QueryPlan without having to read all
RegionLocations, either by using the split's own knowledge of its key range,
or by serializing the query plan on the client and sending it to the mapper
tasks for use there.
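A minimal sketch of the first option, in plain Java with no Phoenix or Hadoop dependencies: the client writes only the split's start and stop row keys, and the task reads them back to bound its scan, without consulting .META for the rest of the table. The KeyRange class and the serialize/deserialize helpers are hypothetical names for illustration, not existing Phoenix APIs, and the wire format shown is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class KeyRangeSplitSketch {
    /** Hypothetical split payload: only the row-key range this task should scan. */
    static final class KeyRange {
        final byte[] startRow; // inclusive
        final byte[] stopRow;  // exclusive

        KeyRange(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /** Client side: length-prefix each key so the task can read it back. */
    static byte[] serialize(KeyRange range) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(range.startRow.length);
            out.write(range.startRow);
            out.writeInt(range.stopRow.length);
            out.write(range.stopRow);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Task side: rebuild the range from the split bytes alone. */
    static KeyRange deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] start = new byte[in.readInt()];
            in.readFully(start);
            byte[] stop = new byte[in.readInt()];
            in.readFully(stop);
            return new KeyRange(start, stop);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        KeyRange original = new KeyRange(new byte[] {0x10}, new byte[] {0x20});
        KeyRange restored = deserialize(serialize(original));
        // prints true
        System.out.println(Arrays.equals(original.startRow, restored.startRow)
                && Arrays.equals(original.stopRow, restored.stopRow));
    }
}
```

The second option, shipping the whole query plan, would carry more state than a key range, but the same round-trip idea applies: whatever the task needs travels with the split instead of being recomputed from .META.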
Note that MapReduce tasks over snapshots are not affected by this, because
region locations are stored in the snapshot manifest.