Github user JamesRTaylor commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/8#discussion_r16794301
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/iterate/DefaultParallelIteratorRegionSplitter.java ---
    @@ -138,14 +146,10 @@ public boolean apply(HRegionLocation location) {
             //    split each region in s splits such that:
             //    s = max(x) where s * x < t
             //
    -        // The idea is to align splits with region boundaries. If rows are not evenly
    -        // distributed across regions, using this scheme compensates for regions that
    -        // have more rows than others, by applying tighter splits and therefore spawning
    -        // off more scans over the overloaded regions.
    -        int splitsPerRegion = getSplitsPerRegion(regions.size());
             // Create a multi-map of ServerName to List<KeyRange> which we'll use to round robin from to ensure
             // that we keep each region server busy for each query.
    -        ListMultimap<HRegionLocation,KeyRange> keyRangesPerRegion = ArrayListMultimap.create(regions.size(),regions.size() * splitsPerRegion);;
    +        int splitsPerRegion = getSplitsPerRegion(regions.size());
    +        ListMultimap<HRegionLocation,KeyRange> keyRangesPerRegion = ArrayListMultimap.create(regions.size(),regions.size() * splitsPerRegion);
             if (splitsPerRegion == 1) {
                 for (HRegionLocation region : regions) {
    --- End diff --
    
    Here's what I think we should do here:
    - Store guideposts per column family. It's probably easiest if the PK has the following form: <cf varchar not null><guidepost varbinary null>. I'm not sure there's any value in using a VARBINARY ARRAY. We should just make sure that we can easily delete the old guideposts and add the new ones.
    - Here, you'd still loop through the regions as above, but you'd also fetch all guideposts for the column families involved in the query. Take the simple case where there's only one column family: you'd intersect the region boundaries with the guideposts, which is a bit easier if the guideposts are already sorted. The resulting set of intersections is what gets returned here.
    - For the multi-column-family case, I think we want to do the same processing per column family and then coalesce any overlapping ranges.
    - We already have the intersect and coalesce methods you'll need in our KeyRange class, so the code should be relatively small.
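
    To make the single-column-family case concrete, here is a minimal sketch of splitting region boundaries at sorted guideposts. It is not Phoenix code: GuidepostSplitSketch, Range, strictlyInside and split are hypothetical names, keys are modeled as Strings rather than byte arrays, and an empty string stands in for an unbounded region edge. The real change would presumably go through KeyRange instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the region/guidepost intersection described above.
final class GuidepostSplitSketch {

    /** Half-open key range [start, end); an empty string means "unbounded" on that side. */
    static final class Range {
        final String start;
        final String end;

        Range(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** True if key lies strictly between the range's bounds, so a split there leaves two non-empty pieces. */
    static boolean strictlyInside(Range r, String key) {
        boolean afterStart = r.start.isEmpty() || key.compareTo(r.start) > 0;
        boolean beforeEnd = r.end.isEmpty() || key.compareTo(r.end) < 0;
        return afterStart && beforeEnd;
    }

    /**
     * Splits every region at each guidepost that falls inside it.
     * Assumes the guideposts are non-empty keys sorted in ascending order.
     */
    static List<Range> split(List<Range> regions, List<String> sortedGuideposts) {
        List<Range> result = new ArrayList<>();
        for (Range region : regions) {
            String lower = region.start;
            for (String guidepost : sortedGuideposts) {
                // Skip duplicates so we never emit an empty range.
                if (strictlyInside(region, guidepost) && !guidepost.equals(lower)) {
                    result.add(new Range(lower, guidepost));
                    lower = guidepost;
                }
            }
            // The tail of the region after the last guidepost inside it.
            result.add(new Range(lower, region.end));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Range> regions = Arrays.asList(new Range("", "m"), new Range("m", ""));
        List<String> guideposts = Arrays.asList("c", "h", "m", "t");
        System.out.println(split(regions, guideposts));
    }
}
```

    A guidepost that coincides with a region boundary ("m" in the example) produces no extra split, since the region boundary already separates the ranges there. The multi-column-family case is not shown; it would run the same pass per column family and then merge the results.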


