Bin Shi created PHOENIX-4912:
--------------------------------

             Summary: Make Table Sampling algorithm to accommodate to the 
imbalance row distribution across guide posts
                 Key: PHOENIX-4912
                 URL: https://issues.apache.org/jira/browse/PHOENIX-4912
             Project: Phoenix
          Issue Type: Improvement
    Affects Versions: 5.0.0
            Reporter: Bin Shi
            Assignee: Bin Shi


The current implementation of table sampling is based on the assumption "Every 
two consecutive guide posts contains the equal number of rows" which isn't 
accurate in practice, and once we collect multiple versions of cells and the 
deleted rows, the thing will become worse.

In details, the current implementation of table sampling is (see 
BaseResultIterators.getParallelScan() which calls sampleScans(...) at the end 
of function) as described below:
 # Iterate all parallel scans generated;
 # For each scan, if getHashHode(start row key of the scan) MOD 100 < 
tableSamplingRate (See TableSamplerPredicate.java) then pick this scan; 
otherwise discard this scan.

The problem can be formalized as: We have a group of scans and each scan is 
defined as <the start row key denoted as Ki, the count of rows denoted as Ci>. 
Now we want to randomly pick X groups so that the sum of count of rows in the 
selected groups is close to Y, where Y = the total count of rows of all scans T 
* table sampling rate R.

To resolve the above problem, one of algorithms that we can consider are 
described below:

ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R)

{  

    ArrayList<Scan> pickedScans = new ArrayList<Scan>();

    Y = T * R;

    for (scan<Ki, Ci> in scans) {

        if (Y <= 0) break;

        if (getHashCode(Ki) MOD 100 < R) {

            // then pick this scan, and adjust T, R, Y accordingly

            pickedScans.Add(scan);

            T -= Ci;

            Y -= Ci;

            if (T != 0 && Y > 0) {

                  R = Y / T;

            }

        }

    }

    return pickedScans;

}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to