[ 
https://issues.apache.org/jira/browse/PHOENIX-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan Mehta updated PHOENIX-4912:
---------------------------------
    Affects Version/s: 4.15.0

> Make Table Sampling algorithm to accommodate to the imbalance row 
> distribution across guide posts
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-4912
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4912
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.15.0
>            Reporter: Bin Shi
>            Assignee: Bin Shi
>            Priority: Major
>
> The current implementation of table sampling is based on the assumption 
> "Every two consecutive guide posts contains the equal number of rows" which 
> isn't accurate in practice, and once we collect multiple versions of cells 
> and the deleted rows, the thing will become worse.
> In details, the current implementation of table sampling is (see 
> BaseResultIterators.getParallelScan() which calls sampleScans(...) at the end 
> of function) as described below:
>  # Iterate all parallel scans generated;
>  # For each scan, if getHashHode(start row key of the scan) MOD 100 < 
> tableSamplingRate (See TableSamplerPredicate.java) then pick this scan; 
> otherwise discard this scan.
> The problem can be formalized as: We have a group of scans and each scan is 
> defined as <the start row key denoted as Ki, the count of rows denoted as 
> Ci>. Now we want to randomly pick X groups so that the sum of count of rows 
> in the selected groups is close to Y, where Y = the total count of rows of 
> all scans T * table sampling rate R.
> To resolve the above problem, one of algorithms that we can consider are 
> described below:
> ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R)
> {  
>     ArrayList<Scan> pickedScans = new ArrayList<Scan>();
>     Y = T * R;
>     for (scan<Ki, Ci> in scans) {
>         if (Y <= 0) break;
>         if (getHashCode(Ki) MOD 100 < R) {
>             // then pick this scan, and adjust T, R, Y accordingly
>             pickedScans.Add(scan);
>             T -= Ci;
>             Y -= Ci;
>             if (T != 0 && Y > 0) {
>                 R = Y / T;
>             }
>         }
>     }
>     return pickedScans;
> }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to