[ https://issues.apache.org/jira/browse/PHOENIX-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629258#comment-16629258 ]

Bin Shi edited comment on PHOENIX-4912 at 9/26/18 6:48 PM:
-----------------------------------------------------------

With the server side approach, the time complexity will be O(n) instead of 
O(m), where n is the number of rows to scan and m is the number of scans, and n 
>> m; the server has to examine every row in the picked ranges, while the 
client side approach only makes a per-scan decision. A table sampling query 
could therefore be much heavier than with the client side approach.

Microsoft Cloud (ADLA) uses the client side approach: it applies the Reservoir 
Sampling algorithm at the extent level (the equivalent of a Block on HDFS, 128 
or 256 MB by default). I believe the sampling queries from Pig and Hive on top 
of HDFS also work at the HDFS Block level.
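
For reference, here is a minimal sketch of reservoir sampling (Algorithm R) over a stream of blocks; the generic method below is my own illustration of the technique, not ADLA's actual code:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSketch {
    // Algorithm R: pick k items uniformly at random from a stream whose
    // length is unknown up front. After i items have been seen, every item
    // remains in the reservoir with equal probability k / i.
    static <T> List<T> reservoirSample(Iterable<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        int i = 0;
        for (T item : stream) {
            i++;
            if (reservoir.size() < k) {
                reservoir.add(item);        // fill the reservoir first
            } else {
                int j = rnd.nextInt(i);     // uniform index in [0, i)
                if (j < k) {
                    reservoir.set(j, item); // replace with probability k / i
                }
            }
        }
        return reservoir;
    }
}
{code}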

I second the current table sampling algorithm based on Scans/GuidePosts. 


> Make Table Sampling algorithm to accommodate to the imbalance row 
> distribution across guide posts
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-4912
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4912
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.15.0
>            Reporter: Bin Shi
>            Assignee: Bin Shi
>            Priority: Major
>
> The current implementation of table sampling is based on the assumption that 
> every two consecutive guide posts contain an equal number of rows, which isn't 
> accurate in practice; once we collect multiple versions of cells and deleted 
> rows, the situation gets even worse.
> In detail, the current implementation of table sampling (see 
> BaseResultIterators.getParallelScan(), which calls sampleScans(...) at the end 
> of the function) works as described below:
>  # Iterate over all parallel scans generated;
>  # For each scan, if getHashCode(start row key of the scan) MOD 100 < 
> tableSamplingRate (see TableSamplerPredicate.java), then pick this scan; 
> otherwise discard it (a sketch of this predicate follows this list).
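> A minimal sketch of that pick-or-discard predicate (the hashing here is only 
> illustrative; TableSamplerPredicate.java holds the actual implementation):
> {code:java}
> import java.util.Arrays;
>
> public class TableSamplerSketch {
>     // Keep or discard a scan based only on its start row key, so the same
>     // key always yields the same decision for a given sampling rate.
>     static boolean isPicked(byte[] startRowKey, double tableSamplingRate) {
>         // floorMod keeps the bucket non-negative even for negative hashes.
>         int bucket = Math.floorMod(Arrays.hashCode(startRowKey), 100);
>         return bucket < tableSamplingRate;
>     }
> }
> {code}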
> The problem can be formalized as: we have a group of scans, and each scan is 
> defined as <Ki, Ci>, where Ki denotes the scan's start row key and Ci denotes 
> its count of rows. We want to randomly pick X scans so that the sum of the row 
> counts of the selected scans is close to Y = T * R / 100.00, where T denotes 
> the total count of rows across all scans and R denotes the table sampling rate 
> (0 <= R <= 100). For example, with T = 1,000,000 rows and R = 10, the picked 
> scans should together cover about Y = 100,000 rows.
> To solve the above problem, one algorithm we can consider is described below. 
> The core idea is to adjust T, R and Y after each pick, so that the remaining 
> problem is a smaller instance of the original problem.
> {code:java}
> ArrayList<Scan> tableSampling(ArrayList<Scan> scans, long T, double R) {
>     ArrayList<Scan> pickedScans = new ArrayList<Scan>();
>     double Y = T * R / 100.00;        // target count of rows to cover
>     for (Scan scan : scans) {         // each scan carries <Ki, Ci>
>         if (Y <= 0) break;            // target reached, stop picking
>         long Ci = getRowCount(scan);  // Ci, estimated from guide posts
>         if (getHashCode(scan.getStartRow()) % 100 < R) {
>             // pick this scan, and adjust T, R, Y accordingly
>             pickedScans.add(scan);
>             T -= Ci;
>             Y -= Ci;
>             if (T != 0 && Y > 0) {
>                 R = 100.00 * Y / T;   // re-target the remaining scans
>             }
>         }
>     }
>     return pickedScans;
> }
> {code}
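> As a self-contained illustration of the re-adjustment (the ScanInfo record and 
> the row counts below are hypothetical, chosen only to make the sketch 
> runnable): with T = 1500 and R = 20, Y starts at 300; picking a 100-row scan 
> drops Y to 200 and re-adjusts R to 100.00 * 200 / 1400, roughly 14.3.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class TableSamplingDemo {
>     // Hypothetical stand-in for a parallel scan: Ki as a precomputed
>     // start-key hash, Ci as the row count between its guide posts.
>     record ScanInfo(int ki, long ci) {}
>
>     static List<ScanInfo> tableSampling(List<ScanInfo> scans, long T, double R) {
>         List<ScanInfo> picked = new ArrayList<>();
>         double Y = T * R / 100.00;
>         for (ScanInfo s : scans) {
>             if (Y <= 0) break;
>             if (Math.floorMod(s.ki(), 100) < R) {
>                 picked.add(s);
>                 T -= s.ci();
>                 Y -= s.ci();
>                 if (T != 0 && Y > 0) {
>                     R = 100.00 * Y / T; // 20 -> ~14.3 after the first pick
>                 }
>             }
>         }
>         return picked;
>     }
>
>     public static void main(String[] args) {
>         List<ScanInfo> scans = List.of(
>                 new ScanInfo(5, 100), new ScanInfo(42, 900), new ScanInfo(7, 500));
>         // T = 1500 rows in total, R = 20 -> initial target Y = 300 rows.
>         System.out.println(tableSampling(scans, 1500, 20.0).size() + " scan(s) picked");
>     }
> }
> {code}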


