[ https://issues.apache.org/jira/browse/PHOENIX-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karan Mehta updated PHOENIX-4912:
---------------------------------
    Affects Version/s: 4.15.0

> Make Table Sampling algorithm to accommodate to the imbalance row
> distribution across guide posts
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-4912
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4912
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.15.0
>            Reporter: Bin Shi
>            Assignee: Bin Shi
>            Priority: Major
>
> The current implementation of table sampling is based on the assumption
> "every pair of consecutive guide posts contains the same number of rows",
> which does not hold in practice. It gets worse once multiple versions of
> cells and deleted rows are collected.
> In detail, the current implementation of table sampling (see
> BaseResultIterators.getParallelScan(), which calls sampleScans(...) at the end
> of the function) works as follows:
> # Iterate over all generated parallel scans;
> # For each scan, if getHashCode(start row key of the scan) MOD 100 <
> tableSamplingRate (see TableSamplerPredicate.java), pick this scan;
> otherwise discard it.
> The problem can be formalized as follows: we have a group of scans, and each
> scan is defined as <the start row key, denoted Ki; the count of rows, denoted
> Ci>. We want to randomly pick X scans so that the sum of the row counts of
> the selected scans is close to Y, where Y = T * R, T is the total count of
> rows of all scans, and R is the table sampling rate.
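For illustration only, the fixed-rate rule and the target Y from the formalization above can be sketched in Java as follows. ScanChunk, bucket, sampleFixedRate and totalRows are hypothetical names, not Phoenix APIs, and String.hashCode is only a stand-in for whatever hash TableSamplerPredicate actually uses. The rate is assumed to be a percentage, matching the MOD 100 comparison.

```java
import java.util.ArrayList;
import java.util.List;

class FixedRateSamplingSketch {
    /** Hypothetical stand-in for a parallel scan: start row key Ki plus row count Ci. */
    static final class ScanChunk {
        final String startKey;
        final long rowCount;

        ScanChunk(String startKey, long rowCount) {
            this.startKey = startKey;
            this.rowCount = rowCount;
        }
    }

    /** Maps a start key to a bucket in [0, 100). Assumption: String.hashCode as the hash. */
    static int bucket(String startKey) {
        return Math.floorMod(startKey.hashCode(), 100);
    }

    /** Fixed-rate rule: keep a scan when its bucket is below the rate; row counts are ignored. */
    static List<ScanChunk> sampleFixedRate(List<ScanChunk> scans, double ratePercent) {
        List<ScanChunk> picked = new ArrayList<>();
        for (ScanChunk scan : scans) {
            if (bucket(scan.startKey) < ratePercent) {
                picked.add(scan);
            }
        }
        return picked;
    }

    /** Sums the row counts Ci of the given scans. */
    static long totalRows(List<ScanChunk> scans) {
        long total = 0;
        for (ScanChunk scan : scans) {
            total += scan.rowCount;
        }
        return total;
    }

    public static void main(String[] args) {
        // Deliberately skewed row counts between guide posts.
        List<ScanChunk> scans = new ArrayList<>();
        scans.add(new ScanChunk("row-000", 10));
        scans.add(new ScanChunk("row-100", 5000));
        scans.add(new ScanChunk("row-200", 20));
        scans.add(new ScanChunk("row-300", 15));

        double ratePercent = 50;
        long target = Math.round(totalRows(scans) * ratePercent / 100.0); // Y = T * R
        long sampled = totalRows(sampleFixedRate(scans, ratePercent));
        System.out.println("target rows Y = " + target + ", sampled rows = " + sampled);
    }
}
```

Because the rule looks only at the start key, whether the 5000-row scan is picked decides almost the whole sampled row count, so the result can land far from Y in either direction.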
> To resolve the above problem, one algorithm we can consider is described
> below:
> ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R)
> {
>     ArrayList<Scan> pickedScans = new ArrayList<Scan>();
>     Y = T * R;    // R is a fraction in [0, 1]
>     for (scan<Ki, Ci> in scans) {
>         if (Y <= 0) break;
>         // scale R to a percentage before comparing with MOD 100
>         if (getHashCode(Ki) MOD 100 < R * 100) {
>             // then pick this scan, and adjust T, R, Y accordingly
>             pickedScans.add(scan);
>             T -= Ci;
>             Y -= Ci;
>             if (T != 0 && Y > 0) {
>                 R = Y / T;
>             }
>         }
>     }
>     return pickedScans;
> }

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
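As a rough, non-authoritative rendering of the pseudocode in the issue description, the adaptive variant might look like this in Java. ScanChunk, bucket and tableSampling are hypothetical names rather than Phoenix classes, String.hashCode again stands in for the real hash, and the rate is assumed to be a fraction in [0, 1] so that Y = T * R holds directly. The sketch follows the pseudocode literally, including the fact that T shrinks only when a scan is picked.

```java
import java.util.ArrayList;
import java.util.List;

class AdaptiveSamplingSketch {
    /** Hypothetical stand-in for a parallel scan: start row key Ki plus row count Ci. */
    static final class ScanChunk {
        final String startKey;
        final long rowCount;

        ScanChunk(String startKey, long rowCount) {
            this.startKey = startKey;
            this.rowCount = rowCount;
        }
    }

    /** Maps a start key to a bucket in [0, 100). Assumption: String.hashCode as the hash. */
    static int bucket(String startKey) {
        return Math.floorMod(startKey.hashCode(), 100);
    }

    /**
     * Picks scans until the remaining target Y is used up, recomputing the
     * rate R = Y / T after every pick. totalRows is T, rate is R as a fraction.
     */
    static List<ScanChunk> tableSampling(List<ScanChunk> scans, long totalRows, double rate) {
        List<ScanChunk> picked = new ArrayList<>();
        long remainingRows = totalRows;            // T
        double remainingTarget = totalRows * rate; // Y
        for (ScanChunk scan : scans) {
            if (remainingTarget <= 0) {
                break; // target reached, stop picking
            }
            if (bucket(scan.startKey) < rate * 100) {
                picked.add(scan);
                remainingRows -= scan.rowCount;
                remainingTarget -= scan.rowCount;
                if (remainingRows != 0 && remainingTarget > 0) {
                    rate = remainingTarget / remainingRows;
                }
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<ScanChunk> scans = new ArrayList<>();
        scans.add(new ScanChunk("row-000", 10));
        scans.add(new ScanChunk("row-100", 5000));
        scans.add(new ScanChunk("row-200", 20));
        scans.add(new ScanChunk("row-300", 15));

        long total = 0;
        for (ScanChunk scan : scans) {
            total += scan.rowCount;
        }
        long sampled = 0;
        for (ScanChunk scan : tableSampling(scans, total, 0.5)) {
            sampled += scan.rowCount;
        }
        System.out.println("target rows Y = " + (total * 0.5) + ", sampled rows = " + sampled);
    }
}
```

Once a large scan has been picked, Y drops sharply and later scans become unlikely or impossible to pick, which is what pulls the sampled row count toward the target.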