[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-09-03 Thread hanzhi (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151751#comment-16151751
 ] 

hanzhi commented on PHOENIX-153:


Awesome!!

> Implement TABLESAMPLE clause
> 
>
> Key: PHOENIX-153
> URL: https://issues.apache.org/jira/browse/PHOENIX-153
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: enhancement
> Fix For: 4.12.0
>
> Attachments: Sampling_Accuracy_Performance.jpg
>
>
> Support the standard SQL TABLESAMPLE clause by implementing a filter that 
> uses a skip next hint based on the region boundaries of the table to only 
> return n rows per region.
> [Update]
> Source Code Patch: 
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-24 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141232#comment-16141232
 ] 

Lars Hofhansl commented on PHOENIX-153:
---

Nice job [~aertoria]!



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-01 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110284#comment-16110284
 ] 

Ethan Wang commented on PHOENIX-153:


Thanks [~jamestaylor]!



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109722#comment-16109722
 ] 

Hudson commented on PHOENIX-153:


FAILURE: Integrated in Jenkins build Phoenix-master #1725 (See 
[https://builds.apache.org/job/Phoenix-master/1725/])
PHOENIX-153 Implement TABLESAMPLE clause (Ethan Wang) (jamestaylor: rev 
5e33dc12bc088bd0008d89f0a5cd7d5c368efa25)
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/parse/ParseNodeFactory.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/optimize/QueryOptimizer.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/parse/FilterableStatement.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/iterate/TableSamplerPredicate.java
* (add) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/QueryWithTableSampleIT.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/parse/SelectStatement.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/parse/DeleteStatement.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/parse/ConcreteTableNode.java
* (edit) phoenix-core/src/main/antlr3/PhoenixSQL.g
* (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/NamedTableNode.java




[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-01 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109519#comment-16109519
 ] 

James Taylor commented on PHOENIX-153:
--

+1. Nice work, [~aertoria]. Will let you know if patch doesn't apply cleanly to 
other 4.x branches.



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109342#comment-16109342
 ] 

ASF GitHub Bot commented on PHOENIX-153:


Github user aertoria commented on the issue:

https://github.com/apache/phoenix/pull/262
  
Commit 0507d4f change list:
1, Changed the explain plan output as suggested (thanks for the suggestion)
2, Squashed the previous four commits into one
3, Revised all commit messages to start with "PHOENIX-153" followed by a space


Preview on a single select:
`CLIENT 3-CHUNK 30 ROWS 2370 BYTES PARALLEL 1-WAY 0.2-SAMPLED ROUND ROBIN FULL SCAN OVER PERSON`

On a join select:
```
CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY 0.65-SAMPLED ROUND ROBIN FULL SCAN OVER INX_ADDRESS_PERSON
SERVER FILTER BY FIRST KEY ONLY
PARALLEL INNER-JOIN TABLE 0
CLIENT 1-CHUNK 1 ROWS 32 BYTES PARALLEL 1-WAY 0.15-SAMPLED ROUND ROBIN FULL SCAN OVER US_POPULATION
DYNAMIC SERVER FILTER BY TO_CHAR("INX_ADDRESS_PERSON.0:ADDRESS") IN (US_POPULATION.STATE)
```






[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-07-31 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108408#comment-16108408
 ] 

Ethan Wang commented on PHOENIX-153:


Makes sense. Thanks [~jamestaylor]



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-07-31 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108390#comment-16108390
 ] 

James Taylor commented on PHOENIX-153:
--

Seems like review comments aren't appearing here in JIRA (maybe because your 
commit message doesn't include the JIRA number in the expected format), so I'll 
repeat it here:

Let's move the explain for the sampling into the first line, before we recurse 
down for the other steps. You can put it on the same line, after the "-WAY " 
like this:

CLIENT PARALLEL 1-WAY 0.48-SAMPLED ...

Otherwise, users will interpret the sampling as happening after the 
scan/filtering which isn't the case.



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-07-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108354#comment-16108354
 ] 

ASF GitHub Bot commented on PHOENIX-153:


Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/262
  
Ping @aertoria - would you have a few spare cycles to make that last 
change? Also, please squash all commits into one and amend your commit message 
to be prefixed with PHOENIX-153 (i.e. include the dash). Otherwise, the pull 
request isn't tied to the JIRA.




[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-07-01 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071041#comment-16071041
 ] 

Lars Hofhansl commented on PHOENIX-153:
---

The default guidepost width is 300MB. Maybe we could go down to 10MB, once we 
have guidepost combining.
Less than that will be a huge management burden to the system.

Still a good thing to do! On small tables you do not need to sample in the 
first place, and for large tables - where it matters - we'll have sufficiently 
many guide posts. (A 1TB table has over 3000 300MB guideposts, i.e. you'll have 
a resolution of 0.03%, which is plenty good!)
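
The arithmetic behind those figures can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 300 MB = 3 x 10^8 bytes); the function name `sampling_resolution` is made up for illustration and is not part of Phoenix.

```python
# Back-of-the-envelope check of the guidepost numbers quoted above.
# Assumption: decimal units; the helper name is hypothetical.

TABLE_BYTES = 10**12              # 1 TB table
DEFAULT_WIDTH = 300 * 10**6       # 300 MB guidepost width
NARROW_WIDTH = 10 * 10**6         # 10 MB guidepost width


def sampling_resolution(table_bytes, guidepost_width):
    """Return (guidepost count, smallest sampling step as a fraction of the table)."""
    guideposts = table_bytes // guidepost_width
    return guideposts, 1.0 / guideposts


for width in (DEFAULT_WIDTH, NARROW_WIDTH):
    count, step = sampling_resolution(TABLE_BYTES, width)
    print(f"width={width} bytes: {count} guideposts, resolution {step:.3%}")
```

With a 300 MB width this gives a little over 3000 guideposts, in line with the "0.03%" resolution mentioned above; a 10 MB width would make the step 30 times finer.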




[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-25 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062546#comment-16062546
 ] 

Ethan Wang commented on PHOENIX-153:


Valid point. 

In addition, by design, this coarseness problem is magnified in the following 
three situations (and reduced in the opposite ones):
1, The table is too small
2, The guidepost width is set too wide, or no stats are collected at all
3, The user specifies not to use the stats table for parallelization 

Based on testing on a table with 400K rows and GUIDE_POSTS_WIDTH = 10KB or 
200KB, the sampled size was usually within ±5% of the expected size. Accuracy 
improves as the GuidePosts used become more granular (detailed chart attached).
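
One way to see why finer guideposts help is a simple binomial model. This is only a sketch under stated assumptions, not the measurement behind the attached chart: it assumes every guidepost covers the same number of rows and that each guidepost is kept independently with probability equal to the sampling rate. The name `relative_std` is hypothetical.

```python
import math


def relative_std(rate, guideposts):
    """Relative standard deviation of the sampled size.

    Model (an assumption): the number of kept guideposts is Binomial(n, p),
    so its mean is n*p and its standard deviation is sqrt(n*p*(1-p)).
    Dividing the two gives sqrt((1-p) / (p*n)).
    """
    return math.sqrt((1.0 - rate) / (rate * guideposts))


# The spread of the sample size shrinks as the guidepost count grows.
for n in (10, 100, 1000, 10000):
    print(n, f"{relative_std(0.5, n):.1%}")
```

Under this model the relative spread falls with the square root of the number of guideposts, which matches the qualitative observation above that more granular guideposts give a sample size closer to the expected one.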



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-24 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062143#comment-16062143
 ] 

Lars Hofhansl commented on PHOENIX-153:
---

Good idea. Skipping whole guideposts is pretty coarse, though.
At the same time I cannot think of anything else efficient.



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-23 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061761#comment-16061761
 ] 

James Taylor commented on PHOENIX-153:
--

Yes, +1 to following Calcite syntax 



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-23 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061645#comment-16061645
 ] 

Ethan Wang commented on PHOENIX-153:


+1. After some study of _calcite/parse.jj_ and 
_calcite/SqlValidatorFeatureTest.java_, my understanding is that Calcite seems 
to be very close to the Postgres TABLESAMPLE syntax (which PHOENIX-153 is also 
designed to be similar to).

 I'd like to sum up two differences below (please correct me if I'm mistaken 
[~julianhyde]).

1, Calcite's table sampling rate input is 0 to 100 (PHOENIX-153 currently uses 
0 to 1).
2, Syntax difference:
Calcite:  select name from dept TABLESAMPLE system(58)
PHOENIX-153: select name from dept TABLESAMPLE 0.58 

Proposed change for PHOENIX-153: let's change the Phoenix side to be
select name from dept TABLESAMPLE(0.58) 
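
To make the two rate conventions concrete, here is a small sketch that maps a Calcite-style percentage to the 0-to-1 rate and renders both clause forms quoted above. The helper names are hypothetical, and the range check is an assumption for illustration; this comment does not say how out-of-range values are handled.

```python
def percent_to_rate(percent):
    """Convert a Calcite-style percentage (0 to 100) to a 0-to-1 sampling rate."""
    if not 0 <= percent <= 100:
        # Assumption: reject values outside the documented input range.
        raise ValueError(f"percentage out of range: {percent}")
    return percent / 100.0


def calcite_clause(percent):
    """Clause in the Calcite form shown above."""
    return f"TABLESAMPLE system({percent})"


def proposed_phoenix_clause(percent):
    """Clause in the form proposed above for PHOENIX-153."""
    return f"TABLESAMPLE({percent_to_rate(percent)})"


print("select name from dept " + calcite_clause(58))
print("select name from dept " + proposed_phoenix_clause(58))
```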

Thoughts?




Reference:
https://github.com/apache/calcite/blob/d619304070bf2874ab760c92ec2573ee6c19f536/piglet/src/main/javacc/PigletParser.jj

https://github.com/apache/calcite/blob/0938c7b6d767e3242874d87a30d9112512d9243a/core/src/test/java/org/apache/calcite/test/SqlValidatorFeatureTest.java



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-16 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052505#comment-16052505
 ] 

James Taylor commented on PHOENIX-153:
--

+1. Do you have something you can point us to for the Calcite TABLESAMPLE 
syntax?



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-16 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052472#comment-16052472
 ] 

Julian Hyde commented on PHOENIX-153:
-

Since Calcite already supports TABLESAMPLE let's save ourselves a headache and 
make sure that the 4.x syntax is compatible with Calcite's syntax.



[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-09 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044823#comment-16044823
 ] 

Ethan Wang commented on PHOENIX-153:


Spec of this patch. Feedback plz.

++
++The following are SUPPORTED
++
===BASE CASE===
select * from Person;
select * from PERSON TABLESAMPLE 0.45;

===WHERE CLAUSE===
select * from PERSON where ADDRESS = 'CA' OR name>'tina3';
select * from PERSON TABLESAMPLE 0.49 where ADDRESS = 'CA' OR name>'tina3';
select * from PERSON TABLESAMPLE 0.49 where ADDRESS = 'CA' OR name>'tina3' 
LIMIT 1;


===Weird Table===
select * from LOCAL_ADDRESS TABLESAMPLE 0.79;
select * from SYSTEM.STATS TABLESAMPLE 0.41;


===CORNER CASE===
select * from PERSON TABLESAMPLE 0;
select * from PERSON TABLESAMPLE 1.45;
select * from PERSON TABLESAMPLE kko;




++
++The following are NOT SUPPORTED
++
===Subquery and outer join not supported===
select * from (select * from PERSON where ADDRESS = 'CA') TABLESAMPLE 0.2 where 
Name > 'tina10'

===AGGREGATION===
select count(*) from PERSON TABLESAMPLE 0.5 LIMIT 2




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PHOENIX-153) Implement TABLESAMPLE clause

2017-06-01 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033617#comment-16033617
 ] 

Ethan Wang commented on PHOENIX-153:


Implementation proposal (feedback please).
Proposing table sampling on a per-table basis (in the 'FROM' part of the query). 
The sample size is determined by applying the input sampling rate to the 
frequency of the primary keys.


Syntax: 
`select name from person SAMPLE(0.10) where name='ethan'`


Returns:
The `person SAMPLE(0.10)` part returns rows amounting to about 10% of the volume 
of the PERSON table. This reduces the performance cost from a PERSON table scan 
to a Person-STATS table scan.


Implementation detail:
For table PERSON, assume STATS is populated with a GuidePost inserted on every 
other PK (50% coverage).
Step 1, within the query's scanning key range, iterate through the STATS table.
Step 2, for every GuidePost encountered, consult a random number generator 
to decide whether this GuidePost will be included in or excluded from the 
sample. This dice roll has a 10% chance of winning.
Step 3, once we decide to include this GuidePost, every PK in the original 
PERSON table that lies between this GuidePost and the next GuidePost will be 
included in the final sample. 
Repeat this process until all GuidePosts are visited.



Example: 

PERSON

|ID(PK)|
|1|
|2|
|3|
|4|
|5|
|6|



STATS

|GuidePost|
|1|
|3|
|5|

During the dice-rolling process, suppose GuidePost 3 is included. PKs in [3,5) 
will then be included, so the final result will be the rows with PK 3 and 4.
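
The three steps and the PERSON/STATS example can be sketched as follows. This is an illustrative model only, not the Phoenix implementation: the function name, the in-memory lists standing in for the PERSON and STATS tables, and the `rng` parameter are all assumptions. Keys that sort before the first guidepost are never sampled in this sketch.

```python
import random
from bisect import bisect_left


def sample_by_guideposts(primary_keys, guideposts, rate, rng):
    """Keep every key in [guidepost, next guidepost) for each guidepost that wins the roll.

    `rng` is any object with a random() method returning a float in [0, 1).
    """
    keys = sorted(primary_keys)
    posts = sorted(guideposts)
    sample = []
    for i, start in enumerate(posts):
        # Step 2: roll the dice once per guidepost.
        if rng.random() >= rate:
            continue  # this guidepost loses the roll, so its whole range is skipped
        # Step 3: include every key from this guidepost up to the next one.
        lo = bisect_left(keys, start)
        hi = bisect_left(keys, posts[i + 1]) if i + 1 < len(posts) else len(keys)
        sample.extend(keys[lo:hi])
    return sample


person_ids = range(1, 7)   # PERSON table from the example
stats = [1, 3, 5]          # STATS guideposts from the example

# A seeded generator makes a run repeatable; which guideposts win depends on the seed.
print(sample_by_guideposts(person_ids, stats, 0.10, random.Random(0)))

# Boundary rates: 1.0 keeps every guidepost, 0.0 keeps none.
print(sample_by_guideposts(person_ids, stats, 1.0, random.Random(0)))
print(sample_by_guideposts(person_ids, stats, 0.0, random.Random(0)))
```

Because whole guidepost ranges are accepted or rejected together, the returned row count moves in steps of one guidepost range, which is why small tables can land noticeably above or below sample_rate x table_size.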


This implementation: 
a, similar to Microsoft SQL Server's TABLESAMPLE, focuses mainly on the 
performance benefit. It does not guarantee an even distribution of the sample 
over the original table (representativeness). 
b, works with any GUIDE_POSTS_WIDTH and any input sample rate. However, if 
the table is too small, the sample output may include more or fewer rows than 
the expected count (sample_rate x table_size).






Summary of other popular TABLESAMPLE implementations.
Basically two categories:
1, Sampling on a query basis. 
(Such as BlinkDB. https://sameeragarwal.github.io/blinkdb_eurosys13.pdf)
This implementation applies the sampling process to the entire query, such as:
`select name from person where name='ethan' SAMPLE WITH ERROR 10% CONFIDENCE 
95%`

BlinkDB does so by assuming "the data used for similar grouping and filtering 
clause does not change over time across future queries". Based on heuristic 
experience, the query engine pre-builds certain stratified sample groups 
extracted from the actual table, caches them, and uses them to evaluate an 
approximate result for some expensive queries, thereby avoiding a full table 
scan. 

This approach:
a, Optimizes for the best performance-accuracy trade-off. Given an accuracy 
tolerance, it automatically decides the sampling rate for the user.
b, Takes filtering and grouping into consideration, which makes it powerful. 
On the other hand, it may not perform at the same level for all kinds of 
queries.  
c, Is based on heuristic information, so the engine goes through a gradual 
learning process. 




2, Sampling on a table basis. 
(Such as Postgres, MS SQL Server. 
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation)
This approach applies TABLESAMPLE only to the "FROM" part of the query, such as:
`select name from person TABLESAMPLE(10 PERCENT) where name='ethan'`

This approach first samples the original table into a smaller 'view' based on 
the primary key frequency and a given sampling rate. That 'view' then 
participates in the rest of the query in place of the original table.

Usually a random selection process is used during the view creation. In MS 
SQL Server, a linear one-pass pointer travels through each "page" and asks a 
random generator to decide whether this page will be part of the sample. Once 
a page is accepted, every row on it becomes part of the new sample. 

This MS SQL Server TABLESAMPLE: 
a, gives the flexibility to satisfy any sampling rate.
b, gains performance by reducing the length of a table scan (but the big-O 
complexity stays the same). 
c, only cares about the performance gain; it doesn't care about the sample 
distribution.

[~jamestaylor]  [~gjacoby]
[~samarthjain]
