[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-16 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Description: 
For a salted table, when a query targets the entire data set, a different
plan is created depending on the form of the query, and as a result
incorrect data is returned.

{code:java}
// Actually, the schema of the table I used is different, but please ignore it.
create table if not exists test.test_table (
  rk1 varchar not null,
  rk2 varchar not null,
  column1 varchar,
  constraint pk primary key (rk1, rk2)
)
...
SALT_BUCKETS=16...
;
{code}
 

I created a table with 16 salt buckets (and thus 16 initial regions) and then wrote a lot of data.
 HBase automatically split the regions, and I merged regions to balance data
between the region servers.
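For background on why the region boundaries matter here: Phoenix prepends a salt byte derived from a hash of the row key, so a table with SALT_BUCKETS=16 starts with 16 regions split exactly at the salt-byte boundaries. The sketch below is a deliberately simplified stand-in (the class and hash function are invented for illustration; the real logic lives in Phoenix's SaltingUtil):

```java
// Hypothetical simplification of Phoenix-style salting, for illustration only.
// A salt byte derived from the row key is prepended to every key, splitting
// the table into SALT_BUCKETS contiguous key ranges [0] .. [buckets-1].
public class SaltSketch {
    // Derive a salt byte in [0, buckets) from the row key bytes.
    static byte saltByte(byte[] rowKey, int buckets) {
        int hash = 1;
        for (byte b : rowKey) {
            hash = 31 * hash + b; // simple stand-in hash, not Phoenix's
        }
        return (byte) Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int buckets = 16;
        byte salt = saltByte("rk1\u0000aa".getBytes(), buckets);
        // Every row lands in exactly one of the 16 buckets; the initial
        // region start keys are exactly the salt bytes 0..15.
        System.out.println("bucket = " + salt);
    }
}
```

The key point for this bug report is that the initial region start keys coincide with the salt-byte values; merging regions can produce boundaries that no longer do.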

Then, when running queries, you can see that a different plan is created depending on
the WHERE clause.
 * query1
 select count\(*) from test.test_table;
{code:java}
+------------------------------------------------------------------------------------------------------+-----------------+----------------+
|                                                 PLAN                                                 | EST_BYTES_READ  | EST_ROWS_READ  |
+------------------------------------------------------------------------------------------------------+-----------------+----------------+
| CLIENT 1851-CHUNK 5005959292 ROWS 1944546675532 BYTES PARALLEL 11-WAY FULL SCAN OVER TEST:TEST_TABLE | 1944546675532   | 5005959292     |
|     SERVER FILTER BY FIRST KEY ONLY                                                                  | 1944546675532   | 5005959292     |
|     SERVER AGGREGATE INTO SINGLE ROW                                                                 | 1944546675532   | 5005959292     |
+------------------------------------------------------------------------------------------------------+-----------------+----------------+
{code}

 * query2
 select count\(*) from test.test_table where rk2 = 'aa';
{code}
+------------------------------------------------------------------------------------------------------------------+-----------------+----------------+
|                                                       PLAN                                                       | EST_BYTES_READ  | EST_ROWS_READ  |
+------------------------------------------------------------------------------------------------------------------+-----------------+----------------+
| CLIENT 1846-CHUNK 4992196444 ROWS 1939177965768 BYTES PARALLEL 11-WAY RANGE SCAN OVER TEST:TEST_TABLE [0] - [15] | 1939177965768   | 4992196444     |
|     SERVER FILTER BY FIRST KEY ONLY AND RK2 = 'aa'                                                               | 1939177965768   | 4992196444     |
|     SERVER AGGREGATE INTO SINGLE ROW                                                                             | 1939177965768   | 4992196444     |
+------------------------------------------------------------------------------------------------------------------+-----------------+----------------+
{code}

Since rk2, used in the WHERE clause of query2, is the second column of the PK, the
query should be a full scan like query1.
However, as you can see, query2 is planned as a range scan, and it generates
five fewer chunks than query1 (1846 vs. 1851).
I added logging to print the start key and end key of each Scan object
generated by the plan.
And I found that five chunks were missing from query2.

All five missing chunks were in regions whose originally generated
region boundary values were not preserved after the merge operation.
!initial_salting_region.png!

After merging regions
!merged-region.png!

The code that caused the problem is this part.

 When a select query is executed, the 
[org.apache.phoenix.iterate.BaseResultIterators#getParallelScans|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java#L743-L744]
 method creates Scan objects based on the guideposts in the statistics table.
When a guidepost spans a region boundary, the corresponding scan is split into
two Scan objects. The code used here is 
[org.apache.phoenix.compile.ScanRanges#intersectScan|https://github.com/apache/phoenix/blob/v4.11.0-HBase-1.2/phoenix-core/src/main/java/org/apache/phoenix/compile/ScanRanges.java#L299-L303].

!ScanRanges_intersectScan.png!

In the case of a salted table, this code compares keys using only the
remainder after stripping the salt (prefix) byte.
I cannot be sure whether this code is buggy or intended.
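To illustrate why comparing keys after stripping the salt byte can go wrong once a region boundary no longer sits on a salt-bucket boundary, here is a self-contained sketch (invented for illustration; this is not Phoenix's actual intersectScan code): two keys that are correctly ordered by their full bytes can compare in the opposite order once the salt prefix is dropped, so an intersection test on the stripped keys can conclude the range is empty and silently drop that chunk.

```java
import java.util.Arrays;

// Illustrative sketch only - not Phoenix's ScanRanges#intersectScan.
// It shows how key ordering can invert when the salt (first) byte is ignored.
public class SaltCompareSketch {
    // Lexicographic comparison of the full keys (salt byte included).
    static int compareFull(byte[] a, byte[] b) {
        return Arrays.compare(a, b);
    }

    // Comparison after stripping the leading salt byte, as a salted-table
    // code path might do.
    static int compareStripped(byte[] a, byte[] b) {
        return Arrays.compare(Arrays.copyOfRange(a, 1, a.length),
                              Arrays.copyOfRange(b, 1, b.length));
    }

    public static void main(String[] args) {
        // After a merge, a region can start at key [3, 'm'] and end at
        // [7, 'a'], i.e. its boundary is no longer salt-bucket aligned.
        byte[] start = {3, 'm'};
        byte[] end   = {7, 'a'};
        System.out.println(compareFull(start, end) < 0);     // valid range
        System.out.println(compareStripped(start, end) < 0); // 'm' > 'a'
        // An intersection test using the stripped comparison would treat
        // this range as empty and could drop the chunk entirely.
    }
}
```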

In my case I merged the regions manually, but the same problem is likely to occur
through HBase's Normalizer function as well.

I hope other users do not merge regions of a salted table manually, and do not set
the table property NORMALIZATION_ENABLED to true, in their production cluster.
If so, check to see if the initial Sal

[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-16 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: merged-region.png


[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-16 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: initial_salting_region.png


[jira] [Updated] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-16 Thread JeongMin Ju (JIRA)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated PHOENIX-4906:
-
Attachment: ScanRanges_intersectScan.png


[jira] [Created] (PHOENIX-4906) Abnormal query result due to Phoenix plan error

2018-09-16 Thread JeongMin Ju (JIRA)
JeongMin Ju created PHOENIX-4906:


 Summary: Abnormal query result due to Phoenix plan error
 Key: PHOENIX-4906
 URL: https://issues.apache.org/jira/browse/PHOENIX-4906
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 4.14.0, 4.11.0
Reporter: JeongMin Ju



Re: [DISCUSS] Suggestions for Phoenix from HBaseCon Asia notes

2018-09-16 Thread la...@apache.org
 I think we can start by implementing a tighter integration with Spark through 
DataSource V2. That would make it quickly apparent what parts of Phoenix would 
need direct access.
Some parts just need an interface audience declaration (like Phoenix's basic 
type system) and our agreement that we will change those only according to 
semantic versioning. Other parts (like the query plan) will need a bit more 
thinking. Maybe that's the path to hook in Calcite - just making that part up as I 
write this...
Perhaps turning the HBase interface into an API might not be so difficult 
either. That would perhaps be a new - strictly additional - client API.

A good Spark interface is in everybody's interest and I think is the best 
avenue to figure out what's missing/needed.
-- Lars

On Wednesday, September 12, 2018, 12:47:21 PM PDT, Josh Elser 
 wrote:  
 
 I like it, Lars. I like it very much.

Just the easy part of doing it... ;)

On 9/11/18 4:53 PM, la...@apache.org wrote:
>  Sorry for coming a bit late to this. I've been thinking along some of these lines 
> for a bit.
> It seems Phoenix serves 4 distinct purposes:
> 1. Query parsing and compiling.
> 2. A type system.
> 3. Query execution.
> 4. An efficient HBase interface.
> Each of these is useful by itself, but we do not expose these as stable 
> interfaces. We have seen a lot of need to tie HBase into "higher level" 
> services, such as Spark (and Presto, etc.).
> I think we can get a long way if we separate at least #1 (SQL) from the rest: 
> #2, #3, and #4 (Typed HBase Interface - THI).
> Phoenix is used via SQL (#1); other tools such as Presto, Impala, Drill, 
> Spark, etc., can interface efficiently with HBase via THI (#2, #3, and #4).
> Thoughts?
> -- Lars
>      On Monday, August 27, 2018, 11:03:33 AM PDT, Josh Elser 
> wrote:
>  
>  (bcc: dev@hbase, in case folks there have been waiting for me to send
> this email to dev@phoenix)
> 
> Hi,
> 
> In case you missed it, there was an HBaseCon event held in Asia
> recently. Stack took some great notes and shared them with the HBase
> community. A few of them touched on Phoenix, directly or in a related
> manner. I think they are good "criticisms" that are beneficial for us to
> hear.
> 
> 1. The phoenix-$version-client.jar size is prohibitively large
> 
> In this day and age, I'm surprised that this is a big issue for people.
> I know we have a lot of cruft, most of which comes from Hadoop. We have
> gotten better here over recent releases, but I would guess that there is
> more we can do.
> 
> 2. Can Phoenix be the de-facto schema for SQL on HBase?
> 
> We've long asserted "if you have to ask how Phoenix serializes data, you
> shouldn't be doing it" (a nod that you have to write lots of code). What if
> we turn that on its head? Could we extract our PDataType serialization,
> composite row-key, column encoding, etc into a minimal API that folks
> with their own itches can use?
> 
> With the growing integrations into Phoenix, we could embrace them by
> providing an API to make what they're doing easier. In the same vein, we
> cement ourselves as a cornerstone of doing it "correctly".
> 
> 3. Better recommendations to users to not attempt certain queries.
> 
> We definitively know that there are certain types of queries that
> Phoenix cannot support well (compared to optimal Phoenix use-cases).
> Users very commonly fall into such pitfalls on their own and this leaves
> a bad taste in their mouth (thinking that the product "stinks").
> 
> Can we do a better job of telling the user when and why it happened?
> What would such a user-interaction model look like? Can we supplement
> the "why" with instructions of what to do differently (even if in the
> abstract)?
> 
> 4. Phoenix-Calcite
> 
> This was mentioned as a "nice to have". From what I understand, there
> was nothing explicitly wrong with the implementation or approach, just
> that it was a massive undertaking to continue with little immediate
> gain. Would this be a boon for us to try to continue in some form? Are
> there steps we can take that would help push us along the right path?
> 
> Anyways, I'd love to hear everyone's thoughts. While the concerns were
> raised at HBaseCon Asia, the suggestions that accompany them here are
> largely mine ;). Feel free to break them out into their own threads if
> you think that would be better (or say that you disagree with me --
> that's cool too)!
> 
> - Josh

Re: Salting based on partial rowkeys

2018-09-16 Thread la...@apache.org
 I added some comments on PHOENIX-4757.

On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser 
 wrote:  
 
 Ahh, I get you now.

For a composite primary key made up of columns 1 through N, you want 
similar controls to compute the value of the salt based on a sequence of 
the columns 1 through M where M <= N (instead of always on all columns).

For large numbers of salt buckets and a scan over a facet, you prune 
your search space considerably. Makes sense to me!
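The idea Josh describes can be sketched as a toy example (the class and hash function below are invented for illustration and are not the proposed Phoenix API): if the salt byte is computed from only a chosen key column rather than the full row key, every row sharing that column's value gets the same salt byte, and therefore lands in the same bucket.

```java
// Toy illustration of salting on a single chosen column instead of the
// full row key. Not Phoenix code; the hash is a made-up stand-in.
public class PartialSaltSketch {
    // Hash one column's bytes into a salt byte in [0, buckets).
    static byte saltOn(byte[] columnValue, int buckets) {
        int hash = 1;
        for (byte b : columnValue) hash = 31 * hash + b;
        return (byte) Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        int buckets = 60;
        byte[] id2 = "42".getBytes();
        // Two rows with different id_1 but the same id_2: salting on id_2
        // alone gives both rows the same salt byte, so a GROUP BY id_2
        // finds all of a group's rows on one region server.
        byte saltRow1 = saltOn(id2, buckets); // row (id_1=2, id_2=42, ...)
        byte saltRow2 = saltOn(id2, buckets); // row (id_1=3, id_2=42, ...)
        System.out.println(saltRow1 == saltRow2); // equal by construction
    }
}
```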

On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> In case the text formatting is lost below, I also added it as a comment in
> the JIRA ticket:
> 
> https://issues.apache.org/jira/browse/PHOENIX-4757
> 
> 
> On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi 
> wrote:
> 
>> Sorry I missed Josh's reply; I've subscribed to the dev list now.
>>
>> Below is a copy-and-paste from our internal document. Thanks in advance
>> for your review and additional feedback on this.
>>
>> Gerald
>>
>> *Background*
>> We make extensive use of multi-column rowkeys and salting in our different
>> Apache Phoenix deployments. We frequently perform group-by aggregations on
>> these data along a specific dimension that would benefit from predictably
>> partitioning the data along that dimension.
>>
>> *Proposal*
>> We propose to add table metadata to allow schema designers to constrain
>> salting to a subset of the rowkey, rather than the full rowkey as it is
>> today. This will introduce a mechanism to partition data on a per-table
>> basis along a single dimension without application changes or much change
>> to the Phoenix runtime logic. We expect this will result in substantially
>> faster group-bys along the salted dimension and negligible penalties
>> elsewhere. This feature has also been proposed in PHOENIX-4757, where it
>> was pointed out that partitioning and sorting data along different
>> dimensions is a common pattern in other datastores as well. Theoretically,
>> it could cause hotspotting when querying along the salted dimension
>> without the leading rowkey - that would be an anti-pattern.
>>
>> *Usage Example*
>>
>> Current schema:
>>
>> CREATE TABLE relationship (
>>   id_1 BIGINT NOT NULL,
>>   id_2 BIGINT NOT NULL,
>>   other_key BIGINT NOT NULL,
>>   val SMALLINT,
>>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
>> ) SALT_BUCKETS=60;
>>
>> Query:
>>
>> SELECT id_2, SUM(val)
>> FROM relationship
>> WHERE id_1 IN (2, 3)
>> GROUP BY id_2;
>>
>> Explain:
>>
>> 0: jdbc:phoenix:> EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group by id_2;
>> | CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP [0,2] - [59,3] | null |
>> |     SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]                                      | null |
>> | CLIENT MERGE SORT                                                                      | null |
>> 3 rows selected (0.048 seconds)
>>
>> In this case, although the group by is performed on both the client and
>> region server, almost all of the actual grouping happens on the client
>> because the id_2's are randomly distributed across the region servers. As
>> a result, a lot of unnecessary data is serialized to the client and
>> grouped serially there. This can become quite material with large result
>> sets.
>>
>> Proposed schema:
>>
>> CREATE TABLE relationship (
>>   id_1 BIGINT NOT NULL,
>>   id_2 BIGINT NOT NULL,
>>   other_key BIGINT NOT NULL,
>>   val SMALLINT,
>>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
>> ) SALT_BUCKETS=60, SALT_COLUMN = id_2;
>>
>> Query (unchanged). Explain (unchanged).
>>
>> Under the proposal, the data are merely partitioned so that all rows
>> containing the same id_2 are on the same region server, and the above
>> query will perform almost all of the grouping in parallel on the region
>> servers. No special hint or changes to the query plan would be required
>> to benefit. Tables would need to be re-salted to take advantage of the
>> new functionality.
>>
>> Technical changes proposed to Phoenix:
>> - Create a new piece of table-level metadata: SALT_COLUMN. SALT_COLUMN
>>   will instruct the salting logic to generate a salt byte based only on
>>   the specified column. If unspecified, it will behave as it does today
>>   and default to salting the entire rowkey. This metadata may be
>>   specified only when the table is created and may not be modified. The
>>   specified column must be part of the rowkey.
>> - Modify all callers of getSaltingByte