[jira] [Commented] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key

James Taylor (JIRA) Tue, 22 Dec 2015 18:08:11 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069032#comment-15069032
 ]


James Taylor commented on PHOENIX-2143:
---------------------------------------

Thanks for the WIP patch, [~ankit.singhal]. The row key structure looks 
correct, but I don't think you'll need to change GuidePostsInfo to 
List<GuidePostsInfo> as GuidePostsInfo encapsulates all the guideposts across a 
table per column family. This object is sent across the wire as part of the 
PTable (the metadata for a table) and we'll want to continue doing that (the 
client caches the guideposts across the entire table). I don't think this 
change will impact the client-side much (other than perhaps a few minor tweaks).

The primary changes on the server side will be:
- Removing any stats-related logic when a split occurs. Nothing will be 
required during a split.
- We previously could delete the row that stored *all* guideposts for a given 
table/region/cf, but this will no longer be possible. Instead, when we update 
the stats for a region, we'll first collect the guideposts that need to be 
deleted by querying the stats table for all guideposts between the region 
start key and the region end key:
    SELECT guide_post_key FROM SYSTEM.STATS
    WHERE physical_name = ? AND family_name = ?
    AND guide_post_key >= :1 AND guide_post_key < :2
where :1 is bound to the region start key and :2 is bound to the region end 
key (with some variation for the first and last regions, which are unbounded 
at the beginning/end of the range). Based on this query, we can form the row 
keys of the guideposts we need to delete and add these mutations to the list 
of mutations we submit with the puts for the new guideposts.
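
To make the range semantics concrete, here is a minimal sketch of the 
"which guideposts fall in this region" check, done in memory rather than 
through the query above. It is not taken from the patch: the class and method 
names (GuidePostRange, inRegion, staleGuidePosts) are made up for 
illustration, and it assumes the usual HBase convention that an empty byte 
array stands for the unbounded start of the first region and the unbounded 
end of the last region. Keys are compared as unsigned bytes, which is how row 
keys sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Phoenix.
public class GuidePostRange {

    // True if key lies in [startKey, endKey). An empty bound is treated as
    // unbounded (assumed convention for the first/last region).
    static boolean inRegion(byte[] key, byte[] startKey, byte[] endKey) {
        boolean afterStart = startKey.length == 0
                || Arrays.compareUnsigned(key, startKey) >= 0;
        boolean beforeEnd = endKey.length == 0
                || Arrays.compareUnsigned(key, endKey) < 0;
        return afterStart && beforeEnd;
    }

    // Returns the existing guideposts that belong to the region and would
    // therefore be deleted before the region's new guideposts are written.
    static List<byte[]> staleGuidePosts(List<byte[]> existing,
                                        byte[] startKey, byte[] endKey) {
        List<byte[]> stale = new ArrayList<>();
        for (byte[] guidePost : existing) {
            if (inRegion(guidePost, startKey, endKey)) {
                stale.add(guidePost);
            }
        }
        return stale;
    }
}
```

In the real code path, each key returned this way would be turned into a 
delete on the corresponding stats row and batched with the puts for the new 
guideposts, as described above.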

> Use guidepost bytes instead of region name in stats primary key
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2143
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2143
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Ankit Singhal
>         Attachments: PHOENIX-2143_wip.patch
>
>
> Our current SYSTEM.STATS table uses the region name as the last column in the 
> primary key constraint. Instead, we should use the MIN_KEY column (which 
> corresponds to the region start key). The advantage would be that the stats 
> would then be ordered by region start key allowing us to approximate the 
> number of guideposts which would be traversed given the start/stop row of a 
> scan:
> {code}
> SELECT SUM(guide_posts_count) FROM SYSTEM.STATS WHERE min_key > :1 AND 
> min_key < :2
> {code}
> where :1 is the start row and :2 is the stop row of the scan. With an UNNEST 
> operator for ARRAYs, we could get a better approximation.
> As part of the upgrade to the new Phoenix version containing this fix, stats 
> could simply be dropped and they'd be recalculated with the new schema.
> An alternative, even more granular approach would be to *not* use arrays to 
> store the guide posts, but instead store them as individual rows with a 
> schema like this.
> |PHYSICAL_NAME|VARCHAR|
> |COLUMN_FAMILY|VARCHAR|
> |GUIDE_POST_KEY|VARBINARY|
> In this alternative, the maintenance during compaction is higher, though, as 
> you'd need to run a separate query to do the deletion of the old guideposts, 
> followed by a commit of the new guideposts. The other disadvantage (besides 
> requiring multiple queries) is that this couldn't be done transactionally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
