[jira] [Commented] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key

Samarth Jain (JIRA) Mon, 11 Jan 2016 20:10:39 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093252#comment-15093252
 ]


Samarth Jain commented on PHOENIX-2143:
---------------------------------------

[~ankit.singhal] - I see that a lot of changes in your patch are white-space 
related. Can you make sure you don't format the code that you are not changing. 

Also, why is this deleted in StatisticsScannerCallable? 

{code}

-                if (mergeRegions != null) {
-                    if (mergeRegions.getFirst() != null) {
-                        if (LOG.isDebugEnabled()) {
-                            LOG.debug("Deleting stale stats for the region "
-                                    + 
mergeRegions.getFirst().getRegionNameAsString() + " as part of major 
compaction");
-                        }
-                        
stats.deleteStats(mergeRegions.getFirst().getRegionName(), tracker, family, 
mutations);
-                    }
-                    if (mergeRegions.getSecond() != null) {
-                        if (LOG.isDebugEnabled()) {
-                            LOG.debug("Deleting stale stats for the region "
-                                    + 
mergeRegions.getSecond().getRegionNameAsString() + " as part of major 
compaction");
-                        }
-                        
stats.deleteStats(mergeRegions.getSecond().getRegionName(), tracker, family, 
mutations);
-                    }
-                }

{code}

In StatsCollectorIT we already have test that run UPDATE STATISTICS. Not sure 
what the test that you added is testing that the existing tests are not. 
 
I notice there is more deleted code in UngroupedAggregateRegionObserver:

{code}
-    @Override
-    public void postSplit(final ObserverContext<RegionCoprocessorEnvironment> 
e, final Region l,
-            final Region r) throws IOException {
-        final Configuration config = e.getEnvironment().getConfiguration();
-        boolean async = config.getBoolean(COMMIT_STATS_ASYNC, 
DEFAULT_COMMIT_STATS_ASYNC);
-        if (!async) {
-            splitStatsInternal(e, l, r);
-        } else {
-            StatisticsCollectionRunTracker.getInstance(config).runTask(new 
Callable<Void>() {
-                @Override
-                public Void call() throws Exception {
-                    splitStatsInternal(e, l, r);
-                    return null;
-                }
-            });
-        }
-    }
-
-    private void splitStatsInternal(final 
ObserverContext<RegionCoprocessorEnvironment> e,
-            final Region l, final Region r) {
-        final RegionCoprocessorEnvironment env = e.getEnvironment();
-        final Region region = env.getRegion();
-        final TableName table = region.getRegionInfo().getTable();
-        try {
-            boolean useCurrentTime =
-                    
env.getConfiguration().getBoolean(QueryServices.STATS_USE_CURRENT_TIME_ATTRIB,
-                            
QueryServicesOptions.DEFAULT_STATS_USE_CURRENT_TIME);
-            // Provides a means of clients controlling their timestamps to not 
use current time
-            // when background tasks are updating stats. Instead we track the 
max timestamp of
-            // the cells and use that.
-            final long clientTimeStamp = useCurrentTime ? 
TimeKeeper.SYSTEM.getCurrentTime() :
-              StatisticsCollector.NO_TIMESTAMP;
-            User.runAsLoginUser(new PrivilegedExceptionAction<Void>() {
-              @Override
-              public Void run() throws Exception {
-                StatisticsCollector stats = new StatisticsCollector(env,
-                  table.getNameAsString(), clientTimeStamp);
-                try {
-                  stats.splitStats(region, l, r);
-                  return null;
-                } finally {
-                  if (stats != null) stats.close();
-                }
-              }
-            });
-        } catch (IOException ioe) {
-            if(logger.isWarnEnabled()) {
-                logger.warn("Error while collecting stats during split for " + 
table,ioe);
-            }
-        }
-    }

{code}

Can you tell me more why is this not needed anymore? Are existing tests passing 
for you? 

> Use guidepost bytes instead of region name in stats primary key
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2143
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2143
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Ankit Singhal
>         Attachments: PHOENIX-2143.patch, PHOENIX-2143_wip.patch, 
> PHOENIX-2143_wip_2.patch
>
>
> Our current SYSTEM.STATS table uses the region name as the last column in the 
> primary key constraint. Instead, we should use the MIN_KEY column (which 
> corresponds to the region start key). The advantage would be that the stats 
> would then be ordered by region start key allowing us to approximate the 
> number of guideposts which would be traversed given the start/stop row of a 
> scan:
> {code}
> SELECT SUM(guide_posts_count) FROM SYSTEM.STATS WHERE min_key > :1 AND 
> min_key < :2
> {code}
> where :1 is the start row and :2 is the stop row of the scan. With an UNNEST 
> operator for ARRAYs, we could get a better approximation.
> As part of the upgrade to the new Phoenix version containing this fix, stats 
> could simply be dropped and they'd be recalculated with the new schema.
> An alternative, even more granular approach would be to *not* use arrays to 
> store the guide posts, but instead store them as individual rows with a 
> schema like this.
> |PHYSICAL_NAME|VARCHAR|
> |COLUMN_FAMILY|VARCHAR|
> |GUIDE_POST_KEY|VARBINARY|
> In this alternative, the maintenance during compaction is higher, though, as 
> you'd need to run a separate query to do the deletion of the old guideposts, 
> followed by a commit of the new guideposts. The other disadvantage (besides 
> requiring multiple queries) is that this couldn't be done transactionally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PHOENIX-2143) Use guidepost bytes instead of region name in stats primary key

Reply via email to