[
https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100555#comment-14100555
]
ramkrishna.s.vasudevan commented on PHOENIX-180:
------------------------------------------------
Thanks for the initial feedback.
bq.A config option that controls how often we read the stats table from
MetaDataEndPointImpl and a timer thread that does this.
Can we use the Chore.java that HBase provides? That would require
MetaDataEndPointImpl to implement Stoppable.
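A rough sketch of what that could look like. HBase's Chore takes a Stoppable owner and a period; to keep this snippet self-contained, the Stoppable interface is mirrored locally and a plain Runnable loop stands in for the Chore subclass. All names here (StatsRereadChore, rereadStats) are illustrative, not actual Phoenix identifiers:

```java
// Mirrors org.apache.hadoop.hbase.Stoppable so this sketch compiles stand-alone;
// in the real patch, MetaDataEndPointImpl would implement the HBase interface.
interface Stoppable {
    void stop(String why);
    boolean isStopped();
}

// Stand-in for a Chore subclass: re-read the stats table every periodMs
// until the stopper says to quit. With real HBase this would extend Chore
// (constructed as Chore("StatsReread", periodMs, stopper)) and put the
// re-read in chore() instead of run().
public class StatsRereadChore implements Runnable {
    private final Stoppable stopper;
    private final Runnable rereadStats;  // placeholder for the actual stats-table scan
    private final long periodMs;

    public StatsRereadChore(Stoppable stopper, Runnable rereadStats, long periodMs) {
        this.stopper = stopper;
        this.rereadStats = rereadStats;
        this.periodMs = periodMs;
    }

    @Override
    public void run() {
        while (!stopper.isStopped()) {
            rereadStats.run();           // refresh the cached stats
            try {
                Thread.sleep(periodMs);  // Chore handles this scheduling itself
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```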
bq.The table in the server-side cache in MetaDataEndPointImpl should be
invalidated when the stats are reread and have changed
Okie. So we will need an equals() method in PTableStatsImpl, do a
compare, and invalidate the server-side cache only when the stats differ.
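A minimal sketch of that equals() check, assuming the stats object holds guide post keys per region (the field layout and the class name PTableStatsSketch are assumptions, not the actual PTableStatsImpl):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Sketch of value equality for the stats object, so the server-side cache
// entry is only invalidated when the re-read stats actually differ:
//   if (!cached.equals(fresh)) { /* invalidate cached PTable */ }
public class PTableStatsSketch {
    // Assumed layout: region name -> sorted guide post keys for that region.
    private final TreeMap<String, byte[][]> guidePostsByRegion;

    public PTableStatsSketch(TreeMap<String, byte[][]> guidePostsByRegion) {
        this.guidePostsByRegion = guidePostsByRegion;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PTableStatsSketch)) return false;
        PTableStatsSketch other = (PTableStatsSketch) o;
        if (!guidePostsByRegion.keySet().equals(other.guidePostsByRegion.keySet())) {
            return false;
        }
        // byte[] needs deep comparison; Map.equals alone would compare references.
        for (Map.Entry<String, byte[][]> e : guidePostsByRegion.entrySet()) {
            if (!Arrays.deepEquals(e.getValue(), other.guidePostsByRegion.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    @Override
    public int hashCode() {
        return guidePostsByRegion.keySet().hashCode();
    }
}
```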
bq. SQL command to initiate the endpoint coprocessor to update the stats.
Okie. I will do this.
bq.You should be able to use these in tests to ensure that Bytes.split() is no
longer used.
Phoenix adds two KVs with the actual table name and _0, _1 as the qualifier
names, so we should avoid those. Can this be done on the server side itself
using filters? I have yet to see how Phoenix already avoids them.
What should the default config value be for deciding when a guide post
should be collected? Currently it is based on kv.getLength(): a guide post is
collected when the accumulated byte count exceeds this config value.
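That accumulation could be sketched as follows. GuidePostCollector and its parameters are illustrative stand-ins; in the real coprocessor the lengths would come from KeyValue.getLength() as each cell is scanned:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of byte-count-based guide post collection: accumulate the size of
// every KV seen, and each time the running total crosses the configured
// width, record the current row key as a guide post and start counting again.
public class GuidePostCollector {
    private final long width;            // configured byte threshold (assumed config value)
    private long bytesSinceLast = 0;     // bytes accumulated since the last guide post
    private final List<byte[]> guidePosts = new ArrayList<>();

    public GuidePostCollector(long width) {
        this.width = width;
    }

    // Called once per KV scanned; kvLength stands in for kv.getLength().
    public void addKeyValue(byte[] rowKey, int kvLength) {
        bytesSinceLast += kvLength;
        if (bytesSinceLast >= width) {
            guidePosts.add(rowKey);      // this row becomes a guide post
            bytesSinceLast = 0;          // reset the counter for the next one
        }
    }

    public List<byte[]> getGuidePosts() {
        return guidePosts;
    }
}
```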
bq.perf will be interesting to compare using a schema like the one I mentioned
before.
Can you point me to an example schema that does this?
I see that there is an example here:
https://github.com/forcedotcom/phoenix/issues/47.
> Use stats to guide query parallelization
> ----------------------------------------
>
> Key: PHOENIX-180
> URL: https://issues.apache.org/jira/browse/PHOENIX-180
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: James Taylor
> Assignee: ramkrishna.s.vasudevan
> Labels: enhancement
> Attachments: Phoenix-180_WIP.patch
>
>
> We're currently not using stats, beyond a table-wide min key/max key cached
> per client connection, to guide parallelization. If a query targets just a
> few regions, we don't know how to evenly divide the work among threads,
> because we don't know the data distribution. This other issue
> (https://github.com/forcedotcom/phoenix/issues/64) targets gathering and
> maintaining the stats, while this issue is focused on using them.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information
> (and implements the Writable interface so that it can be serialized back from
> the server).
> 2. Add a stats member variable off of PTable to hold this.
> 3. From MetaDataEndPointImpl, lookup the stats row for the table in the stats
> table. If the stats have changed, return a new PTable with the updated stats
> information. We may want to cache the stats row and have the stats gatherer
> invalidate the cached row when it is updated so we don't always have to scan
> for it. Additionally, it would be ideal if we could use the same split policy
> on the stats table that we use on the system table to guarantee co-location
> of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to
> use this information to guide how to chunk up the scans at query time.
> This should help boost query performance, especially in cases where the data
> is highly skewed. It's likely the cause for the slowness reported in this
> issue: https://github.com/forcedotcom/phoenix/issues/47.
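The client-side chunking in step 4 of the description could look roughly like this. KeyRange and the String keys are simplified stand-ins for the Phoenix types (real code operates on byte[] row keys), and GuidePostSplitter is a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting one region's key range at the guide posts that fall
// inside it, so ParallelIterators.getSplits() can hand each chunk to its
// own thread instead of scanning the whole region serially.
public class GuidePostSplitter {
    public static class KeyRange {
        public final String start, end;  // [start, end); String keys for brevity
        public KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    // guidePosts must be sorted ascending; posts outside (regionStart, regionEnd)
    // are skipped so neighboring regions' guide posts don't produce empty chunks.
    public static List<KeyRange> split(String regionStart, String regionEnd,
                                       List<String> guidePosts) {
        List<KeyRange> chunks = new ArrayList<>();
        String cur = regionStart;
        for (String gp : guidePosts) {
            if (gp.compareTo(regionStart) > 0 && gp.compareTo(regionEnd) < 0) {
                chunks.add(new KeyRange(cur, gp));  // close the current chunk at the guide post
                cur = gp;
            }
        }
        chunks.add(new KeyRange(cur, regionEnd));   // remainder of the region
        return chunks;
    }
}
```

Since guide posts mark roughly equal byte counts, the resulting chunks should represent roughly equal amounts of work even when the row keys are highly skewed.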
--
This message was sent by Atlassian JIRA
(v6.2#6252)