[
https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103561#comment-14103561
]
James Taylor commented on PHOENIX-180:
--------------------------------------
Just use a null tenantId (QueryConstants.SEPARATOR_BYTE) for now; you can get
the schemaName and tableName from the HBase table name
(SchemaUtil.getSchemaNameFromFullName(byte[] tableName) and
SchemaUtil.getTableNameFromFullName(byte[] tableName)). If the table being
invalidated is multi-tenant (table.isMultiTenant()), then we'll likely want to
run a scan to get all tenantIds that have views defined and then invalidate all
of them. Alternatively, we could just walk through the keys in the metadata
cache and invalidate any tenant-specific tables. Don't worry about this for
now, but please file a JIRA so we don't forget it.
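A minimal sketch of that walk-the-cache alternative, assuming cache keys of the form tenantId + SEPARATOR_BYTE + schemaName + SEPARATOR_BYTE + tableName (the key layout, class and method names below are illustrative, not the actual metadata cache API):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TenantCacheInvalidation {
    // Stand-in for QueryConstants.SEPARATOR_BYTE; the key layout is assumed.
    static final char SEPARATOR = '\0';

    static String cacheKey(String tenantId, String schemaName, String tableName) {
        return tenantId + SEPARATOR + schemaName + SEPARATOR + tableName;
    }

    // Walk the cache and drop every tenant-specific entry (non-empty tenantId)
    // matching the schema/table of the multi-tenant base table being invalidated.
    // (Simplification: real tenant views have their own names and would be
    // matched via their base table, not by name equality.)
    static int invalidateTenantViews(Map<String, Object> cache,
                                     String schemaName, String tableName) {
        String suffix = SEPARATOR + schemaName + SEPARATOR + tableName;
        int removed = 0;
        Iterator<String> it = cache.keySet().iterator();
        while (it.hasNext()) {
            String key = it.next();
            boolean tenantSpecific = !key.isEmpty() && key.charAt(0) != SEPARATOR;
            if (tenantSpecific && key.endsWith(suffix)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Object> cache = new ConcurrentHashMap<>();
        cache.put(cacheKey("", "S", "BASE"), new Object());        // global table entry
        cache.put(cacheKey("tenant1", "S", "BASE"), new Object()); // tenant-specific
        cache.put(cacheKey("tenant2", "S", "BASE"), new Object()); // tenant-specific
        System.out.println(invalidateTenantViews(cache, "S", "BASE")); // prints 2
    }
}
```

The scan-based approach would find the same set of tenantIds from the views defined on the base table; the walk above just trades that scan for an in-memory pass over the cache.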
How'd you decide to solve the multiple column family issue you brought up
before?
An example where you'd see a big performance difference with
better parallelization is with a composite row key with low
cardinality (like an enum) followed by a high cardinality value (like
a date). For example, a table to track server metrics across a set of
servers might have a primary key constraint of hostName+dateMeasured.
Assume that a query is done for a given hostName and that each region
stores more or less one hostName worth of data. We'd attempt to
parallelize the scan by using Bytes.split to calculate split points
between two host names. For example, let's say you had host names like
SF1, SF2, SF3, etc. with a query to calculate average response time
for a given host:
{code}
SELECT avg(responseTime) FROM SERVER_METRICS
WHERE hostName='SF1'
{code}
Then we'd calculate splits from SF1 to SF2 and use these to determine
the chunk of work given to each scan for each thread. Unfortunately,
this would cause all the work to be done by a single thread, as we
have no knowledge of the range of values for dateMeasured. In
actuality, the dates would likely range from the current date to 90
days before, but we'd be splitting instead from 1902 to 2038 (the full
range of an 8 byte epoch time), which is pretty useless.
By establishing guideposts (keys sampled from the actual data at
regular row-count intervals by the stats gatherer) and splitting on
them instead, we'd get a 10-20x perf boost due to better
parallelization.
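To make the contrast concrete, here's a self-contained toy (not Phoenix code; the date constants, row counts, and 4-way chunking are made up for illustration) that splits 90 days of per-host rows first by evenly dividing the theoretical epoch range, then by data-derived guideposts:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GuidepostSplitDemo {
    // 900 dateMeasured values clustered in the ~90 days before a fixed "now".
    static List<Long> sampleDates() {
        List<Long> dates = new ArrayList<>();
        long now = 1_400_000_000_000L; // epoch millis, mid-2014
        for (int i = 0; i < 900; i++) dates.add(now - i * 8_640_000L);
        Collections.sort(dates);
        return dates;
    }

    // Assign each key to a chunk given sorted boundaries; return per-chunk row counts.
    static int[] chunkCounts(List<Long> keys, long[] boundaries) {
        int[] counts = new int[boundaries.length + 1];
        for (long k : keys) {
            int c = 0;
            while (c < boundaries.length && k >= boundaries[c]) c++;
            counts[c]++;
        }
        return counts;
    }

    // Naive split: evenly divide the whole representable range [lo, hi) into n chunks.
    static long[] naiveBoundaries(long lo, long hi, int n) {
        long[] b = new long[n - 1];
        long step = (hi - lo) / n;
        for (int i = 0; i < b.length; i++) b[i] = lo + step * (i + 1);
        return b;
    }

    // Guideposts: boundaries taken from the actual data, every (size/n)-th key.
    static long[] guidepostBoundaries(List<Long> sortedKeys, int n) {
        long[] b = new long[n - 1];
        int step = sortedKeys.size() / n;
        for (int i = 0; i < b.length; i++) b[i] = sortedKeys.get(step * (i + 1));
        return b;
    }

    public static void main(String[] args) {
        List<Long> dates = sampleDates();
        long lo = -2_208_988_800_000L, hi = 2_145_916_800_000L; // ~1900 to ~2038
        System.out.println(Arrays.toString(chunkCounts(dates, naiveBoundaries(lo, hi, 4))));
        System.out.println(Arrays.toString(chunkCounts(dates, guidepostBoundaries(dates, 4))));
    }
}
```

Running it prints [0, 0, 0, 900] for the naive split (one thread does everything) and [225, 225, 225, 225] for the guidepost split (all four threads get equal work).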
Super excited about this work you're doing, [~ramkrishna]! I'll take a deeper
look as soon as we post a 3.1/4.1 RC (hopefully tomorrow).
> Use stats to guide query parallelization
> ----------------------------------------
>
> Key: PHOENIX-180
> URL: https://issues.apache.org/jira/browse/PHOENIX-180
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: James Taylor
> Assignee: ramkrishna.s.vasudevan
> Labels: enhancement
> Attachments: Phoenix-180_WIP.patch
>
>
> We're currently not using stats, beyond a table-wide min key/max key cached
> per client connection, to guide parallelization. If a query targets just a
> few regions, we don't know how to evenly divide the work among threads,
> because we don't know the data distribution. This other issue
> (https://github.com/forcedotcom/phoenix/issues/64) is targeting gathering and
> maintaining the stats, while this issue is focused on using the stats.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information
> (and implements the Writable interface so that it can be serialized back from
> the server).
> 2. Add a stats member variable off of PTable to hold this.
> 3. From MetaDataEndPointImpl, lookup the stats row for the table in the stats
> table. If the stats have changed, return a new PTable with the updated stats
> information. We may want to cache the stats row and have the stats gatherer
> invalidate the cache row when updated so we don't have to always do a scan
> for it. Additionally, it would be ideal if we could use the same split policy
> on the stats table that we use on the system table to guarantee co-location
> of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to
> use this information to guide how to chunk up the scans at query time.
> This should help boost query performance, especially in cases where the data
> is highly skewed. It's likely the cause for the slowness reported in this
> issue: https://github.com/forcedotcom/phoenix/issues/47.
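As a rough illustration of step 1 in the quoted description, a stats holder with Writable-style write/readFields methods might look like this (plain java.io stands in for Hadoop's Writable interface to keep the sketch self-contained; the class name, guidepost representation, and accessors are assumptions, with the actual PTableStats shape left to the patch):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PTableStatsSketch {
    private List<byte[]> guidePosts = new ArrayList<>(); // sorted split-point keys

    void addGuidePost(byte[] key) { guidePosts.add(key); }
    List<byte[]> getGuidePosts() { return guidePosts; }

    // Writable-style serialization so the server can send stats back to the client.
    void write(DataOutput out) throws IOException {
        out.writeInt(guidePosts.size());
        for (byte[] gp : guidePosts) {
            out.writeInt(gp.length);
            out.write(gp);
        }
    }

    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        guidePosts = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            byte[] gp = new byte[in.readInt()];
            in.readFully(gp);
            guidePosts.add(gp);
        }
    }

    public static void main(String[] args) throws IOException {
        PTableStatsSketch stats = new PTableStatsSketch();
        stats.addGuidePost("SF1\u00002014-05-01".getBytes());
        stats.addGuidePost("SF1\u00002014-06-01".getBytes());

        // Round-trip through a byte stream, as the server-to-client path would.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        stats.write(new DataOutputStream(bos));
        PTableStatsSketch copy = new PTableStatsSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(copy.getGuidePosts().size()); // prints 2
    }
}
```

A holder like this would hang off PTable (step 2) and be refreshed from the stats table by MetaDataEndPointImpl (step 3), with ParallelIterators.getSplits() consuming the guideposts on the client (step 4).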
--
This message was sent by Atlassian JIRA
(v6.2#6252)