[ 
https://issues.apache.org/jira/browse/PHOENIX-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668113#comment-16668113
 ] 

Bin Shi commented on PHOENIX-4999:
----------------------------------

I have a different opinion: we should allow updating statistics on a 
tenant-specific connection. The reasons are as follows:
 # The base table is defined with MULTI_TENANT = true, so the common case is 
that a user issues queries through its tenant connection, and those queries 
should include both CRUD statements and UPDATE STATISTICS. It isn't friendly 
to ask a tenant-specific user to switch between a table-level connection and 
a tenant connection. I'm not even sure we always allow a tenant-specific 
user to use a table-level connection. 
 # For a tenant-specific user, it is natural to rely either on UPDATE 
STATISTICS over the tenant-specific connection or on major compaction to 
update the stats used by this tenant.
 # Different tenants might update data at different rates and so reach the 
data drift threshold that triggers "Update Statistics" at different speeds. 
When one tenant exceeds the threshold, we should collect/refresh stats for 
that tenant only rather than for the whole table; otherwise we waste time 
and resources. For that reason, we should always allow each tenant to update 
its stats independently. 
 # Yes, allowing "update stats" on a tenant-specific connection for one 
tenant may result in partial stats for the other tenants, but that shouldn't 
be the reason to conclude that "Update statistics should not be allowed on 
tenant specific connection", because it isn't the only cause of partial 
stats. "Update Statistics" is atomic at the region level but not at the 
tenant level. While "UPDATE STATISTICS" runs, whether through a SQL 
statement or MR jobs, any region-level failure could cause partial stats, so 
we need to fix the partial stats issue in general anyway.
 # Stats should be stored to and fetched from the cache per tenant for 
tables with MULTI_TENANT = true.
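
To make point 1 concrete, here is a minimal Java sketch of what a 
tenant-specific user would run: open a connection with the TenantId property 
set and issue UPDATE STATISTICS on it. The class name, helper methods, JDBC 
URL, tenant ID and table name are all placeholders for illustration. Whether 
Phoenix should accept the statement on such a connection is exactly what this 
issue is deciding, so treat this as a sketch, not as supported behavior.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class TenantStatsSketch {
    // Phoenix identifies a tenant-specific connection by the "TenantId"
    // connection property; the tenant value itself is a placeholder.
    static Properties tenantProps(String tenantId) {
        Properties props = new Properties();
        props.setProperty("TenantId", tenantId);
        return props;
    }

    // Builds the statement text; the table name is supplied by the caller.
    static String updateStatsSql(String tableName) {
        return "UPDATE STATISTICS " + tableName;
    }

    // Needs a reachable cluster and the Phoenix JDBC driver on the classpath.
    // The URL (for example "jdbc:phoenix:<zookeeper quorum>") is an assumption.
    static void updateStatsAsTenant(String jdbcUrl, String tenantId, String tableName)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, tenantProps(tenantId));
             Statement stmt = conn.createStatement()) {
            stmt.execute(updateStatsSql(tableName));
        }
    }
}
```

With the second option from the issue description, the execute() call above 
would fail instead of collecting stats for the tenant's key range.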

> Update statistics should not be allowed on tenant specific connection
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-4999
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4999
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>            Priority: Major
>
> Update statistics SQL can trigger partial stats collection when run 
> using a tenant-specific connection. Originally, update statistics internally 
> runs scans on all the regions of the table. On a tenant-specific connection, 
> the TenantId field bounds the scans' startKey and endKey, which can cause 
> stats to run only on specific regions and result in partial stats collection. 
> Since the view data and table data reside in the same physical HBase table, 
> it doesn't make sense to allow users to run stats for specific tenants as 
> tenants may span across regions. The issue was first identified in 
> PHOENIX-4333.
> The patch however doesn't fully stop the SQL from running. Multiple 
> approaches can be taken here. 
>  # Unset the tenantId on the connection before update statistics is run and 
> reset it back later. This can be tricky and bad to implement since tenantId 
> is essentially a final field on PhoenixConnection.
>  # As [~tdsilva] pointed out, we can throw an UnsupportedOperationException() 
> whenever a user tries to update statistics on a tenant-specific connection.
> The second option seems straightforward to implement and can prevent 
> accidental usage of this SQL.
> [~Bin Shi] [~sukumaddineni] Any thoughts here?
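
The partial-stats effect described in the issue can be sketched with a toy 
model: regions are contiguous row-key ranges, and a scan only touches regions 
that overlap its start and stop keys. Everything here is hypothetical (the 
class, the region names, the keys), and plain string comparison stands in for 
HBase's byte-wise key ordering, so this only illustrates the idea and is not 
how Phoenix computes its scan ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartialStatsSketch {
    /** A region covering row keys in [startKey, endKey); "" means unbounded. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Names of the regions that overlap the scan range [scanStart, scanStop). */
    static List<String> regionsScanned(List<Region> regions, String scanStart, String scanStop) {
        List<String> hit = new ArrayList<>();
        for (Region r : regions) {
            boolean startsBeforeScanEnds = scanStop.isEmpty() || r.startKey.compareTo(scanStop) < 0;
            boolean endsAfterScanStarts = r.endKey.isEmpty() || r.endKey.compareTo(scanStart) > 0;
            if (startsBeforeScanEnds && endsAfterScanStarts) {
                hit.add(r.name);
            }
        }
        return hit;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("r1", "", "tenantB"),
                new Region("r2", "tenantB", "tenantD"),
                new Region("r3", "tenantD", ""));

        // Table-level connection: unbounded scan, every region is visited.
        System.out.println(regionsScanned(regions, "", ""));

        // Tenant connection: scan bounded to keys prefixed with "tenantC".
        System.out.println(regionsScanned(regions, "tenantC", "tenantD"));
    }
}
```

In this model the unbounded scan visits r1, r2 and r3, while the scan bounded 
to tenantC visits only r2, so stats for r1 and r3 would not be refreshed.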



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
