[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526097#comment-16526097
 ] 

Peter Vary commented on HIVE-19418:
-----------------------------------

[~sershe], [~alangates]: As per our discussion on hive-dev list (See: [Apache 
dev list 
archive|http://mail-archives.apache.org/mod_mbox/hive-dev/201712.mbox/%3CCDF09DF1-746E-4A4F-8644-8B441F386937%40cloudera.com%3E]),
 I think the consensus was that we would like to keep the original exceptions 
for the MetaStore Thrift API.

Shall I create a new patch which reverts back the part of this change where in 
{{HiveMetaStoreClient.createTable}} we started to throw 
{{InvalidOperationException}} instead of the original {{MetaException}}. Or we 
see enough reasons to reopen the discussion?
 Thanks,
 Peter

> add background stats updater similar to compactor
> -------------------------------------------------
>
>                 Key: HIVE-19418
>                 URL: https://issues.apache.org/jira/browse/HIVE-19418
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>             Fix For: 3.1.0, 4.0.0
>
>         Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.07.patch, HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables 
> to make them usable in a transaction without breaking ACID (for metadata-only 
> optimization). However, stats for ACID tables can still become unusable if 
> e.g. two parallel inserts run - neither sees the data written by the other, 
> so after both finish, the snapshots on either set of stats won't match the 
> current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with 
> some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
> ndvs from two inserts), and for ACID even less can be aggregated (you cannot 
> derive min/max if some rows are deleted but you don't scan the rest of the 
> dataset).
> Therefore we will add background logic to metastore (similar to, and 
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can 
> be expensive, so if the user is only analyzing a subset of tables it should 
> be able to only update that subset). We can simply look at existing stats and 
> only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update. 
> In phase 1, the process will operate outside of compactor, and run analyze 
> command on the table. The analyze command will automatically save the stats 
> with ACID snapshot information if needed, based on HIVE-19416, so we don't 
> need to do any special state management and this will work for all table 
> types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to