[ https://issues.apache.org/jira/browse/HIVE-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569020#comment-16569020 ]
Sergey Shelukhin commented on HIVE-20109:
-----------------------------------------

WIP patch. This turned out to be a much larger feature than I anticipated, since many paths require a different structure due to the location of the flags changing (e.g. the flag is now per stat and not for the entire partition, etc.), as well as some other non-trivial changes (non-trivial as in my brain hurts reading this code; actually, all the changes are very simple logically).

Remaining areas (marked with TODO# comments):
1) Aggregate stats. Purely mechanical task to propagate and verify lists, but painful.
2) Cached store. Mostly mechanical.
3) Conversion script. The existing code sits in a comment and needs to be converted into a tool.
4) Fixing failures in tests that alter stats with alter table, and numerous small bugs that no doubt exist.

I may return to this patch the week after next...

> get rid of COLUMN_STATS_ACCURATE
> --------------------------------
>
>                 Key: HIVE-20109
>                 URL: https://issues.apache.org/jira/browse/HIVE-20109
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HIVE-20109.nogen.patch, HIVE-20109.patch
>
>
> I don't know why anyone would come up with an idea of storing a set of
> booleans in a database using JSON. This has caused various problems in the
> past (text field limitations, perf issues when parsing a giant string; also
> bugs because the way it is set is brittle).
> However, now that we are implementing transactional stats, it becomes
> especially problematic and error-prone because the code in Hive sets C_S_A in
> random places with reckless abandon, whereas we want to change the state of
> the stats in well-defined places where txn semantics can be verified.
> Currently in HIVE-19416, we are handling random things that touch it (from
> the metastore itself to output committers, various stats tasks, commands like
> truncate, etc.) via a pile of hacks, but the best solution would be to remove
> it completely and replace it with a DB table/columns in stats tables that would
> need to be set explicitly, not via generic alter_table.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
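To illustrate the contrast drawn in the issue description, here is a minimal Python sketch. It is not code from the patch. The JSON layout of the legacy value is an assumption about how COLUMN_STATS_ACCURATE is stored, and `invalidate_column_legacy` and `ColumnStatsState` are hypothetical names invented for this example. The first function shows the parse-modify-serialize round trip that a JSON-encoded set of booleans forces on every caller; the class shows per-column flags that can only be changed through explicit calls.

```python
import json

# Assumed shape of the legacy parameter value (not taken from the issue text).
LEGACY_VALUE = '{"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}'


def invalidate_column_legacy(param_value, column):
    """Parse the whole blob, drop one column's flag, and re-serialize it."""
    state = json.loads(param_value)
    state.get("COLUMN_STATS", {}).pop(column, None)
    return json.dumps(state, separators=(",", ":"))


class ColumnStatsState:
    """Hypothetical replacement: one explicit flag per column statistic."""

    def __init__(self):
        # Maps column name -> bool; stands in for a dedicated DB column.
        self._accurate = {}

    def mark_accurate(self, column):
        self._accurate[column] = True

    def invalidate(self, column):
        self._accurate[column] = False

    def is_accurate(self, column):
        # Unknown columns are treated as not accurate.
        return self._accurate.get(column, False)
```

With the explicit form, every state change goes through a named method, which is roughly the property the description asks for when it says the flags should be set in well-defined places rather than via generic alter_table.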