[ https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman reassigned HIVE-17328:
-------------------------------------

    Assignee:     (was: Eugene Koifman)

> Remove special handling for Acid tables wherever possible
> ---------------------------------------------------------
>
>                 Key: HIVE-17328
>                 URL: https://issues.apache.org/jira/browse/HIVE-17328
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Priority: Major
>
> There are various places in the code that do something like
> {noformat}
> if(acid update or delete) {
>   do something
> }
> else {
>   do something else
> }
> {noformat}
> This complicates the code and means that the Acid code path is not properly
> tested by many new non-Acid features or bug fixes.
> Some work to simplify this was done in HIVE-15844.
> _SortedDynPartitionOptimizer_ has some special logic.
> _ReduceSinkOperator_ relies on the partitioning columns for update/delete
> being _UDFToInteger(RecordIdentifier)_, which is set up in _SemanticAnalyzer_.
> Consequently _SemanticAnalyzer_ has special logic to set it up.
> _FileSinkOperator_ has some specialization.
> _AbstractCorrelationProcCtx_ makes changes specific to Acid writes, setting
> hive.optimize.reducededuplication.min.reducer=1
> With Acid 2.0 (HIVE-17089) a lot more of it can be simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except
> that the writer in _FileSinkOperator_ is Acid specific.
> So all the specialization is to route Update/Delete events to the right place.
> We can do the U=D+I early in the operator pipeline so that an Update is a
> Hive multi-insert with 1 leg being the Insert leg and the other being the
> Delete leg (like the Merge stmt).
> The Delete events themselves don't need to be routed in any particular way if
> we always ship all delete_delta files for each split. This is OK since
> delete events are very small and highly compressible. What is shipped is
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
> If need be, we can also have the compactor rewrite the delete files so that
> the name of each file matches its contents, making it as if they were
> bucketed properly, and use that to reduce what needs to be shipped for each
> split.
> This may help with some extreme cases where someone updates 1B rows.
> This would in particular allow DISTRIBUTE BY for update/delete.
> Is this currently supported for Acid insert?
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
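
The U=D+I idea in the description can be illustrated with a small, self-contained sketch. This is not Hive code: `RecordId`, `Row`, `Event`, and `split_update` are hypothetical names invented for illustration, and the row-id assignment is deliberately simplified. It only shows how an Update could be expanded early into a Delete leg and an Insert leg, so that nothing downstream needs an "if acid update" branch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class RecordId:
    """Hypothetical stand-in for a row identifier (write id, bucket, row id)."""
    write_id: int
    bucket: int
    row_id: int


@dataclass(frozen=True)
class Row:
    rid: RecordId
    values: Tuple


@dataclass(frozen=True)
class Event:
    kind: str          # "insert" or "delete"
    rid: RecordId
    values: Tuple = ()  # delete events carry only the id of the deleted row


def split_update(
    rows: Iterable[Row],
    new_values: Callable[[Tuple], Tuple],
    current_write_id: int,
) -> Iterator[Event]:
    """Expand each updated row into a delete event plus an insert event.

    Assumption: the new row keeps the old bucket and gets a fresh row id
    from a simple counter; a real writer would assign ids differently.
    """
    for n, row in enumerate(rows):
        # Delete leg: refers to the existing row by its original id.
        yield Event("delete", row.rid)
        # Insert leg: an ordinary insert under the current write id.
        yield Event(
            "insert",
            RecordId(current_write_id, row.rid.bucket, n),
            new_values(row.values),
        )
```

With this shape, the insert events can follow whatever path a regular insert follows, and the delete events need no particular routing as long as all delete files are shipped with each split, which is the simplification the description proposes.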