[ 
https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-17328:
-------------------------------------

    Assignee:     (was: Eugene Koifman)

> Remove special handling for Acid tables wherever possible
> ---------------------------------------------------------
>
>                 Key: HIVE-17328
>                 URL: https://issues.apache.org/jira/browse/HIVE-17328
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Priority: Major
>
> There are various places in the code that do something like 
> {noformat}
> if (acid update or delete) {
>   do something
> } else {
>   do something else
> }
> {noformat}
> This complicates the code and means that the Acid code path is not properly 
> tested by many new non-Acid features or bug fixes.
> Some work to simplify this was done in HIVE-15844.
> _SortedDynPartitionOptimizer_ has some special logic.
> _ReduceSinkOperator_ relies on the partitioning columns for update/delete being 
> _UDFToInteger(RecordIdentifier)_, which is set up in _SemanticAnalyzer_.  
> Consequently _SemanticAnalyzer_ has special logic to set it up.
> _FileSinkOperator_ has some specialization.
> _AbstractCorrelationProcCtx_ makes changes specific to Acid writes, setting 
> hive.optimize.reducededuplication.min.reducer=1.
> With Acid 2.0 (HIVE-17089) a lot more of it can be simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except 
> that the writer in _FileSinkOperator_ is Acid specific.
> So all the specialization is to route Update/Delete events to the right place.
> We can do the U=D+I (Update = Delete + Insert) split early in the operator 
> pipeline so that an Update is a Hive multi-insert with one leg being the 
> Insert leg and the other being the Delete leg (like the Merge stmt).
> The Delete events themselves don't need to be routed in any particular way if 
> we always ship all delete_delta files for each split.  This is ok since 
> delete events are very small and highly compressible.  What is shipped is 
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
> If need be, we can also have the compactor rewrite the delete files so that 
> the name of each file matches its contents, making it as if they were 
> bucketed properly, and use that to reduce what needs to be shipped for each split.  
> This may help with some extreme cases where someone updates 1B rows.
> This would in particular allow DISTRIBUTE BY for update/delete.
> Is this currently supported for Acid insert?
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
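
The U=D+I idea in the quoted description can be illustrated with a small standalone sketch. This is not Hive code: the class UpdateSplitSketch, the types RowId and Event, and the method splitUpdate are all hypothetical names invented for illustration, and the String payload merely stands in for a row's column values. It only shows the shape of the rewrite, in which one update yields a delete event for the old row identity plus an ordinary insert of the new values.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: none of these types are Hive classes.
class UpdateSplitSketch {

    // Hypothetical stand-in for the identity of an existing Acid row.
    static final class RowId {
        final long writeId;
        final int bucket;
        final long rowId;

        RowId(long writeId, int bucket, long rowId) {
            this.writeId = writeId;
            this.bucket = bucket;
            this.rowId = rowId;
        }
    }

    enum Kind { DELETE, INSERT }

    // One event produced by a leg of the (hypothetical) multi-insert.
    static final class Event {
        final Kind kind;
        final RowId target;   // set for DELETE, null for INSERT
        final String payload; // set for INSERT, null for DELETE

        Event(Kind kind, RowId target, String payload) {
            this.kind = kind;
            this.target = target;
            this.payload = payload;
        }
    }

    // U = D + I: an update becomes a delete of the old row plus an insert
    // of the new values, so each leg can follow its own regular code path.
    static List<Event> splitUpdate(RowId oldRow, String newPayload) {
        List<Event> legs = new ArrayList<>(2);
        legs.add(new Event(Kind.DELETE, oldRow, null));
        legs.add(new Event(Kind.INSERT, null, newPayload));
        return legs;
    }

    public static void main(String[] args) {
        for (Event e : splitUpdate(new RowId(5L, 0, 42L), "new value")) {
            String target = (e.target == null) ? "-" : String.valueOf(e.target.rowId);
            System.out.println(e.kind + " target=" + target + " payload=" + e.payload);
        }
    }
}
```

In this sketch the insert leg carries no row identity at all, which mirrors the point in the description that an Acid insert can follow the regular insert path; only the delete leg needs to know which existing row it refers to.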



