[jira] [Assigned] (HIVE-17328) Remove special handling for Acid tables wherever possible

2021-07-28 Thread Eugene Koifman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-17328:
-

Assignee: (was: Eugene Koifman)

> Remove special handling for Acid tables wherever possible
> -
>
> Key: HIVE-17328
> URL: https://issues.apache.org/jira/browse/HIVE-17328
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Priority: Major
>
> There are various places in the code that do something like 
> {noformat}
> if(acid update or delete) {
>  do something
> }
> else {
> do something else
> }
> {noformat}
> this complicates the code and makes it so that acid code path is not properly 
> tested in many new non-acid features or bug fixes.
> Some work to simplify this was done in HIVE-15844.
> _SortedDynPartitionOptimizer_ has some special logic
> _ReduceSinkOperator_ relies on partitioning columns for update/delete be 
> _UDFToInteger(RecordIdentifier)_ which is set up in _SemanticAnalyzer_.  
> Consequently _SemanticAnalyzer_ has special logic to set it up.
> _FileSinkOperator_ has some specialization.
> _AbstractCorrelationProcCtx_ makes changes specific to acid writes setting 
> hive.optimize.reducededuplication.min.reducer=1
> With acid 2.0 (HIVE-17089) a lot more of it can simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except 
> that the writer in _FileSinkOperator_ is Acid specific.
> So all the specialization is to route Update/Delete events to the right place.
> We can do the U=D+I early in the operator pipeline so that an Update is a 
> Hive multi-insert with 1 leg being the Insert leg and the other being the 
> Delete leg (like Merge stmt).
> The Delete events themselves don't need to be routed in any particular way if 
> we always ship all delete_delta files for each split.  This is ok since 
> delete events are very small and highly compressible.  What is shipped is 
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
> If need be we can also have the compactor rewrite the delete files so that 
> the name of the file matches the contents and make it as if they were 
> bucketed properly and use it reduce what needs to be shipped for each split.  
> This may help with some extreme cases where someone updates 1B rows.
> This would in particular allow DISTRIBUTE BY for update/delete
> Is this currently supported for Acid insert?
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-17328) Remove special handling for Acid tables wherever possible

2017-08-15 Thread Eugene Koifman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-17328:
-


> Remove special handling for Acid tables wherever possible
> -
>
> Key: HIVE-17328
> URL: https://issues.apache.org/jira/browse/HIVE-17328
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>
> There are various places in the code that do something like 
> if(acid update or delete) {
>  do something
> }
> else {
> do something else
> }
> this complicates the code and makes it so that acid code path is not properly 
> tested in many new non-acid features or bug fixes.
> Some work to simplify this was done in HIVE-15844.
> SortedDynPartitionOptimizer has some special logic
> ReduceSinkOperator relies on partitioning columns for update/delete be 
> UDFToInteger(RecordIdentifier) which is set up in SemanticAnalyzer.  
> Consequently SemanticAnalyzer has special logic to set it up.
> FileSinkOperator has some specialization.
> AbstractCorrelationProcCtx makes changes specific to acid writes setting 
> hive.optimize.reducededuplication.min.reducer=1
> With acid 2.0 (HIVE-17089) a lot more of it can simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except 
> that the writer in FileSinkOperator is Acid specific.
> So all the specialization is to route Update/Delete events to the right place.
> We can do the U=D+I early in the operator pipeline so that an Update is a 
> Hive multi-insert with 1 leg being the Insert leg and the other being the 
> Delete leg (like Merge stmt).
> The Delete events themselves don't need to be routed in any particular way if 
> we always ship all delete_delta files for each split.  This is ok since 
> delete events are very small and highly compressible.  What is shipped is 
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
> If need be we can also have the compactor rewrite the delete files so that 
> the name of the file matches the contents and make it as if they were 
> bucketed properly and use it reduce what needs to be shipped for each split.  
> This may help with some extreme cases where someone updates 1B rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)