[
https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko resolved HIVE-27899.
-----------------------------------
Fix Version/s: 4.2.0
Resolution: Fixed
> Killed speculative execution task attempt should not commit file
> ----------------------------------------------------------------
>
> Key: HIVE-27899
> URL: https://issues.apache.org/jira/browse/HIVE-27899
> Project: Hive
> Issue Type: Sub-task
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
> Attachments: image-2023-11-23-16-21-20-244.png, reproduce_bug.md
>
>
> As I mentioned in HIVE-25561, when Tez speculative execution is enabled, the
> data files produced by Hive may be duplicated. HIVE-25561 described the case
> where a speculatively executed task attempt is killed and some of its data
> is committed unexpectedly. However, one situation remains unsolved after
> HIVE-25561: if two task attempts commit their files at the same time,
> duplicate data files can still occur. The probability of this is very low,
> but it does happen.
>
> Why?
> There are two key steps:
> (1) FileSinkOperator::closeOp
> TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp
> --> fsp.commit
> When the operator is closed, closeOp runs and eventually calls fsp.commit.
> (2) removeTempOrDuplicateFiles
> (2.a) First, call listStatus on the files in the temporary directory. (Note:
> in the latest version, this corresponds to getNonEmptySubDirs.)
> (2.b) Second, check whether there are multiple incorrect commits, and finally
> move the correct results to the final directory. (Note: in the latest
> version, this corresponds to removeTempOrDuplicateFilesNonMm.)
> With speculative execution enabled, when one attempt of a task completes,
> the other attempts are killed. However, the AM only sends the kill event and
> does not ensure that all cleanup actions have completed, so the killed
> attempt's closeOp may execute between 2.a and 2.b. In that case,
> removeTempOrDuplicateFiles will not delete the file generated by the killed
> attempt, because that file was not in the listing taken in 2.a.
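
The window between 2.a and 2.b can be illustrated with a small in-memory sketch. This is not Hive code: the class, the method names, and the "keep the first listed file per task" rule are simplifying assumptions, and the file names only imitate a taskId_attemptId pattern. It shows that a file committed after the listing survives deduplication, while the same file committed before the listing is removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StaleListingRace {

    // Hypothetical stand-in for step 2.b: keep the first listed file per task ID
    // and delete every other file that appears in the snapshot.
    static void removeDuplicates(Set<String> dir, List<String> snapshot) {
        Map<String, String> keptByTask = new HashMap<>();
        for (String file : snapshot) {
            String taskId = file.substring(0, file.indexOf('_'));
            String alreadyKept = keptByTask.putIfAbsent(taskId, file);
            if (alreadyKept != null) {
                dir.remove(file); // duplicate that was visible in the listing
            }
        }
    }

    // Simulates one run; the flag decides whether the killed attempt's late
    // commit lands before or after the listing of step 2.a.
    static Set<String> run(boolean lateCommitBeforeListing) {
        Set<String> dir = new TreeSet<>();      // simulated temporary directory
        dir.add("000000_0");                    // winning attempt commits
        if (lateCommitBeforeListing) {
            dir.add("000000_1");                // killed attempt commits early enough
        }
        List<String> snapshot = new ArrayList<>(dir); // step 2.a: list the files
        if (!lateCommitBeforeListing) {
            dir.add("000000_1");                // killed attempt commits between 2.a and 2.b
        }
        removeDuplicates(dir, snapshot);        // step 2.b: works on the snapshot only
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints [000000_0]
        System.out.println(run(false)); // prints [000000_0, 000000_1]
    }
}
```

In the second run the duplicate is never seen by the cleanup step, which matches the failure described above.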
> How?
> The problem is that both speculatively executed task attempts commit their
> files. This does not happen in the Tez examples because they call canCommit
> first, which guarantees that one and only one task attempt commits
> successfully. If one task attempt passes canCommit, the other is blocked in
> canCommit until it receives a kill signal.
> For details, see:
> [https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70]
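
The canCommit behaviour described above can be sketched without Tez on the classpath. CommitArbiter and commitIfAllowed below are hypothetical names standing in for the AM-side decision and the processor-side polling loop; they are not Tez or Hive APIs, and the kill signal is modelled simply as the polling loop running out of attempts.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CanCommitSketch {

    // Hypothetical stand-in for the AM: the first attempt to ask wins,
    // and every later question from another attempt is answered with false.
    static final class CommitArbiter {
        private final AtomicReference<String> winner = new AtomicReference<>();

        boolean canCommit(String attemptId) {
            winner.compareAndSet(null, attemptId); // only the first caller is recorded
            return attemptId.equals(winner.get());
        }
    }

    // Polls canCommit before committing. Returns true if the attempt was allowed
    // to commit, false if it gave up (modelling the kill signal arriving).
    static boolean commitIfAllowed(CommitArbiter am, String attemptId, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (am.canCommit(attemptId)) {
                return true; // safe to commit the output file
            }
        }
        return false; // never granted: the attempt must not commit
    }

    public static void main(String[] args) {
        CommitArbiter am = new CommitArbiter();
        System.out.println("attempt_0 committed=" + commitIfAllowed(am, "attempt_0", 3));
        System.out.println("attempt_1 committed=" + commitIfAllowed(am, "attempt_1", 3));
    }
}
```

With this gate in front of the commit, only attempt_0 commits and attempt_1 never writes a duplicate file, regardless of when its cleanup runs.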