[ https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860727#comment-17860727 ]
Chenyu Zheng edited comment on HIVE-27985 at 6/28/24 7:21 AM:
--------------------------------------------------------------

[~glapark] Thanks for your reply, and sorry for missing this comment.

I don't think non-deterministic results are specific to speculative execution; the same thing happens with task attempt reruns. In our production we have encountered statements like "distribute by rand()". In that case, when a task attempt fails and another attempt of the same task reruns, we may get a different result. Since randomness is introduced, each run of the task may produce different output, so as long as the task runs successfully, its result should be regarded as "correct".

In my experience, the problem of duplicate files comes from task attempt retries; speculative execution just increases the probability of hitting it.

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 4.0.0
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: how tez examples commit.png
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when speculative execution is enabled. Hive identifies and removes duplicate files through removeTempOrDuplicateFiles, but this logic often does not take effect. For example, a killed task attempt may commit files while this method is executing, or files under HIVE_UNION_SUBDIR_X are not recognized during union all. There are many issues that try to solve these problems, mainly focusing on how to identify duplicate files. *This issue instead solves the problem by avoiding the generation of duplicate files in the first place.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and Tez examples do not have this problem: with a properly designed OutputCommitter, duplicate files can be avoided. Let's analyze how Tez avoids duplicate files.
> {color:#172b4d}_Note: Compared with Tez, Hadoop MapReduce has one extra commitPending step, which is not critical, so only Tez is analyzed._{color}
> !how tez examples commit.png|width=778,height=483!
> Let's walk through the steps:
> * (1) *process records*: Process the records.
> * (2) *send canCommit request*: After all records are processed, call canCommit remotely on the AM.
> * (3) *update commitAttempt*: When the AM receives the canCommit request, it checks whether another task attempt of the current task has already called canCommit. If no other task attempt called canCommit first, it returns true; otherwise it returns false. This ensures that only one task attempt commits for each task.
> * (4) *return canCommit response*: The task attempt receives the AM's response.
> If true is returned, the attempt can commit; if false, another task attempt has already started the commit, so this attempt must not commit. The attempt then loops back to (2), calling canCommit repeatedly until it is killed or the other attempt fails.
> * (5) *output.commit*: Execute the commit, i.e. rename the generated temporary file to the final file.
> * (6) *notify succeeded*: Although the task attempt has produced the final file, the AM still needs to be told that its work is complete, so the attempt reports completion through the heartbeat.
> There is a problem in the steps above: if an exception occurs in the task after (5) and before (6), the AM does not know that the task attempt has completed, so it will start a new task attempt, and the new attempt will generate a new file, causing duplication. I added code that randomly throws exceptions between (5) and (6), and found that the Tez example still did not produce duplicate data. Why? Mainly because the final file name is the same regardless of which task attempt generated it. When a new task attempt commits and finds that the final file already exists (generated by a previous attempt), it deletes it first and then renames. So regardless of whether the previous task attempt committed normally, the last successful task attempt replaces the previous result.
> To summarize, tez-examples uses two methods to avoid duplicate files:
> * (1) Avoid repeated commits through canCommit. This is particularly effective for tasks with speculative execution turned on.
> * (2) The final file names generated by different task attempts are the same. Combined with canCommit, this guarantees that only one file is generated in the end, and only by a successful task attempt.
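The two safeguards summarized above can be sketched in a few lines of Java. This is a hypothetical illustration, not actual Tez code: `CommitSketch`, `canCommit`, and `commit` are made-up names modeling the AM-side canCommit gate and an idempotent rename to a shared final file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not real Tez classes) of the two mechanisms above:
// (1) an AM-side canCommit gate so at most one attempt per task commits,
// (2) an idempotent commit where every attempt renames to the SAME final
//     file name, replacing any half-committed result of an earlier attempt.
public class CommitSketch {

    // AM-side bookkeeping: taskId -> id of the first attempt that asked to
    // commit. putIfAbsent makes the check-and-set atomic, so at most one
    // attempt per task ever receives "true" (aside from its own retries).
    private final Map<String, String> committingAttempt = new ConcurrentHashMap<>();

    public boolean canCommit(String taskId, String attemptId) {
        String winner = committingAttempt.putIfAbsent(taskId, attemptId);
        // true if we registered first, or if we are re-asking after winning
        return winner == null || winner.equals(attemptId);
    }

    // Commit step (5): rename the attempt's temporary file to the final
    // name. REPLACE_EXISTING means a later successful attempt cleanly
    // overwrites a file left behind by an earlier, failed attempt.
    public void commit(Path tmpFile, Path finalFile) throws IOException {
        Files.move(tmpFile, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With this shape, a rerun attempt that commits after a crashed predecessor produces exactly one final file, because it targets the same name and replaces whatever is already there.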
> *3 Why can't Hive on Tez avoid duplicate files?*
> Hive on Tez has neither of the two mechanisms used in the Tez examples.
> First, Hive on Tez does not call canCommit: TezProcessor inherits from AbstractLogicalIOProcessor, while the canCommit logic of the Tez examples lives in SimpleMRProcessor.
> Second, the file names generated by different attempts of the same task under Hive on Tez are not the same: the file generated by the first attempt of a task is 000000_0, while the file generated by the second attempt is 000000_1.
> *4 How to improve?*
> Use canCommit to ensure that speculative task attempts do not commit at the same time. (HIVE-27899)
> Let different task attempts of each task generate the same final file name. (HIVE-27986)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)