hujiahua created KYLIN-5163:
-------------------------------

             Summary: Global dictionary build job may produce an incomplete 
dictionary file
                 Key: KYLIN-5163
                 URL: https://issues.apache.org/jira/browse/KYLIN-5163
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v4.0.1
            Reporter: hujiahua


The current Spark dictionary build job calls 
`NBucketDictionary.saveBucketDict` to write the dictionary files (both the CURR 
file and the PREV file) for each partition. It does not account for multiple 
tasks running concurrently for the same partition, which can happen with task 
retries or speculative execution. Concurrent tasks for one partition can leave 
an incomplete dictionary file behind, and we have hit this issue in production.

The timeline of the issue is as follows: 
1. During the dictionary build phase, an executor E1 was about to build the 
dictionary file for partition 0.
2. The driver sent E1 a shutdown message because of YARN resource preemption. 
The driver then marked the task for partition 0 as failed and scheduled a retry 
task on another executor, E2.
3. E2 began processing the task and finished it quickly.
4. After E2 finished, E1 began processing its task: it deleted the complete 
dictionary file created by E2 and created a new dictionary file to write to.
5. E1 then received the driver's shutdown message and killed itself, leaving 
behind an incomplete dictionary file.
6. After the other partitions finished, the stage was marked successful.
7. When the next phase, table encoding, used the incomplete dictionary file, 
the stage failed.
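
The timeline above can be reproduced in miniature. The following is only a 
sketch under stated assumptions, not the Kylin code and not necessarily the fix 
adopted for this issue: it uses local files via `java.nio.file` instead of the 
Hadoop `FileSystem` that the real job writes through, and the class name, 
method names, attempt ids, and the `written` parameter (which simulates a task 
being killed partway through) are all hypothetical. It contrasts the 
delete-then-write-in-place pattern with one common mitigation: each attempt 
writes to its own temporary file and publishes it with a rename only after the 
write completes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names exist in Kylin.
final class AttemptSafeDictWriter {

    // Unsafe pattern from the timeline: delete the target, then write in place.
    // `written` simulates how many entries were flushed before the task died.
    static void writeInPlace(Path target, List<String> entries, int written)
            throws IOException {
        Files.deleteIfExists(target);
        Files.write(target, entries.subList(0, written), StandardCharsets.UTF_8);
    }

    // Safer pattern: write to an attempt-scoped temp file, then publish it
    // with a rename only if the whole dictionary was written.
    static void writeViaRename(Path target, String attemptId,
                               List<String> entries, int written)
            throws IOException {
        Path tmp = target.resolveSibling(
                target.getFileName() + "." + attemptId + ".tmp");
        Files.write(tmp, entries.subList(0, written), StandardCharsets.UTF_8);
        if (written < entries.size()) {
            return; // simulated kill mid-write: the published file is untouched
        }
        // Whether an existing target is replaced by ATOMIC_MOVE is
        // implementation specific; on POSIX file systems it is replaced.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dict");
        List<String> entries = Arrays.asList("a", "b", "c", "d");

        // E2 (retry) finishes first, then the doomed E1 overwrites its output.
        Path unsafe = dir.resolve("CURR_unsafe");
        writeInPlace(unsafe, entries, 4); // E2 completes
        writeInPlace(unsafe, entries, 1); // E1 is killed after one entry
        System.out.println("in place:   "
                + Files.readAllLines(unsafe).size() + " of 4 entries");

        Path safe = dir.resolve("CURR_safe");
        writeViaRename(safe, "attempt-1", entries, 4); // E2 completes
        writeViaRename(safe, "attempt-0", entries, 1); // E1 is killed
        System.out.println("via rename: "
                + Files.readAllLines(safe).size() + " of 4 entries");
    }
}
```

Running `main` prints `in place:   1 of 4 entries` followed by 
`via rename: 4 of 4 entries`. On HDFS the rename step would have different 
semantics from a local file system, so a real fix would need to be checked 
against the storage layer actually in use.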
