hujiahua created KYLIN-5163:
-------------------------------
Summary: Global dictionary build job may produce an incomplete
dictionary file
Key: KYLIN-5163
URL: https://issues.apache.org/jira/browse/KYLIN-5163
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v4.0.1
Reporter: hujiahua
The current Spark dictionary build job uses the function
`NBucketDictionary.saveBucketDict` to write the dictionary files (the CURR file
and the PREV file) for each partition. It does not account for several tasks
running concurrently for the same partition, as can happen with task retry or
speculative execution. Concurrent tasks for one partition can leave an
incomplete dictionary file, and we have encountered this issue in production.
The timeline of the issue is as follows:
1. During the dictionary building phase, an executor called E1 was
preparing to build the dictionary file for partition 0.
2. The driver sent E1 a shutdown message because of YARN resource preemption.
The driver then marked the task for partition 0 as failed and scheduled a retry
task on another executor called E2.
3. E2 began to process the task and finished it in a short time.
4. After E2 finished, E1 began to process its task: it deleted the complete
dictionary file created by E2 and created a new dictionary file to write to.
5. E1 then received the driver's shutdown message and killed itself, leaving
behind an incomplete dictionary file.
6. After the other partitions finished, the stage was marked successful.
7. When the next phase, table encoding, uses the incomplete dictionary file,
that stage will fail.
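
One possible way to avoid the race in steps 4 and 5 is sketched below. This is not the existing Kylin code and not a proposed patch. It is a minimal illustration on the local filesystem, assuming that every attempt for a partition produces an equivalent dictionary. The class name `AttemptSafeDictWriter`, the method `write`, and the temp file naming are hypothetical. Each task attempt writes to a temp file of its own and only publishes it with a rename once the write is complete, so an attempt killed mid-write never touches a file that another attempt has already finished.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kylin.
final class AttemptSafeDictWriter {

    /**
     * Writes the entries to a temp file that is unique to this task attempt,
     * then publishes it under the final name with a single rename.
     * The final file is never deleted or opened for writing here.
     */
    static Path write(Path bucketDir, String fileName, long attemptId,
                      List<String> entries) throws IOException {
        Files.createDirectories(bucketDir);
        // Assumed naming scheme: one temp file per task attempt.
        Path tmp = bucketDir.resolve(fileName + ".attempt-" + attemptId + ".tmp");
        Path dst = bucketDir.resolve(fileName);

        // All partial state lives in the attempt-private temp file.
        Files.write(tmp, entries, StandardCharsets.UTF_8);

        // Publish in one step. Whether an existing target is replaced by an
        // atomic move is implementation specific; on POSIX filesystems the
        // underlying rename replaces it.
        Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
        return dst;
    }
}
```

With this shape, if E1 is killed while writing, only its own `.attempt-N.tmp` file is left incomplete and the file published by E2 stays intact. On HDFS the same idea would rely on the filesystem's rename, whose behaviour when the target already exists differs from the local case and would need to be checked separately.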