[ https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhu Zhu updated FLINK-14163: ---------------------------- Description: Currently {{Execution#producedPartitions}} is assigned after the partitions have completed the registration to shuffle master in {{Execution#registerProducedPartitions(...)}}. The partition registration is an async interface ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so {{Execution#producedPartitions}} is possible[1] not set when used. Usages includes: 1. deploying this task, so that the task may be deployed without its result partitions assigned, and the job would hang. (DefaultScheduler issue only, since legacy scheduler handled this case) 2. generating input descriptors for downstream tasks: 3. retrieve {{ResultPartitionID}} for partition releasing: [1] If a user uses Flink default shuffle master {{NettyShuffleMaster}}, it is not problematic at the moment since it returns a completed future on registration, so that it would be a synchronized process. However, if users implement their own shuffle service in which the {{ShuffleMaster#registerPartitionWithProducer}} returns an pending future, it can be a problem. This is possible since customizable shuffle service is open to users since 1.9 (via config "shuffle-service-factory.class"). To avoid issues to happen, we may either 1. fix all the usages of {{Execution#producedPartitions}} regarding the async assigning, or 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync interface was: Currently {{Execution#producedPartitions}} is assigned after the partitions have completed the registration to shuffle master in {{Execution#registerProducedPartitions(...)}}. The partition registration is an async interface ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so {{Execution#producedPartitions}} is possible[1] not set when used. Usages includes: 1. deploying this task, so that the task may be deployed without its result partitions assigned, and the job would hang. (DefaultScheduler issue only, since legacy scheduler handled this case) 2. generating input descriptors for downstream tasks: 3. retrieve {{ResultPartitionID}} for partition releasing: [1] If a user uses Flink default shuffle master {{NettyShuffleMaster}}, it is not problematic at the moment since it returns a completed future on registration, so that it would be a synchronized process. However, if users implement their own shuffle service in which the {{ShuffleMaster#registerPartitionWithProducer}} returns an pending future, it can be a problem. This is possible since customizable shuffle service is open to users since 1.9 (via config "shuffle-service-factory.class"). To avoid issues to happen, we may either 1. fix all the usages of {{Execution#producedPartitions}} regarding the async set, or 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync interface > Execution#producedPartitions is possibly not assigned when used > --------------------------------------------------------------- > > Key: FLINK-14163 > URL: https://issues.apache.org/jira/browse/FLINK-14163 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.9.0, 1.10.0 > Reporter: Zhu Zhu > Priority: Major > Fix For: 1.10.0 > > > Currently {{Execution#producedPartitions}} is assigned after the partitions > have completed the registration to shuffle master in > {{Execution#registerProducedPartitions(...)}}. > The partition registration is an async interface > ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so > {{Execution#producedPartitions}} is possible[1] not set when used. > Usages includes: > 1. deploying this task, so that the task may be deployed without its result > partitions assigned, and the job would hang. (DefaultScheduler issue only, > since legacy scheduler handled this case) > 2. generating input descriptors for downstream tasks: > 3. retrieve {{ResultPartitionID}} for partition releasing: > [1] If a user uses Flink default shuffle master {{NettyShuffleMaster}}, it is > not problematic at the moment since it returns a completed future on > registration, so that it would be a synchronized process. However, if users > implement their own shuffle service in which the > {{ShuffleMaster#registerPartitionWithProducer}} returns an pending future, it > can be a problem. This is possible since customizable shuffle service is open > to users since 1.9 (via config "shuffle-service-factory.class"). > To avoid issues to happen, we may either > 1. fix all the usages of {{Execution#producedPartitions}} regarding the async > assigning, or > 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync > interface -- This message was sent by Atlassian Jira (v8.3.4#803005)