So the crux is that the JobManager has a location for the sender task (and
tells that to the receivers) before the senders have registered their
transfer queues.

Can we just establish a "happens-before" there?
 - The TaskManager may send the "ack" to the deployment call only after all
queues are registered (might even be like this now)
 - The job manager updates receivers only with locations of senders that
have switched to "running", not with ones that are in "deploying".

Would that fix it?
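
To make the proposed ordering concrete, here is a minimal sketch of the two
rules above. All class and method names (HappensBeforeSketch,
TaskManagerSketch, JobManagerSketch, onDeploymentAck, ...) are hypothetical
and are not the actual Flink classes. It only illustrates the intended
ordering: queues are registered before the ack, and receivers learn a sender
location only after the sender is "running".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; this is not Flink's real API.
final class HappensBeforeSketch {

    enum ExecutionState { DEPLOYING, RUNNING }

    /** TaskManager side: register all queues first, acknowledge afterwards. */
    static final class TaskManagerSketch {
        private final Map<String, List<String>> registeredQueues = new HashMap<>();

        /** The returned value stands in for the "ack" of the deployment call. */
        synchronized boolean deploy(String taskId, List<String> queueIds) {
            registeredQueues.put(taskId, new ArrayList<>(queueIds));
            return true; // sent only after the registration above is done
        }

        synchronized boolean hasRegisteredQueues(String taskId) {
            return registeredQueues.containsKey(taskId);
        }
    }

    /** JobManager side: hand out sender locations only for RUNNING senders. */
    static final class JobManagerSketch {
        private final Map<String, ExecutionState> states = new HashMap<>();
        private final Map<String, String> locations = new HashMap<>();
        private final Map<String, List<String>> waitingReceivers = new HashMap<>();
        private final List<String> sentUpdates = new ArrayList<>();

        synchronized void startDeployment(String senderId, String location) {
            states.put(senderId, ExecutionState.DEPLOYING);
            locations.put(senderId, location);
        }

        /** Empty while the sender is still deploying; the receiver is parked. */
        synchronized Optional<String> requestSenderLocation(String receiverId, String senderId) {
            if (states.get(senderId) == ExecutionState.RUNNING) {
                return Optional.of(locations.get(senderId));
            }
            waitingReceivers.computeIfAbsent(senderId, k -> new ArrayList<>()).add(receiverId);
            return Optional.empty();
        }

        /** Called once the TaskManager ack arrives: switch state, flush updates. */
        synchronized void onDeploymentAck(String senderId) {
            states.put(senderId, ExecutionState.RUNNING);
            List<String> receivers = waitingReceivers.remove(senderId);
            if (receivers == null) {
                return;
            }
            for (String receiverId : receivers) {
                sentUpdates.add(receiverId + " <- " + senderId + "@" + locations.get(senderId));
            }
        }

        synchronized List<String> sentUpdates() {
            return new ArrayList<>(sentUpdates);
        }
    }

    public static void main(String[] args) {
        TaskManagerSketch tm = new TaskManagerSketch();
        JobManagerSketch jm = new JobManagerSketch();

        jm.startDeployment("sender-1", "tm-host:6121");
        // Receiver asks too early: no location is handed out yet.
        System.out.println(jm.requestSenderLocation("receiver-1", "sender-1").isPresent());

        List<String> queues = new ArrayList<>();
        queues.add("queue-0");
        if (tm.deploy("sender-1", queues)) {
            jm.onDeploymentAck("sender-1");
        }
        System.out.println(jm.sentUpdates());
    }
}
```

In this sketch a receiver that asks too early is parked and updated later,
which would reuse the existing "update channel at runtime" path rather than
letting the request fail. Whether the real deployment ack can be delayed
like this is exactly the open question.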


On Thu, Jan 22, 2015 at 2:51 AM, Ufuk Celebi <[email protected]> wrote:

> On 22 Jan 2015, at 11:37, Till Rohrmann <[email protected]> wrote:
>
> > I'm not sure whether it is currently possible to schedule first the
> > receiver and then the sender. Recently, I had to fix the
> > TaskManagerTest.testRunWithForwardChannel test case where this was
> > exactly the case. Due to first scheduling the receiver, it happened
> > sometimes that an IllegalQueueIteratorRequestException in the method
> > IntermediateResultPartitionManager.getIntermediateResultPartitionIterator
> > was thrown. The partition manager complained that the producer execution
> > ID was unknown. I assume that this has to be fixed first in order to
> > schedule all tasks immediately. But Ufuk will probably know it better.
>
> On 21 Jan 2015, at 20:58, Stephan Ewen <[email protected]> wrote:
>
> > - The queues would still send notifications to the JobManager that data
> > is available, but the JM will see that the target task is already
> > deployed (or currently being deployed). Then the info where to grab a
> > channel from would need to be sent to the task. That mechanism also
> > exists already.
>
> The only minor thing that needs to be adjusted would be this mechanism. It
> is indeed in place already (e.g. UNKNOWN input channels are updated at
> runtime to LOCAL or REMOTE input channels depending on the producer
> location), but currently the consumer tasks assume that the consumed
> intermediate result partition has already been created when they (the
> consumer tasks) are deployed and request the partition. When we schedule all
> tasks at once, we might end up in situations like the test case Till
> described, where we know that it is a LOCAL or REMOTE channel, but the
> intermediate result has not been created yet and the request fails.
>
> tl;dr: channels can be updated at runtime, but requests need to arrive
> after the producer created the partition.
