Hello all,

I have a naive question about how Spark uses executors in a cluster
of machines. Imagine a scenario in which I do not know the input size of
my data for execution A, so I set Spark to use 20 nodes (out of my 25, for
instance). At the same time, I launch a second execution B, setting
Spark to use 10 nodes for it.
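
In case it clarifies what I mean by "setting Spark to use N nodes": I am
thinking of a fixed executor reservation roughly like the sketch below (the
application name and the memory value are only illustrative, not my exact
configuration):

    import org.apache.spark.sql.SparkSession

    // Execution A: reserve a fixed number of executors up front.
    // Execution B would do the same with spark.executor.instances = 10.
    val spark = SparkSession.builder()
      .appName("execution-A")                    // illustrative name
      .config("spark.executor.instances", "20")  // one executor per node, say
      .config("spark.executor.memory", "4g")     // arbitrary value for the example
      .getOrCreate()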

Assuming a huge input size for execution A, implying an execution time of,
say, 30 minutes (using all of its resources), and a constant execution time
of 10 minutes for B, the two executions together take 40 minutes (I assume
that B cannot be launched until 10 nodes are completely available, i.e.
when A finishes).

Now, assume a very small input size for execution A, so that it runs for 5
minutes on only 2 of the 20 reserved nodes. I would like execution B to be
launched right away, so that both executions finish within 10 minutes while
using only 12 nodes. However, since execution A has reserved 20 nodes in
Spark, execution B has to wait until A has finished, and the total
execution time becomes 15 minutes.

Is this right? If so, how can I handle this kind of scenario? If I am
wrong, what is the correct interpretation?
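
For what it is worth, I have come across the dynamic allocation settings
sketched below, but I am not sure whether they are the right tool for this
multi-application scenario (the min/max values are just an example):

    import org.apache.spark.sql.SparkSession

    // Dynamic allocation, as I understand it: executors are requested and
    // released based on load instead of being reserved up front.
    val spark = SparkSession.builder()
      .appName("execution-A")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")   // needed for dynamic allocation, as far as I understand
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .getOrCreate()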

Thanks in advance,

Best
