Raymond,

Thank you.

But I read in another thread
<http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html>
that "PROCESS_LOCAL" means the data is in the same JVM as the code that is
running. If the data is in the same JVM as the running code, then the data
must be on the same node as that JVM, i.e., the data can be said to be
local.

Also, you said that tasks will be assigned to available executors which
satisfy the application's requirements. But what requirements must an
executor satisfy for tasks to be assigned to it? Do you mean resources
(memory, CPU)?
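In case it helps frame my question: my understanding is that, in standalone
mode, an application's resource "requirements" come from a couple of settings
(the property names are from the Spark docs; the values below are just
illustrative, not what I actually use):

```properties
# spark-defaults.conf (or set programmatically on SparkConf) -- example values only
spark.executor.memory   4g    # heap size each executor is launched with
spark.cores.max         32    # max total cores the application may claim across the cluster
```

Is this what you meant by requirements, or is there something beyond memory
and cores?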

Finally, is there any way to guarantee that an application's executors will
run on all Spark nodes when the data to be processed is big enough (for
example, when the HBase data resides on all RegionServers)?
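I also came across the standalone master's `spark.deploy.spreadOut` setting,
which, if I understand the docs correctly, controls whether the master
spreads an application's executors across as many workers as possible or
packs them onto as few as possible:

```properties
# Read by the standalone master; my understanding is that it only spreads
# over workers already registered at the time executors are scheduled
spark.deploy.spreadOut  true
```

But if some workers register with the master late, I guess even
spreadOut=true could not place executors on them; might that explain the
3-node case?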




2014-10-20 11:35 GMT+08:00 raymond <rgbbones.m...@gmail.com>:

> My best guess is that the speed at which your executors got registered with
> the driver differs between runs.
>
> When you ran it the first time, the executors were not fully registered by
> the time the task set manager started to assign tasks, so the tasks were
> assigned to the executors already available that satisfied what you need,
> say 86 in one batch.
>
> And "PROCESS_LOCAL" does not necessarily mean that the data is local; it
> could be that the executor for the data source is not available yet (in
> your case, it might become available later).
>
> If this is the case, you could simply sleep for a few seconds before
> running the job. There are also some related patches that provide other
> ways to sync executor status before running applications, but I haven't
> tracked their status for a while.
>
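> (For reference, the knobs I recall from those patches look roughly like the
> following in newer releases; on 0.9 the sleep workaround is the practical
> option. Property names are from memory, so double-check them against your
> version's docs:)
>
> ```properties
> # In later Spark releases (not 0.9): hold off scheduling the first task set
> # until enough executors have registered, up to a timeout
> spark.scheduler.minRegisteredResourcesRatio       0.8
> spark.scheduler.maxRegisteredResourcesWaitingTime 30s
> ```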
>
> Raymond
>
> On October 20, 2014, at 11:22 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
> Hi all,
>
> I have a Spark-0.9 cluster, which has 16 nodes.
>
> I wrote a Spark application to read data from an HBase table which has 86
> regions spread over 20 RegionServers.
>
> I submitted the Spark app in Spark standalone mode and found that there
> were 86 executors running on just 3 nodes, and it took about 30 minutes to
> read the data from the table. In this case, I noticed from the Spark master
> UI that the Locality Level of all executors was "PROCESS_LOCAL".
>
> Later I ran the same app again (without any code changes) and found that
> those 86 executors were running on 16 nodes, and this time it took just 4
> minutes to read data from the same HBase table. In this case, I noticed
> that the Locality Level of most executors was "NODE_LOCAL".
>
> After testing multiple times, I found that the two cases above occur randomly.
>
> So I have 2 questions:
> 1)  Why do the two cases above occur randomly when I submit the same
> application multiple times?
> 2)  Does the spread of executors influence the locality level?
>
> Thank you.
>
>
