It turned out the issue was with my environment, not Spark. In case anyone
else is experiencing this: the Spark workers did not use the machine
hostname by default, so the hostnames passed as preferred locations never
matched the hostnames the workers registered with. Setting the following
environment variable on each worker rectified it: SPARK_LOCAL_HOSTNAME=worker1
(and likewise worker2, worker3 on the other workers).
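For example (a sketch of a typical standalone setup; the conf/spark-env.sh
location and the hostnames are illustrative, not taken from my exact setup):

-----------------------------------------------------------------------------------------------------------------
# conf/spark-env.sh on the machine that should register as worker1
# (repeat with worker2/worker3 on the other workers, then restart each
# worker so it re-registers under the new hostname)
export SPARK_LOCAL_HOSTNAME=worker1
-----------------------------------------------------------------------------------------------------------------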
<https://stackoverflow.com/users/14147688/tom-scott>
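P.S. To verify which hostnames the executors actually registered with (they
need to match the strings passed as preferred locations), one quick check
from spark-shell is the following; entries come back as "host:port":

-----------------------------------------------------------------------------------------------------------------
// Prints one "host:port" entry per registered block manager; the host
// part is what the scheduler compares against the preferred locations.
sc.getExecutorMemoryStatus.keys.foreach(println)
-----------------------------------------------------------------------------------------------------------------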

On Tue, Sep 8, 2020 at 10:11 PM Tom Scott <thomaskwsc...@gmail.com> wrote:

> Hi Guys,
>
>   I asked this on Stack Overflow here:
> https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-spark-cluster
> but I am hoping for further help here.
>
>   I have a 4-node standalone cluster, with workers named worker1, worker2,
> and worker3, and a master on which I am running spark-shell. Given the
> following example:
>
> -----------------------------------------------------------------------------------------------------------------
> import scala.collection.mutable
>
> // Each element is (value, preferred hosts): this makeRDD overload creates
> // one partition per element, with those hosts as its preferred locations.
> val someData = mutable.ArrayBuffer[(String, Seq[String])]()
>
> someData += ("1" -> Seq("worker1"))
> someData += ("2" -> Seq("worker2"))
> someData += ("3" -> Seq("worker3"))
>
> val someRdd = sc.makeRDD(someData)
>
> // Tag each element with the hostname of the executor that processes it.
> someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName())
>   .collect().foreach(println)
>
> -----------------------------------------------------------------------------------------------------------------
>
> The cluster is completely idle, with nothing else executing, so I would
> expect to see the output:
>
> 1:worker1
> 2:worker2
> 3:worker3
>
> but in fact the placement is non-deterministic, and I see things like:
>
> scala> someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
> 1:worker3
> 2:worker1
> 3:worker2
>
> scala> someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
> 1:worker2
> 2:worker3
> 3:worker1
>
> Am I doing this wrong, or is this expected behaviour?
>
> Thanks
>
>   Tom
>
>
