Re: [Spark Core] makeRDD() preferredLocations do not appear to be considered

2020-09-12 Thread Tom Scott
It turned out the issue was with my environment, not Spark. Just in case
anyone else is experiencing this: the problem was that the Spark workers did
not use the machine hostname by default, so the hostnames the executors
advertised never matched the location hints. Setting the following environment
variable on each worker rectified it: SPARK_LOCAL_HOSTNAME: "worker1" etc.
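
For anyone who wants a concrete starting point, here is a minimal sketch of
that fix, assuming the workers are launched with a conf/spark-env.sh (if you
start them via docker-compose instead, the equivalent is an environment: entry
on each worker service; the hostname values are just the names in my cluster):

-
# conf/spark-env.sh on the machine named worker1
# (repeat on each worker with its own hostname)
# SPARK_LOCAL_HOSTNAME sets the hostname the worker advertises to the cluster,
# which is what the scheduler compares against the preferredLocations hints.
export SPARK_LOCAL_HOSTNAME=worker1
-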


On Tue, Sep 8, 2020 at 10:11 PM Tom Scott wrote:

> Hi Guys,
>
>   I asked this on Stack Overflow here:
> https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-spark-cluster
> but am hoping for further help here.
>
>   I have a 4 node standalone cluster with workers named worker1, worker2
> and worker3 and a master on which I am running spark-shell. Given the
> following example:
>
> -
> import scala.collection.mutable
>
> val someData = mutable.ArrayBuffer[(String, Seq[String])]()
>
> someData += ("1" -> Seq("worker1"))
> someData += ("2" -> Seq("worker2"))
> someData += ("3" -> Seq("worker3"))
>
> val someRdd = sc.makeRDD(someData)
>
> someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName())
>   .collect()
>   .foreach(println)
>
> -
>
> The cluster is completely idle with nothing else executing, so I would
> expect to see the output:
>
> 1:worker1
> 2:worker2
> 3:worker3
>
> but in fact the output is nondeterministic and I see things like:
>
> scala> someRdd.map(i=>i + ":" +
> java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
> 1:worker3
> 2:worker1
> 3:worker2
>
> scala> someRdd.map(i=>i + ":" +
> java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
> 1:worker2
> 2:worker3
> 3:worker1
>
> Am I doing this wrong or is this expected behaviour?
>
> Thanks
>
>   Tom
>
>


[Spark Core] makeRDD() preferredLocations do not appear to be considered

2020-09-08 Thread Tom Scott
Hi Guys,

  I asked this on Stack Overflow here:
https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-spark-cluster
but am hoping for further help here.

  I have a 4 node standalone cluster with workers named worker1, worker2
and worker3 and a master on which I am running spark-shell. Given the
following example:
-
import scala.collection.mutable

val someData = mutable.ArrayBuffer[(String, Seq[String])]()

someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))

val someRdd = sc.makeRDD(someData)

someRdd.map(i => i + ":" + java.net.InetAddress.getLocalHost().getHostName())
  .collect()
  .foreach(println)
-
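
As a sanity check, the locality hints can be read back from the RDD itself
(a small sketch using the standard partitions and preferredLocations calls;
makeRDD creates one partition per element, in the order they were added):

-
// Print the preferred location(s) recorded for each partition of someRdd.
someRdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${someRdd.preferredLocations(p).mkString(", ")}")
}
-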

The cluster is completely idle with nothing else executing, so I would
expect to see the output:

1:worker1
2:worker2
3:worker3

but in fact the output is nondeterministic and I see things like:

scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2

scala> someRdd.map(i=>i + ":" +
java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1

Am I doing this wrong or is this expected behaviour?

Thanks

  Tom