If you're using Kubernetes you can group the Spark and HDFS containers to run in the same stack (a pod, in Kubernetes terms). That way they basically run in the same network namespace and share an IP; you just have to make sure there are no port conflicts.
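Something like this is what I mean, sketched with the Kubernetes Python client just for illustration (pod, container, and image names are placeholders, not anything you actually have):

# Sketch only: co-locate a Spark worker and an HDFS datanode in one pod.
# Containers in a pod share the network namespace, so both get the pod's
# IP and can reach each other on localhost, which is also why their
# ports must not collide.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="spark-hdfs-node"),  # placeholder name
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="hdfs-datanode",
                image="my-hdfs-datanode:latest",  # placeholder image
            ),
            client.V1Container(
                name="spark-worker",
                image="my-spark-worker:latest",  # placeholder image
            ),
        ]
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Since both containers report the pod's IP and hostname, the Spark worker and the datanode look co-located to anything that compares addresses. (I've also put a rough example of the rack-mapping script you mentioned at the very bottom, below the quoted thread.)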
On Wed, Dec 28, 2016 at 5:07 AM, Karamba <phantom...@web.de> wrote:
>
> Good idea, thanks!
>
> But unfortunately that's not possible. All containers are connected to
> an overlay network.
>
> Is there any other possibility to tell Spark that it is on the same *NODE*
> as an HDFS data node?
>
>
> On 28.12.2016 12:00, Miguel Morales wrote:
>> It might have to do with your container IPs; it depends on your
>> networking setup. You might want to try host networking so that the
>> containers share the IP with the host.
>>
>> On Wed, Dec 28, 2016 at 1:46 AM, Karamba <phantom...@web.de> wrote:
>>> Hi Sun Rui,
>>>
>>> thanks for answering!
>>>
>>>
>>>> Although the Spark task scheduler is aware of rack-level data locality, it
>>>> seems that only YARN implements the support for it.
>>> This explains why the script that I configured in core-site.xml
>>> (topology.script.file.name) is not called by the Spark container.
>>> But when reading from HDFS in a Spark program, the script is
>>> called in my HDFS namenode container.
>>>
>>>> However, node-level locality can still work for Standalone.
>>> I have a couple of physical hosts that run Spark and HDFS docker
>>> containers. How does Spark standalone know that the Spark and HDFS
>>> containers are on the same host?
>>>
>>>> Data locality involves both task data locality and executor data
>>>> locality. Executor data locality is only supported on YARN with executor
>>>> dynamic allocation enabled. For standalone, by default, a Spark
>>>> application will acquire all available cores in the cluster, generally
>>>> meaning there is at least one executor on each node, in which case task
>>>> data locality can work because a task can be dispatched to an executor on
>>>> any of the preferred nodes of the task for execution.
>>>>
>>>> For your case, have you set spark.cores.max to limit the cores to acquire,
>>>> which means executors are available on a subset of the cluster nodes?
>>> I set "--total-executor-cores 1" in order to use only a small subset of
>>> the cluster.
>>>
>>>
>>>
>>> On 28.12.2016 02:58, Sun Rui wrote:
>>>> Although the Spark task scheduler is aware of rack-level data locality, it
>>>> seems that only YARN implements the support for it. However, node-level
>>>> locality can still work for Standalone.
>>>>
>>>> It is not necessary to copy the Hadoop config files into the Spark conf
>>>> directory. Set HADOOP_CONF_DIR to point to the conf directory of your
>>>> Hadoop installation.
>>>>
>>>> Data locality involves both task data locality and executor data
>>>> locality. Executor data locality is only supported on YARN with executor
>>>> dynamic allocation enabled. For standalone, by default, a Spark
>>>> application will acquire all available cores in the cluster, generally
>>>> meaning there is at least one executor on each node, in which case task
>>>> data locality can work because a task can be dispatched to an executor on
>>>> any of the preferred nodes of the task for execution.
>>>>
>>>> For your case, have you set spark.cores.max to limit the cores to acquire,
>>>> which means executors are available on a subset of the cluster nodes?
>>>>
>>>>> On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am running a couple of docker hosts, each with an HDFS and a Spark
>>>>> worker in a Spark standalone cluster.
>>>>> In order to get data locality awareness, I would like to configure racks
>>>>> for each host, so that a Spark worker container knows from which HDFS
>>>>> data node container it should load its data. Does this make sense?
>>>>>
>>>>> I configured the HDFS container nodes via the core-site.xml in
>>>>> $HADOOP_HOME/etc and this works: hdfs dfsadmin -printTopology shows my
>>>>> setup.
>>>>>
>>>>> I configured Spark the same way. I placed core-site.xml and
>>>>> hdfs-site.xml in the SPARK_CONF_DIR ... but this has no effect.
>>>>>
>>>>> Submitting a Spark job via spark-submit to the spark-master that loads
>>>>> from HDFS only ever shows data locality ANY.
>>>>>
>>>>> It would be great if anybody could help me find the right
>>>>> configuration!
>>>>>
>>>>> Thanks and best regards,
>>>>> on
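P.S. On the topology.script.file.name script you mentioned: HDFS runs it with one or more IPs/hostnames as arguments and expects one rack path per argument on stdout, so a minimal version is tiny. A sketch (the address-to-rack mapping below is invented; fill in your own nodes):

#!/usr/bin/env python
# Sketch of a Hadoop topology script: it is invoked with one or more
# IPs/hostnames as arguments and must print one rack path per argument,
# in the same order.
import sys

# Hypothetical address-to-rack mapping; replace with your real datanodes.
RACKS = {
    "10.0.0.11": "/rack1",
    "10.0.0.12": "/rack1",
    "10.0.0.21": "/rack2",
}

for addr in sys.argv[1:]:
    print(RACKS.get(addr, "/default-rack"))

Make it executable and point topology.script.file.name at it in core-site.xml. As Sun Rui said, though, only YARN really uses rack-level locality on the Spark side, so with standalone this mainly helps HDFS itself.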