Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
If you're using Kubernetes you can group Spark and HDFS to run in the same stack, meaning they'll basically run in the same network space and share IPs. You just have to make sure there are no port conflicts.
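
For illustration, one way to realize this grouping is a single Kubernetes pod holding both containers, since containers in a pod share one network namespace and therefore one IP. The sketch below uses the Kubernetes Python client; the image names are hypothetical placeholders, not anything from this thread.

```python
# Minimal sketch, assuming the Kubernetes Python client and hypothetical
# image names: co-locate an HDFS datanode and a Spark worker in one pod
# so they share the pod's network namespace and IP.
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="spark-hdfs-node"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="hdfs-datanode",
                image="example/hdfs-datanode:2.7",    # hypothetical image
            ),
            client.V1Container(
                name="spark-worker",
                image="example/spark-worker:2.0.2",   # hypothetical image
            ),
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because both processes then share one IP, they have to listen on distinct ports, which is the port-conflict caveat mentioned above.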

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Good idea, thanks! But unfortunately that's not possible. All containers are connected to an overlay network. Is there any other possibility to tell Spark that it is on the same *NODE* as an HDFS data node?

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
It might have to do with your container IPs; it depends on your networking setup. You might want to try host networking so that the containers share the host's IP.
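
As a sketch of the host-networking suggestion (the equivalent of `docker run --network host`), here is a version using the Docker SDK for Python; the image names are hypothetical.

```python
# Minimal sketch, assuming the Docker SDK for Python ("docker" package)
# and hypothetical image names: start the datanode and the Spark worker
# with host networking so both report the host's IP/hostname instead of
# a container-private overlay address.
import docker

client = docker.from_env()

for name, image in [
    ("hdfs-datanode", "example/hdfs-datanode:2.7"),   # hypothetical image
    ("spark-worker", "example/spark-worker:2.0.2"),   # hypothetical image
]:
    client.containers.run(
        image,
        name=name,
        network_mode="host",   # share the host's network namespace
        detach=True,
    )
```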

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Hi Sun Rui, thanks for answering! > Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. This explains why the script that I configured in core-site.xml (topology.script.file.name) is not called by the Spark
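
For reference, the script wired in via topology.script.file.name is just an executable that Hadoop calls with datanode IPs or hostnames and that prints one rack path per argument on stdout. A minimal Python sketch with a purely hypothetical host-to-rack mapping (one "rack" per Docker host) might look like this; note that, per Sun Rui's reply, the Spark standalone scheduler apparently never invokes it, so it only affects HDFS-side rack awareness.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop rack topology script (the executable named
# by topology.script.file.name in core-site.xml). Hadoop invokes it with
# one or more datanode IPs/hostnames and expects one rack path per
# argument on stdout. The mapping below is hypothetical.
import sys

RACKS = {
    "10.0.0.11": "/rack-host1",   # hypothetical Docker host 1
    "10.0.0.12": "/rack-host2",   # hypothetical Docker host 2
}

for host in sys.argv[1:]:
    print(RACKS.get(host, "/default-rack"))
```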

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. However, node-level locality can still work in standalone mode. It is not necessary to copy the Hadoop config files into the Spark conf directory. Set HADOOP_CONF_DIR to
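
Node-level locality in standalone mode hinges on the host names that Spark workers register with matching the host names HDFS reports for its block locations, which is exactly what overlay-network container addresses tend to break. A rough way to compare the two sides is sketched below; it assumes the standalone master's web UI is reachable at the hypothetical URL shown and serves cluster status as JSON at /json, and it parses the "Hostname:" lines of `hdfs dfsadmin -report`.

```python
# Rough sketch: compare the hosts Spark standalone workers register with
# against the hostnames HDFS reports for its datanodes. If the two sets
# don't overlap, NODE_LOCAL tasks cannot be scheduled.
import json
import subprocess
from urllib.request import urlopen

# Assumption: standalone master web UI reachable here, JSON at /json.
MASTER_JSON = "http://spark-master:8080/json"

workers = {w["host"] for w in json.load(urlopen(MASTER_JSON))["workers"]}

# "hdfs dfsadmin -report" prints one "Hostname: ..." line per datanode.
report = subprocess.run(
    ["hdfs", "dfsadmin", "-report"],
    capture_output=True, text=True, check=True,
).stdout
datanodes = {
    line.split(":", 1)[1].strip()
    for line in report.splitlines()
    if line.strip().startswith("Hostname:")
}

print("Spark worker hosts :", sorted(workers))
print("HDFS datanode hosts:", sorted(datanodes))
print("Matching on both sides:", sorted(workers & datanodes))
```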

[Spark 2.0.2 HDFS]: no data locality

2016-12-26 Thread Karamba
Hi, I am running a couple of Docker hosts, each with an HDFS data node and a Spark worker in a Spark standalone cluster. In order to get data locality awareness, I would like to configure racks for each host, so that a Spark worker container knows from which HDFS data node container it should load its data.