Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
If you're using Kubernetes you can group Spark and HDFS to run in the
same stack, meaning they'll basically run in the same network space
and share IPs.  Just gotta make sure there are no port conflicts.
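
In case it helps, here is a minimal sketch of that idea as a Kubernetes
pod spec: one pod co-locating a Spark worker and an HDFS datanode so they
share a single network namespace and IP. The pod name and images are made
up; the ports are the usual defaults.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-hdfs-node                      # hypothetical pod name
spec:
  containers:
    # Both containers share the pod's network namespace and IP,
    # so their ports must not collide.
    - name: spark-worker
      image: my-registry/spark-worker:2.0.2  # hypothetical image
      ports:
        - containerPort: 8081                # Spark worker web UI default
    - name: hdfs-datanode
      image: my-registry/hdfs-datanode:2.7   # hypothetical image
      ports:
        - containerPort: 50010               # HDFS datanode transfer default
```

Because both containers report the pod's address, Spark sees the executor
and the datanode as the same node, which is what node-level locality needs.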

On Wed, Dec 28, 2016 at 5:07 AM, Karamba  wrote:
>
> Good idea, thanks!
>
> But unfortunately that's not possible. All containers are connected to
> an overlay network.
>
> Is there any other possibility to tell Spark that it is on the same *NODE*
> as an HDFS data node?
>
>
> On 28.12.2016 12:00, Miguel Morales wrote:
>> It might have to do with your container IPs; it depends on your
>> networking setup.  You might want to try host networking so that the
>> containers share the IP with the host.
>>
>> On Wed, Dec 28, 2016 at 1:46 AM, Karamba  wrote:
>>> Hi Sun Rui,
>>>
>>> thanks for answering!
>>>
>>>
 Although the Spark task scheduler is aware of rack-level data locality, it 
 seems that only YARN implements the support for it.
>>> This explains why the script that I configured via
>>> topology.script.file.name in core-site.xml is not called by the Spark
>>> container. But when a Spark program reads from HDFS, the script is
>>> called in my HDFS namenode container.
>>>
 However, node-level locality can still work for Standalone.
>>> I have a couple of physical hosts that run Spark and HDFS docker
>>> containers. How does Spark standalone know that the Spark and HDFS
>>> containers are on the same host?
>>>
Data locality involves both task data locality and executor data
 locality. Executor data locality is only supported on YARN with executor 
 dynamic allocation enabled. For standalone, by default, a Spark 
 application will acquire all available cores in the cluster, generally 
 meaning there is at least one executor on each node, in which case task 
 data locality can work because a task can be dispatched to an executor on 
 any of the preferred nodes of the task for execution.

For your case, have you set spark.cores.max to limit the cores to acquire,
so that executors are available on only a subset of the cluster nodes?
>>> I set "--total-executor-cores 1" in order to use only a small subset of
>>> the cluster.
>>>
>>>
>>>
>>> On 28.12.2016 02:58, Sun Rui wrote:
 Although the Spark task scheduler is aware of rack-level data locality, it 
 seems that only YARN implements the support for it. However, node-level 
 locality can still work for Standalone.

 It is not necessary to copy the hadoop config files into the Spark CONF 
 directory. Set HADOOP_CONF_DIR to point to the conf directory of your 
 Hadoop.

Data locality involves both task data locality and executor data
 locality. Executor data locality is only supported on YARN with executor 
 dynamic allocation enabled. For standalone, by default, a Spark 
 application will acquire all available cores in the cluster, generally 
 meaning there is at least one executor on each node, in which case task 
 data locality can work because a task can be dispatched to an executor on 
 any of the preferred nodes of the task for execution.

For your case, have you set spark.cores.max to limit the cores to acquire,
so that executors are available on only a subset of the cluster nodes?

> On Dec 27, 2016, at 01:39, Karamba  wrote:
>
> Hi,
>
> I am running a couple of docker hosts, each with an HDFS node and a
> Spark worker in a Spark standalone cluster.
> In order to get data locality awareness, I would like to configure racks
> for each host, so that a Spark worker container knows from which HDFS
> node container it should load its data. Does this make sense?
>
> I configured HDFS container nodes via the core-site.xml in
> $HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
> setup.
>
> I configured Spark the same way: I placed core-site.xml and
> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>
> A Spark job submitted via spark-submit to the Spark master that loads
> from HDFS only ever shows data locality ANY.
>
> It would be great if anybody would help me get the right
> configuration!
>
> Thanks and best regards,
> on
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba

Good idea, thanks!

But unfortunately that's not possible. All containers are connected to
an overlay network.

Is there any other possibility to tell Spark that it is on the same *NODE*
as an HDFS data node?


On 28.12.2016 12:00, Miguel Morales wrote:
> It might have to do with your container IPs; it depends on your
> networking setup.  You might want to try host networking so that the
> containers share the IP with the host.
>
> On Wed, Dec 28, 2016 at 1:46 AM, Karamba  wrote:
>> Hi Sun Rui,
>>
>> thanks for answering!
>>
>>
>>> Although the Spark task scheduler is aware of rack-level data locality, it 
>>> seems that only YARN implements the support for it.
>> This explains why the script that I configured via
>> topology.script.file.name in core-site.xml is not called by the Spark
>> container. But when a Spark program reads from HDFS, the script is
>> called in my HDFS namenode container.
>>
>>> However, node-level locality can still work for Standalone.
>> I have a couple of physical hosts that run Spark and HDFS docker
>> containers. How does Spark standalone know that the Spark and HDFS
>> containers are on the same host?
>>
>>> Data locality involves both task data locality and executor data
>>> locality. Executor data locality is only supported on YARN with executor 
>>> dynamic allocation enabled. For standalone, by default, a Spark application 
>>> will acquire all available cores in the cluster, generally meaning there is 
>>> at least one executor on each node, in which case task data locality can 
>>> work because a task can be dispatched to an executor on any of the 
>>> preferred nodes of the task for execution.
>>>
>>> For your case, have you set spark.cores.max to limit the cores to acquire,
>>> so that executors are available on only a subset of the cluster nodes?
>> I set "--total-executor-cores 1" in order to use only a small subset of
>> the cluster.
>>
>>
>>
>> On 28.12.2016 02:58, Sun Rui wrote:
>>> Although the Spark task scheduler is aware of rack-level data locality, it 
>>> seems that only YARN implements the support for it. However, node-level 
>>> locality can still work for Standalone.
>>>
>>> It is not necessary to copy the hadoop config files into the Spark CONF 
>>> directory. Set HADOOP_CONF_DIR to point to the conf directory of your 
>>> Hadoop.
>>>
>>> Data locality involves both task data locality and executor data
>>> locality. Executor data locality is only supported on YARN with executor 
>>> dynamic allocation enabled. For standalone, by default, a Spark application 
>>> will acquire all available cores in the cluster, generally meaning there is 
>>> at least one executor on each node, in which case task data locality can 
>>> work because a task can be dispatched to an executor on any of the 
>>> preferred nodes of the task for execution.
>>>
>>> For your case, have you set spark.cores.max to limit the cores to acquire,
>>> so that executors are available on only a subset of the cluster nodes?
>>>
 On Dec 27, 2016, at 01:39, Karamba  wrote:

 Hi,

I am running a couple of docker hosts, each with an HDFS node and a
Spark worker in a Spark standalone cluster.
In order to get data locality awareness, I would like to configure racks
for each host, so that a Spark worker container knows from which HDFS
node container it should load its data. Does this make sense?

 I configured HDFS container nodes via the core-site.xml in
 $HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
 setup.

I configured Spark the same way: I placed core-site.xml and
hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.

A Spark job submitted via spark-submit to the Spark master that loads
from HDFS only ever shows data locality ANY.

It would be great if anybody would help me get the right configuration!

 Thanks and best regards,
 on




Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
It might have to do with your container IPs; it depends on your
networking setup.  You might want to try host networking so that the
containers share the IP with the host.
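
As a concrete sketch of that suggestion (the image name is hypothetical),
the worker container could be started with host networking like this:

```shell
# Attach the Spark worker container to the host's network namespace, so it
# binds to and advertises the host's own IP/hostname. An HDFS datanode
# running on the same host then reports an identical address, which lets
# Spark match executors to datanodes at node level.
docker run -d \
  --network host \
  --name spark-worker \
  my-registry/spark-worker:2.0.2
```

Note that with --network host no port mapping happens, so any port
conflicts with services already running on the host have to be resolved
by hand.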

On Wed, Dec 28, 2016 at 1:46 AM, Karamba  wrote:
>
> Hi Sun Rui,
>
> thanks for answering!
>
>
>> Although the Spark task scheduler is aware of rack-level data locality, it 
>> seems that only YARN implements the support for it.
>
> This explains why the script that I configured via
> topology.script.file.name in core-site.xml is not called by the Spark
> container. But when a Spark program reads from HDFS, the script is
> called in my HDFS namenode container.
>
>> However, node-level locality can still work for Standalone.
>
> I have a couple of physical hosts that run Spark and HDFS docker
> containers. How does Spark standalone know that the Spark and HDFS
> containers are on the same host?
>
>> Data locality involves both task data locality and executor data
>> locality. Executor data locality is only supported on YARN with executor 
>> dynamic allocation enabled. For standalone, by default, a Spark application 
>> will acquire all available cores in the cluster, generally meaning there is 
>> at least one executor on each node, in which case task data locality can 
>> work because a task can be dispatched to an executor on any of the preferred 
>> nodes of the task for execution.
>>
>> For your case, have you set spark.cores.max to limit the cores to acquire,
>> so that executors are available on only a subset of the cluster nodes?
> I set "--total-executor-cores 1" in order to use only a small subset of
> the cluster.
>
>
>
> On 28.12.2016 02:58, Sun Rui wrote:
>> Although the Spark task scheduler is aware of rack-level data locality, it 
>> seems that only YARN implements the support for it. However, node-level 
>> locality can still work for Standalone.
>>
>> It is not necessary to copy the hadoop config files into the Spark CONF 
>> directory. Set HADOOP_CONF_DIR to point to the conf directory of your Hadoop.
>>
>> Data locality involves both task data locality and executor data
>> locality. Executor data locality is only supported on YARN with executor 
>> dynamic allocation enabled. For standalone, by default, a Spark application 
>> will acquire all available cores in the cluster, generally meaning there is 
>> at least one executor on each node, in which case task data locality can 
>> work because a task can be dispatched to an executor on any of the preferred 
>> nodes of the task for execution.
>>
>> For your case, have you set spark.cores.max to limit the cores to acquire,
>> so that executors are available on only a subset of the cluster nodes?
>>
>>> On Dec 27, 2016, at 01:39, Karamba  wrote:
>>>
>>> Hi,
>>>
>>> I am running a couple of docker hosts, each with an HDFS node and a
>>> Spark worker in a Spark standalone cluster.
>>> In order to get data locality awareness, I would like to configure racks
>>> for each host, so that a Spark worker container knows from which HDFS
>>> node container it should load its data. Does this make sense?
>>>
>>> I configured HDFS container nodes via the core-site.xml in
>>> $HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
>>> setup.
>>>
>>> I configured Spark the same way: I placed core-site.xml and
>>> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>>>
>>> A Spark job submitted via spark-submit to the Spark master that loads
>>> from HDFS only ever shows data locality ANY.
>>>
>>> It would be great if anybody would help me get the right configuration!
>>>
>>> Thanks and best regards,
>>> on
>>>



Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba

Hi Sun Rui,

thanks for answering!


> Although the Spark task scheduler is aware of rack-level data locality, it 
> seems that only YARN implements the support for it. 

This explains why the script that I configured via
topology.script.file.name in core-site.xml is not called by the Spark
container. But when a Spark program reads from HDFS, the script is
called in my HDFS namenode container.
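
For anyone following along, the script behind topology.script.file.name is
just an executable that Hadoop calls with one or more IPs/hostnames and
that prints one rack path per argument. A minimal sketch, with made-up
subnets:

```shell
#!/bin/sh
# Map each argument (IP or hostname) to a rack path, one line per
# argument, in the order the arguments were given.
rack_of() {
  case "$1" in
    10.0.1.*) echo "/rack1" ;;          # made-up subnet for rack 1
    10.0.2.*) echo "/rack2" ;;          # made-up subnet for rack 2
    *)        echo "/default-rack" ;;   # Hadoop's usual fallback rack
  esac
}

for node in "$@"; do
  rack_of "$node"
done
```

Any executable with this input/output contract works; Hadoop only cares
that it prints exactly one rack path per argument.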

> However, node-level locality can still work for Standalone.

I have a couple of physical hosts that run Spark and HDFS docker
containers. How does Spark standalone know that the Spark and HDFS
containers are on the same host?

> Data locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic 
> allocation enabled. For standalone, by default, a Spark application will 
> acquire all available cores in the cluster, generally meaning there is at 
> least one executor on each node, in which case task data locality can work 
> because a task can be dispatched to an executor on any of the preferred nodes 
> of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> so that executors are available on only a subset of the cluster nodes?
I set "--total-executor-cores 1" in order to use only a small subset of
the cluster.



On 28.12.2016 02:58, Sun Rui wrote:
> Although the Spark task scheduler is aware of rack-level data locality, it 
> seems that only YARN implements the support for it. However, node-level 
> locality can still work for Standalone.
>
> It is not necessary to copy the hadoop config files into the Spark CONF 
> directory. Set HADOOP_CONF_DIR to point to the conf directory of your Hadoop.
>
> Data locality involves both task data locality and executor data locality.
> Executor data locality is only supported on YARN with executor dynamic 
> allocation enabled. For standalone, by default, a Spark application will 
> acquire all available cores in the cluster, generally meaning there is at 
> least one executor on each node, in which case task data locality can work 
> because a task can be dispatched to an executor on any of the preferred nodes 
> of the task for execution.
>
> For your case, have you set spark.cores.max to limit the cores to acquire,
> so that executors are available on only a subset of the cluster nodes?
>
>> On Dec 27, 2016, at 01:39, Karamba  wrote:
>>
>> Hi,
>>
>> I am running a couple of docker hosts, each with an HDFS node and a
>> Spark worker in a Spark standalone cluster.
>> In order to get data locality awareness, I would like to configure racks
>> for each host, so that a Spark worker container knows from which HDFS
>> node container it should load its data. Does this make sense?
>>
>> I configured HDFS container nodes via the core-site.xml in
>> $HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
>> setup.
>>
>> I configured Spark the same way: I placed core-site.xml and
>> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>>
>> A Spark job submitted via spark-submit to the Spark master that loads
>> from HDFS only ever shows data locality ANY.
>>
>> It would be great if anybody would help me get the right configuration!
>>
>> Thanks and best regards,
>> on
>>



Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
Although the Spark task scheduler is aware of rack-level data locality, it 
seems that only YARN implements the support for it. However, node-level 
locality can still work for Standalone.

It is not necessary to copy the hadoop config files into the Spark CONF 
directory. Set HADOOP_CONF_DIR to point to the conf directory of your Hadoop.

Data locality involves both task data locality and executor data locality.
Executor data locality is only supported on YARN with executor dynamic 
allocation enabled. For standalone, by default, a Spark application will 
acquire all available cores in the cluster, generally meaning there is at least 
one executor on each node, in which case task data locality can work because a 
task can be dispatched to an executor on any of the preferred nodes of the task 
for execution.

For your case, have you set spark.cores.max to limit the cores to acquire,
so that executors are available on only a subset of the cluster nodes?
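
To make the two points above concrete, a sketch of the submit side; the
conf path, master URL, core count, and jar name are all made up:

```shell
# Point Spark at the existing Hadoop conf directory instead of copying
# core-site.xml / hdfs-site.xml into SPARK_CONF_DIR.
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

# Leaving spark.cores.max (or --total-executor-cores) unset lets the app
# take cores on every node, so each node gets an executor and node-local
# tasks are possible; setting it confines executors to a subset of nodes.
spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.cores.max=8 \
  my-app.jar
```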

> On Dec 27, 2016, at 01:39, Karamba  wrote:
> 
> Hi,
> 
> I am running a couple of docker hosts, each with an HDFS node and a
> Spark worker in a Spark standalone cluster.
> In order to get data locality awareness, I would like to configure racks
> for each host, so that a Spark worker container knows from which HDFS
> node container it should load its data. Does this make sense?
> 
> I configured HDFS container nodes via the core-site.xml in
> $HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
> setup.
> 
> I configured Spark the same way: I placed core-site.xml and
> hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.
>
> A Spark job submitted via spark-submit to the Spark master that loads
> from HDFS only ever shows data locality ANY.
>
> It would be great if anybody would help me get the right configuration!
> 
> Thanks and best regards,
> on
> 



[Spark 2.0.2 HDFS]: no data locality

2016-12-26 Thread Karamba
Hi,

I am running a couple of docker hosts, each with an HDFS node and a
Spark worker in a Spark standalone cluster.
In order to get data locality awareness, I would like to configure racks
for each host, so that a Spark worker container knows from which HDFS
node container it should load its data. Does this make sense?

I configured HDFS container nodes via the core-site.xml in
$HADOOP_HOME/etc and this works. hdfs dfsadmin -printTopology shows my
setup.
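
For context, the rack mapping mentioned here is wired up in core-site.xml
roughly like this; the script path is only an example:

```xml
<!-- core-site.xml: name an executable that maps IPs/hostnames to racks -->
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>
```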

I configured Spark the same way: I placed core-site.xml and
hdfs-site.xml in SPARK_CONF_DIR ... BUT this has no effect.

A Spark job submitted via spark-submit to the Spark master that loads
from HDFS only ever shows data locality ANY.

It would be great if anybody would help me get the right configuration!

Thanks and best regards,
on
