Solved: putting HADOOP_CONF_DIR in spark-env.sh on the workers solved the
problem.
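For the record, this is all we had to add on each worker; the path below is
just our layout, adjust it to wherever your cluster keeps the HA conf:

    # conf/spark-env.sh on every worker (the path is an assumption, yours may differ)
    export HADOOP_CONF_DIR=/etc/hadoop/conf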


The difference between HadoopRDD and NewHadoopRDD is that the old one
creates the JobConf on the worker side, whereas the new one creates an
instance of the JobConf on the driver side and then broadcasts it.
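To illustrate, here is roughly what the two code paths look like from the
shell (hdfs://mycluster is a placeholder for our HA nameservice, and the
imports are spelled out for clarity):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Old API: backed by HadoopRDD, the JobConf is built lazily on each
    // worker, so every worker needs to see the same Hadoop configuration.
    val oldApi = sc.textFile("hdfs://mycluster/some/path")

    // New API: backed by NewHadoopRDD, the conf is captured on the driver
    // and shipped to the workers, so the HA nameservice resolves there too.
    val newApi = sc.newAPIHadoopFile(
      "hdfs://mycluster/some/path",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sc.hadoopConfiguration
    ).map(_._2.toString)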

I tried creating the HadoopRDD myself and tweaked a few things in order to
log the properties in the conf when it is loaded on the worker side and on
the driver side. On the worker side I see a dummy conf that looks like the
default conf to me, whereas on the driver side I get the right conf with the
namenodes etc.
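I actually tweaked HadoopRDD itself to log this, but the same comparison can
be done from the shell along these lines (simplified; the worker-side output
ends up in the executor stdout logs):

    val keys = Seq("fs.defaultFS", "dfs.nameservices")

    // Driver side: reflects the driver's HADOOP_CONF_DIR.
    keys.foreach(k => println(s"driver: $k = ${sc.hadoopConfiguration.get(k)}"))

    // Worker side: a Configuration created inside a task reflects what the
    // worker process sees, which is where the dummy/default conf showed up.
    sc.parallelize(1 to 1, 1).foreach { _ =>
      val conf = new org.apache.hadoop.conf.Configuration()
      keys.foreach(k => println(s"worker: $k = ${conf.get(k)}"))
    }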

My guess is that HADOOP_CONF_DIR is not shared with the workers when it is
set only on the driver (it was not defined in spark-env)?

Also, wouldn't it be more natural to create the conf on the driver side and
then share it with the workers?





2014-05-09 10:51 GMT+02:00 Eugen Cepoi <cepoi.eu...@gmail.com>:

> Hi,
>
> I am seeing some strange behaviour when using textFile to read some data
> from HDFS in Spark 0.9.1.
> I get UnknownHost exceptions, where the Hadoop client tries to resolve the
> dfs.nameservices and fails.
>
> So far:
>  - this has been tested inside the shell
>  - the exact same code works with spark-0.8.1
>  - the shell is launched with HADOOP_CONF_DIR pointing to our HA conf
>  - if some other RDD is created from HDFS beforehand and succeeds, then
> this works too (might be related to the way the default Hadoop
> configuration is being shared?)
>  - if using the new MR API it works:
>      sc.newAPIHadoopFile(path, classOf[TextInputFormat],
>        classOf[LongWritable], classOf[Text],
>        sc.hadoopConfiguration).map(_._2.toString)
>
> Hadoop distribution: 2.0.0-cdh4.1.2
> Spark 0.9.1, packaged with the correct version of Hadoop
>
> Eugen
>
