Hello friends:
I recently compiled and installed Spark v0.9 from the Apache distribution. Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the entire big-data suite from CDH is installed), but for the moment I'm using my manually built Apache Spark for 'ground-up' learning purposes.
Now, prior to compilation I specified the following (and then ran 'sbt/sbt clean compile'):
export SPARK_YARN=true
export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
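Consolidated, the build looked like this (note: the v0.9 build docs call for the 'assembly' target, which produces the jar that the examples and pyspark load, so I include it here):

user$ export SPARK_YARN=true
user$ export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
user$ sbt/sbt clean assembly   # 'assembly' target per the Spark 0.9 build docs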
The resulting examples ran fine locally as well as on YARN. I'm not interested in YARN here; I mention it only for completeness, in case it matters for my question below. Here is my issue/question:
I start pyspark locally (on one machine, for API learning purposes) as shown below, and attempt to interact with a local text file (not in HDFS). Unfortunately, the SparkContext (sc) tries to connect to an HDFS Name Node, which I don't currently have enabled because I don't need it. The SparkContext cleverly inspects the configuration in my '/etc/hadoop/conf/' directory to learn where my Name Node is, but I don't want it to do that in this case. I just want to run a one-machine, local instance of 'pyspark'.
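One workaround I'm aware of, though I haven't verified it in this setup: Hadoop's FileSystem layer resolves paths by their URI scheme, so an explicit 'file://' prefix should force the local filesystem regardless of what '/etc/hadoop/conf/' points to. For example:

>>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')  # explicit local-fs scheme
>>> distData.count()

Still, I'd prefer a configuration-level fix over prefixing every path.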
Did I miss something in my invocation/use of 'pyspark' below? Do I need to add something else? (By the way: I searched but could not find a solution, and the documentation, while good, doesn't quite get me there.)
See below, and thank you all in advance!
user$ export PYSPARK_PYTHON=/usr/bin/bpython
user$ export MASTER=local[8]
user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
#
===========================================================================================
>>> sc
<pyspark.context.SparkContext object at 0x24f0f50>
>>>
>>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
>>> distData.count()
[ ... snip ... ]
Py4JJavaError: An error occurred while calling o21.collect.
: java.net.ConnectException: Call From server01/192.168.0.15 to
namenode:8020 failed on connection exception:
java.net.ConnectException: Connection refused; For more details
see: http://wiki.apache.org/hadoop/ConnectionRefused
[ ... snip ... ]
>>>
>>>
#
===========================================================================================
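The only other lead I have: I believe my standalone build picks up '/etc/hadoop/conf/' through the HADOOP_CONF_DIR environment variable (which, I assume, the CDH packages set globally), so clearing it before launch might suffice. This is a guess on my part:

user$ unset HADOOP_CONF_DIR   # speculative: keep pyspark from finding the CDH configs
user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark

Is that the sanctioned approach, or is there a proper Spark-side setting for a purely local run?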
--
Sincerely,
DiData