I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt Spark binary. I also downloaded the latest Anaconda Python and started an IPython notebook server by running the command below:
    ipython notebook --port 9999 --profile nbserver --no-browser

Then I tried to develop a Spark application running on top of YARN interactively in the IPython notebook. Here is the code that I have written:

    import sys
    import os

    # make the Spark Python libraries importable before importing pyspark
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip')

    os.environ["YARN_CONF_DIR"] = "/home/hadoop/conf"
    os.environ["SPARK_HOME"] = "/home/hadoop/bwang/spark-1.3.1-bin-hadoop2.4"

    from pyspark import SparkContext, SparkConf

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("Spark ML")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

    data = sc.textFile("hdfs://ec2-xx.xx.xx.xxxx.compute-1.amazonaws.com:8020/data/*")
    data.count()

The code works all the way up to the count, which fails with "com.hadoop.compression.lzo.LzoCodec not found". The full log is here: http://www.wepaste.com/sparkcompression/

I did some searching, and the problem is basically that Spark cannot find the LZO codec library. I tried using os.environ to set SPARK_CLASSPATH and SPARK_LIBRARY_PATH to include hadoop-lzo.jar, which on this AWS Hadoop install is located at /home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/lib/hadoop-lzo.jar. However, it is still not working. Can anyone show me how to solve this problem?
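For reference, here is roughly what I tried (a minimal sketch only; the jar path is the one I found on this AMI, and the spark.driver/executor.extraClassPath lines are just my guess at an alternative way to pass the jar, not something I know is the correct fix):

    import os
    from pyspark import SparkConf

    # hadoop-lzo.jar as shipped on this EMR AMI; the path may differ
    # on other AMI/Hadoop versions
    lzo_jar = "/home/hadoop/.versions/2.4.0-amzn-4/share/hadoop/common/lib/hadoop-lzo.jar"

    # attempt 1: set the environment variables before the SparkContext
    # is created, as described above
    os.environ["SPARK_CLASSPATH"] = lzo_jar
    os.environ["SPARK_LIBRARY_PATH"] = os.path.dirname(lzo_jar)

    # attempt 2 (untried guess): the Spark 1.x config properties, in case
    # the YARN executors also need the jar on their classpath
    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("Spark ML")
            .set("spark.executor.memory", "2g")
            .set("spark.driver.extraClassPath", lzo_jar)
            .set("spark.executor.extraClassPath", lzo_jar))

Is one of these the right approach, or does the jar have to be passed some other way when running yarn-client from a notebook?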