Hi,

I am trying to load a CSV file that is stored on HDFS. I have two machines: IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55). Both have Spark 1.6.0 pre-built for Hadoop 2.6 and later, but both also have an existing Hadoop cluster running Hadoop 1.0.4. I launched HDFS from 172.26.49.156 by running start-dfs.sh on it, copied files from the local file system to HDFS, and can view them with hadoop fs -ls.
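For context, in Hadoop 1.x the address the NameNode listens on is normally set by the fs.default.name property in conf/core-site.xml. Below is a minimal sketch, assuming Python 3, of how the host and port could be read from that file. The function name namenode_address and the SAMPLE document are hypothetical and only for illustration; the real file on this cluster may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def namenode_address(core_site_xml):
    """Return (host, port) from fs.default.name in a core-site.xml string.

    Returns None if the property is not present.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            uri = urlparse(prop.findtext("value", "").strip())
            return uri.hostname, uri.port
    return None


# Hypothetical sample; the value is an assumption, not the actual file contents.
SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://172.26.49.156:54310</value>
  </property>
</configuration>"""

print(namenode_address(SAMPLE))
```

To run this against a real installation, the contents of core-site.xml under the Hadoop conf directory would be read into a string and passed to the function in place of SAMPLE.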
However, when I try to load the CSV file from the pyspark shell on IMPETUS-1325 (172.26.49.55), launched with

    bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0

using the following commands:

    >>> from pyspark.sql import SQLContext
    >>> sqlContext = SQLContext(sc)
    >>> patients_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").load("hdfs://172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv")

I get the following error:

    java.io.EOFException: End of File Exception between local host is: "IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55"; destination host is: "IMPETUS-1466":54310; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException

I changed the port number from 54310 to 8020, but then I get this error:

    java.net.ConnectException: Call From IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55 to IMPETUS-1466:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

It seemed to me that this might result from a version mismatch between the Hadoop client bundled with Spark and my Hadoop cluster, so I made the following changes:

1) Added the following lines to conf/spark-env.sh:

    export HADOOP_HOME="/usr/local/hadoop-1.0.4"
    export HADOOP_CONF_DIR="$HADOOP_HOME/conf"
    export HDFS_URL="hdfs://172.26.49.156:8020"

2) Downloaded Spark 1.6.0 pre-built with user-provided Hadoop and, in addition to the three lines above, added the following line to conf/spark-env.sh:

    export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"

Neither change seems to work. However, the following command works from 172.26.49.55 and gives the directory listing:

    /usr/local/hadoop-1.0.4/bin/hadoop fs -ls hdfs://172.26.49.156:54310/

Any suggestions?

Thanks,
Bibudh

--
Bibudh Lahiri
Data Scientist, Impetus Technologies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/