Cool thanks!

I have a CDH 5.4.8 cluster (Cloudera "Starving Developers" version) with 1 NN and 4 DNs, and Spark is running but it's 1.3.x. I want to leverage this HDFS/Hive cluster for SparkR because we do all our data munging here and produce datasets for ML. I am thinking of the following idea:

1. Add 2 datanodes to the existing HDFS cluster through Cloudera Manager
2. Don't add any Spark service to these two new nodes
3. Download and install the latest Spark 1.5.1 on these two datanodes
4. Download and install R on these two datanodes
5. Configure Spark with 1 master and 1 slave on one node; on the second node, configure a slave only

Will report back if this works! thanks

sanjay

From: shenLiu <lius...@outlook.com>
To: Sanjay Subramanian <sanjaysubraman...@yahoo.com>; User <user@spark.apache.org>
Sent: Monday, November 9, 2015 10:23 PM
Subject: RE: Is it possible Running SparkR on 2 nodes without HDFS

Hi Sanjay,

It's better to use HDFS. Otherwise you should have copies of the csv file on all worker nodes with the same path.

regards
Shawn
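For illustration, a minimal sketch of the HDFS approach Shawn describes, written against the SparkR 1.5 API. The namenode host/port and the /data directory are placeholder assumptions, not taken from this thread:

    # Assumes the CSV was first copied into HDFS, e.g.:
    #   hdfs dfs -put all_adleads_cleaned_commas_in_quotes_good_file.csv /data/
    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    # Every executor reads the same HDFS path, so no per-node local copies are needed.
    # The namenode host and port below are placeholders.
    lds <- read.df(sqlContext,
                   "hdfs://ip-xx-ppp-vv-ddd:8020/data/all_adleads_cleaned_commas_in_quotes_good_file.csv",
                   "com.databricks.spark.csv", header="true")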
Date: Tue, 10 Nov 2015 02:06:16 +0000
From: sanjaysubraman...@yahoo.com.INVALID
To: user@spark.apache.org
Subject: Is it possible Running SparkR on 2 nodes without HDFS

hey guys

I have a 2 node SparkR (1 master, 1 slave) cluster on AWS using spark-1.5.1-bin-without-hadoop.tgz.

Running the SparkR job on the master node:

    /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0 --executor-cores 16 --num-executors 8 --executor-memory 8G --driver-memory 8g myRprogram.R

fails with:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 103, xx.ff.rr.tt): java.io.FileNotFoundException: File file:/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecord

myRprogram.R:

    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    lds <- read.df(sqlContext, "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv", "com.databricks.spark.csv", header="true")
    sink("file:///mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
    summary(lds)

This used to run when we had a single node SparkR installation.

regards
sanjay
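A hedged sketch of how myRprogram.R above could be adjusted, assuming the SparkR 1.5 API: with --master spark://... a file:// URI is resolved locally on each executor, so the CSV must sit at that exact path on every worker, not just the driver; sink() is usually given a plain local path rather than a file:// URI; and summary() on a SparkR DataFrame returns another DataFrame, so showDF() is needed for its rows to land in the sink file. A sketch, not a tested fix:

    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    # A file:// path is read locally on each executor, so this only works if the
    # CSV has been copied to this same path on every worker node.
    lds <- read.df(sqlContext,
                   "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv",
                   "com.databricks.spark.csv", header="true")

    # sink() runs on the driver; give it a plain local path, not a file:// URI.
    sink("/mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
    # summary() returns a DataFrame; showDF() prints its rows into the sink.
    showDF(summary(lds))
    sink()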