Cool thanks!

I have a CDH 5.4.8 cluster (Cloudera "Starving Developers" version) with 1 NN and 4 DNs, and Spark is running but it's 1.3.x. I want to leverage this HDFS/Hive cluster for SparkR because we do all our data munging here and produce datasets for ML. I am thinking of the following idea:

1. Add 2 datanodes to the existing HDFS cluster through Cloudera Manager
2. Don't add any Spark service to these two new nodes
3. Download and install the latest Spark 1.5.1 on these two datanodes
4. Download and install R on these two datanodes
5. Configure Spark with 1 master and 1 slave on one node; on the second node, configure a slave only

Will report back if this works! thanks

sanjay

From: shenLiu <lius...@outlook.com>
To: Sanjay Subramanian <sanjaysubraman...@yahoo.com>; User <user@spark.apache.org>
Sent: Monday, November 9, 2015 10:23 PM
Subject: RE: Is it possible Running SparkR on 2 nodes without HDFS

Hi Sanjay,

It's better to use HDFS. Otherwise you should have copies of the csv file on all worker nodes with the same path.

regards
Shawn
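For illustration, a minimal sketch of the HDFS approach Shawn describes, written against the SparkR 1.5 API. The namenode host/port and the /data directory are placeholder assumptions, not taken from this thread:

    # Assumes the CSV was first copied into HDFS, e.g.:
    #   hdfs dfs -put all_adleads_cleaned_commas_in_quotes_good_file.csv /data/
    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    # Every executor reads the same HDFS path, so no per-node local copies are needed.
    # The namenode host and port below are placeholders.
    lds <- read.df(sqlContext,
                   "hdfs://ip-xx-ppp-vv-ddd:8020/data/all_adleads_cleaned_commas_in_quotes_good_file.csv",
                   "com.databricks.spark.csv", header="true")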
Date: Tue, 10 Nov 2015 02:06:16 +0000
From: sanjaysubraman...@yahoo.com.INVALID
To: user@spark.apache.org
Subject: Is it possible Running SparkR on 2 nodes without HDFS

hey guys

I have a 2 node SparkR (1 master, 1 slave) cluster on AWS using spark-1.5.1-bin-without-hadoop.tgz.

Running the SparkR job on the master node:

    /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0 --executor-cores 16 --num-executors 8 --executor-memory 8G --driver-memory 8g myRprogram.R

fails with:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 103, xx.ff.rr.tt): java.io.FileNotFoundException: File file:/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecord

myRprogram.R:

    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    lds <- read.df(sqlContext, "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv", "com.databricks.spark.csv", header="true")
    sink("file:///mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
    summary(lds)

This used to run when we had a single node SparkR installation.

regards
sanjay
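A hedged sketch of how myRprogram.R above could be adjusted, assuming the SparkR 1.5 API: with --master spark://... a file:// URI is resolved locally on each executor, so the CSV must sit at that exact path on every worker, not just the driver; sink() is usually given a plain local path rather than a file:// URI; and summary() on a SparkR DataFrame returns another DataFrame, so showDF() is needed for its rows to land in the sink file. A sketch, not a tested fix:

    library(SparkR)

    sc <- sparkR.init(appName="SparkR-CancerData-example")
    sqlContext <- sparkRSQL.init(sc)

    # A file:// path is read locally on each executor, so this only works if the
    # CSV has been copied to this same path on every worker node.
    lds <- read.df(sqlContext,
                   "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv",
                   "com.databricks.spark.csv", header="true")

    # sink() runs on the driver; give it a plain local path, not a file:// URI.
    sink("/mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
    # summary() returns a DataFrame; showDF() prints its rows into the sink.
    showDF(summary(lds))
    sink()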