RE: cannot read file from a local path

2014-09-11 Thread Mozumder, Monir
I am seeing the same issue with Spark 1.0.1 (tried with a file:// URI for the local file):



scala> val lines = sc.textFile("file:///home/monir/.bashrc")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val linecount = lines.count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

-Original Message-
From: wsun
Sent: Feb 03, 2014; 12:44pm
To: u...@spark.incubator.apache.org
Subject: cannot read file from a local path


After installing Spark 0.8.1 on an EC2 cluster, I launched the Spark shell on the
master. This is what happened to me:

scala> val textFile = sc.textFile("README.md")
14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with curMem=0, maxMem=4082116853
14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 33.6 KB, free 3.8 GB)
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12


scala> textFile.count()
14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available
14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
        at scala.Option.getOrElse(Option.scala:108)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
        at scala.Option.getOrElse(Option.scala:108)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
        at org.apache.spark.rdd.RDD.count(RDD.scala:698)


Spark seems to be looking for README.md in HDFS, but I did not specify that the
file is located in HDFS. I am just wondering whether there is any configuration in
Spark that forces it to read files from the local file system. Thanks in advance
for any help.
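
A minimal sketch of the usual approach, assuming the path below is only illustrative
(it is not one taken from this thread): an explicit file:// URI makes sc.textFile read
from the local file system instead of resolving the relative path against the HDFS
namenode that the EC2 scripts configure as the default. In a non-local deployment the
file must then also exist at the same path on every worker node.

// Sketch only; the path is an example.
val localReadme = sc.textFile("file:///root/spark/README.md")
println(localReadme.count())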

wp




RE: cannot read file from a local path

2014-09-11 Thread Mozumder, Monir
Starting spark-shell in local mode seems to solve this, but even then it cannot 
recognize a file whose name begins with a '.' (a possible workaround is sketched 
after the example below):

MASTER=local[4] ./bin/spark-shell

...
scala> val lineCount = sc.textFile("/home/monir/ref").count
lineCount: Long = 68

scala> val lineCount2 = sc.textFile("/home/monir/.ref").count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.ref
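
A possible workaround sketch, assuming the hidden file is small enough to read on the
driver. As far as I can tell, Hadoop's FileInputFormat silently filters out input paths
whose names start with '.' or '_', which is why the dot-file is reported as missing even
though it exists:

import scala.io.Source

// Read the hidden file with plain Scala I/O on the driver, then parallelize its
// lines so the count still runs as a Spark job (path as in the example above).
val refLines = Source.fromFile("/home/monir/.ref").getLines().toSeq
val lineCount2 = sc.parallelize(refLines).count()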


Though I am OK with running spark-shell in local mode for the basic examples, I was 
wondering whether reading local files on the cluster nodes is possible when all of 
the worker nodes have the file in question on their local file systems.
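
As far as I understand, it is: with a non-local master, a file:// path is opened by the
executors themselves, so it only works if the file exists at the same path on every
worker node (and it is still subject to the hidden-file filtering mentioned above).
A sketch, reusing the path from the local-mode example:

// Assumes /home/monir/ref is present at the same path on every worker node.
val clusterLineCount = sc.textFile("file:///home/monir/ref").count()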

I am still fairly new to Spark, so bear with me if this is easily tunable by some 
config params.

Bests,
-Monir






How Spark parallelize maps slices to tasks/executors/workers

2014-09-04 Thread Mozumder, Monir

I have a 2-node cluster setup where each node has 4 cores:

    MASTER
    (Worker-on-master)  (Worker-on-node1)

    slaves file: master, node1
    SPARK_WORKER_INSTANCES=1


I am trying to understand Spark's parallelize behavior. The SparkPi example has 
this code:

// In SparkPi, `spark` is the SparkContext and `random` is scala.math.random.
val slices = 8
val n = 10 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)


As per the documentation, "Spark will run one task for each slice of the cluster. 
Typically you want 2-4 slices for each CPU in your cluster." I set slices to 8, 
which means the working set will be divided among 8 tasks on the cluster; in turn, 
each worker node gets 4 tasks (1:1 per core).
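
A small sketch of one way to check this from the shell (the values follow the snippet
above; the comments describe what I would expect, not measurements from this cluster):

// Each slice passed to parallelize becomes one partition, and each partition
// is processed by exactly one task in the stage that computes it.
val slices = 8
val n = 10 * slices
val data = sc.parallelize(1 to n, slices)

// Number of partitions == number of tasks launched for a stage over this RDD.
println(data.partitions.length)

// Per-partition element counts, i.e. the working set handed to each task.
val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
println(sizes.mkString(", "))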

Questions:
   i)   Where can I see task-level details? Under the executors view I don't see a 
        task breakdown, so I cannot see the effect of slices in the UI.
   ii)  How can I programmatically find the working-set size for the map function 
        above? I assume it is n/slices (10 above).
   iii) Are the multiple tasks run by an executor executed sequentially, or in 
        parallel in multiple threads?
   iv)  What is the reasoning behind 2-4 slices per CPU?
   v)   I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to 
        number of

Bests,
-Monir


RE: Benchmark on physical Spark cluster

2014-08-12 Thread Mozumder, Monir
An on-list follow-up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks 
promising, as it has Spark as one of the platforms used.

Bests,
-Monir


From: Mozumder, Monir
Sent: Monday, August 11, 2014 7:18 PM
To: user@spark.apache.org
Subject: Benchmark on physical Spark cluster

I am trying to get some workloads or benchmarks to run on a physical Spark cluster 
and find the relative speedups across different physical clusters.

The instructions at 
https://databricks.com/blog/2014/02/12/big-data-benchmark.html use Amazon EC2. 
I was wondering if anyone has other benchmarks for Spark on physical clusters. 
I am hoping to find a CloudSuite-like suite for Spark.

Bests,
-Monir