RE: cannot read file from a local path
I am seeing this same issue with Spark 1.0.1 (tried with file:// for the local file):

scala> val lines = sc.textFile("file:///home/monir/.bashrc")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val linecount = lines.count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)

-----Original Message-----
From: wsun
Sent: Feb 03, 2014; 12:44pm
To: u...@spark.incubator.apache.org
Subject: cannot read file from a local path

After installing Spark 0.8.1 on an EC2 cluster, I launched the Spark shell on the master. This is what happened to me:

scala> val textFile = sc.textFile("README.md")
14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with curMem=0, maxMem=4082116853
14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 33.6 KB, free 3.8 GB)
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> textFile.count()
14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available
14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
        at scala.Option.getOrElse(Option.scala:108)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199)
        at scala.Option.getOrElse(Option.scala:108)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:199)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:886)
        at org.apache.spark.rdd.RDD.count(RDD.scala:698)

Spark seems to be looking for README.md in HDFS, but I did not specify that the file is located in HDFS. I am just wondering whether there is any configuration in Spark that forces Spark to read files from the local file system. Thanks in advance for any help.

wp
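On the question at the end of the quoted message, a minimal sketch, offered only as an educated guess: paths given without a scheme appear to be resolved against Hadoop's default filesystem, which the EC2 setup scripts typically point at HDFS, so the read can be pinned to the local filesystem either with an explicit file:// URI or by overriding that default. The paths below are hypothetical, sc.hadoopConfiguration assumes a SparkContext version that exposes it, and for a cluster read the file must also exist at the same path on every node that runs tasks.

// Option 1: an explicit file:// URI bypasses the cluster's default (HDFS) filesystem.
val readme = sc.textFile("file:///root/spark/README.md")   // hypothetical path
println(readme.count())

// Option 2: override the default filesystem on the Hadoop configuration the
// SparkContext carries (key is fs.defaultFS on Hadoop 2, fs.default.name on Hadoop 1).
sc.hadoopConfiguration.set("fs.defaultFS", "file:///")
val readme2 = sc.textFile("/root/spark/README.md")          // hypothetical path
println(readme2.count())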
RE: cannot read file from a local path
Starting spark-shell in local mode seems to solve this, but it still cannot recognize a file whose name begins with a '.':

MASTER=local[4] ./bin/spark-shell

scala> val lineCount = sc.textFile("/home/monir/ref").count
lineCount: Long = 68

scala> val lineCount2 = sc.textFile("/home/monir/.ref").count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.ref

Though I am OK with running spark-shell in local mode to get the basic examples running, I was wondering whether reading local files on the cluster nodes is possible when all of the worker nodes have the file in question in their local file system. I am still fairly new to Spark, so bear with me if this is easily tunable by some config params.

Bests,
-Monir
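Two hedged observations on the message above, plus a sketch. As far as I can tell, the dot-file failure comes from Hadoop's FileInputFormat (which textFile uses underneath) filtering out paths whose names start with '.' or '_', so /home/monir/.ref is reported as nonexistent rather than read. Reading node-local files from a cluster does generally work with a file:// URI, provided the file sits at the same path on the driver and on every worker. An alternative for a file that lives only on the driver is to ship it with sc.addFile and read it back through SparkFiles.get; a minimal sketch, where the path and the 4-partition dummy RDD are hypothetical:

import org.apache.spark.SparkFiles

// Ship a driver-local file to every executor's working directory.
sc.addFile("/home/monir/ref")

// Each task resolves its shipped copy via SparkFiles.get and reads it with
// plain Scala I/O; here every partition just reports the line count it sees.
val perTaskCounts = sc.parallelize(1 to 4, 4).map { _ =>
  scala.io.Source.fromFile(SparkFiles.get("ref")).getLines().size
}.collect()
perTaskCounts.foreach(println)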
How Spark's parallelize maps slices to tasks/executors/workers
I have a 2-node cluster setup where each node has 4 cores:

MASTER (Worker-on-master) (Worker-on-node1) (slaves(master,node1)) SPARK_WORKER_INSTANCES=1

I am trying to understand Spark's parallelize behavior. The SparkPi example has this code:

val slices = 8
val n = 10 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

As per the documentation: "Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster." I set slices to 8, which means the working set will be divided among 8 tasks on the cluster; in turn each worker node gets 4 tasks (1:1 per core).

Questions (a sketch touching on i and ii follows after this message):

i) Where can I see task-level details? Inside the executors I don't see a task breakdown, so I cannot see the effect of slices in the UI.
ii) How can I programmatically find the working-set size for the map function above? I assume it is n/slices (10 above).
iii) Are the multiple tasks run by an executor executed sequentially, or in parallel in multiple threads?
iv) What is the reasoning behind 2-4 slices per CPU?
v) I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to the number of

Bests,
-Monir
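On questions i) and ii), a small sketch of how the partition (slice) breakdown can be inspected from the shell itself rather than the web UI; this is only a rough illustration using the same slices and n as the post, not a statement about how the scheduler assigns tasks to executors:

val slices = 8
val n = 10 * slices
val data = sc.parallelize(1 to n, slices)

// Number of partitions == number of tasks the next stage will run.
println("partitions: " + data.partitions.size)

// Per-partition element counts, i.e. the working set each task receives.
val sizes = data.mapPartitionsWithIndex { (idx, it) =>
  Iterator((idx, it.size))
}.collect()
sizes.foreach { case (idx, count) =>
  println("partition " + idx + " -> " + count + " elements")
}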
RE: Benchmark on physical Spark cluster
An on-list follow-up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks promising, as it includes Spark as one of its platforms.

Bests,
-Monir

From: Mozumder, Monir
Sent: Monday, August 11, 2014 7:18 PM
To: user@spark.apache.org
Subject: Benchmark on physical Spark cluster

I am trying to get some workloads or benchmarks for running on a physical Spark cluster and to find relative speedups on different physical clusters. The instructions at https://databricks.com/blog/2014/02/12/big-data-benchmark.html use Amazon EC2. I was wondering if anyone has other benchmarks for Spark on physical clusters. I am hoping to find a CloudSuite-like suite for Spark.

Bests,
-Monir