Set Job Descriptions for Scala application
Hello,

My Spark application is written in Scala and submitted to a Spark cluster in standalone mode. The Spark jobs for my application are listed in the Spark UI like this:

Job Id  Description  ...
6       saveAsTextFile at Foo.scala:202
5       saveAsTextFile at Foo.scala:201
4       count at Foo.scala:188
3       collect at Foo.scala:182
2       count at Foo.scala:162
1       count at Foo.scala:152
0       collect at Foo.scala:142

Is it possible to assign job descriptions to all these jobs in my Scala code?

Thanks!
Rares
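If it helps, SparkContext exposes setJobDescription (and setJobGroup), which tag the jobs triggered afterwards on the calling thread; the description then shows up in the Description column of the Spark UI. A minimal sketch, with hypothetical RDD and path names standing in for the ones in Foo.scala:

  // The description applies to jobs triggered after the call, on this thread.
  sc.setJobDescription("Count cleaned records")
  val n = records.count()

  // setJobGroup also lets the whole group be cancelled later if needed.
  sc.setJobGroup("export", "Write cleaned records to text files")
  records.saveAsTextFile("/tmp/records-out")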
Driver ID from spark-submit
Hello,

I am trying to use the default Spark cluster manager in a production environment. I will be submitting jobs with spark-submit. I wonder if the following is possible:

1. Get the Driver ID from spark-submit. We will use this ID to keep track of the job and kill it if necessary.
2. Run spark-submit in a mode where it exits and returns control to the user immediately after the job is submitted.

Thanks!
Rares
2 input paths generate 3 partitions
Hello,

I am using the Spark shell in Scala on localhost. I am using sc.textFile to read a directory. The directory (generated by another Spark script) looks like this:

part-0
part-1
_SUCCESS

The part-0 file has four short lines of text, while part-1 has two short lines of text. The _SUCCESS file is empty. When I check the number of partitions on the RDD I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check the number of lines in each file and try to generate three balanced partitions?

Thanks!
Rares
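For reference, sc.textFile takes a second argument, minPartitions, which in the shell defaults to a small value (typically 2) and acts as a lower bound on the number of splits, so even two small files can come back as three partitions once the larger one is split. A minimal sketch, with a hypothetical path standing in for the directory above:

  // minPartitions is only a lower bound: the underlying FileInputFormat may
  // still split a file that is larger than its computed goal size.
  // Passing 1 here usually yields one partition per small file.
  val foo = sc.textFile("/path/to/output-dir", minPartitions = 1)
  println(foo.partitions.length)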
Re: 2 input paths generate 3 partitions
Hi,

I am not using HDFS; I am using the local file system. Moreover, I did not modify defaultParallelism. The Spark instance is the default one started by the Spark shell.

Thanks!
Rares

On Fri, Mar 27, 2015 at 4:48 PM, java8964 java8...@hotmail.com wrote:

The files sound too small to be 2 blocks in HDFS. Did you set defaultParallelism to 3 in your Spark?

Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38

Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may produce multiple partitions if it consists of multiple blocks. In your case, I think there is one file with 2 splits.

Thanks.
Zhan Zhang
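A quick way to see how the lines were actually distributed across the three partitions (a minimal sketch for the Spark shell, assuming foo is the RDD created with sc.textFile above):

  // Print how many lines ended up in each partition; with the default
  // minPartitions of 2, the larger part file may have been split in two.
  foo.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx: $n lines") }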
Set spark.fileserver.uri on private cluster
Hi,

I have a private cluster with private IPs (192.168.*.*) and a gateway node with both a private IP (192.168.*.*) and a public internet IP. I set up the Spark master on the gateway node and set SPARK_MASTER_IP to the private IP. I start Spark workers on the private nodes. It works fine.

The problem is with spark-shell. I start it from the gateway node with --master and --conf spark.driver.host using the private IP. The shell starts alright, but when I try to run a job I get Connection refused errors from the RDD operations. I checked the Environment tab for the shell and noticed that spark.fileserver.uri and spark.repl.class.uri both use the public IP of the gateway. On the other hand, spark.driver.host uses the private IP as expected. Setting spark.fileserver.uri or spark.repl.class.uri with --conf does not help. It seems these values are not read but calculated.

Thanks!
Rares
takeSample triggers 2 jobs
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)

and I notice that two jobs are generated on the Spark Jobs page:

Job Id  Description
1       takeSample at <console>:13
0       takeSample at <console>:13

Any ideas why the two jobs are needed?

Thanks!
Rares
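One plausible explanation (an educated guess, not something confirmed in this thread): takeSample needs to know how many elements the RDD holds before it can draw a sample of a fixed size, so it ends up running roughly the equivalent of the two actions below, each of which shows up as its own job:

  // Rough sketch of the idea only, not the actual takeSample implementation.
  val rdd = sc.textFile("README.md")
  val total = rdd.count()                          // first job: size of the RDD
  val fraction = math.min(1.0, 3.0 * 1.2 / total)  // oversample a bit to likely get >= 3
  val sampled = rdd.sample(false, fraction).collect() // second job: draw the sample
  val result = sampled.take(3)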