Set Job Descriptions for Scala application

2015-08-05 Thread Rares Vernica
Hello,

My Spark application is written in Scala and submitted to a Spark cluster
in standalone mode. The Spark Jobs for my application are listed in the
Spark UI like this:

Job Id Description ...
6  saveAsTextFile at Foo.scala:202
5  saveAsTextFile at Foo.scala:201
4  count at Foo.scala:188
3  collect at Foo.scala:182
2  count at Foo.scala:162
1  count at Foo.scala:152
0  collect at Foo.scala:142


Is it possible to assign Job Descriptions to all these jobs in my Scala
code?
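
Something along these lines is what I am hoping for (just a sketch of my intent;
I am assuming SparkContext.setJobDescription / setJobGroup label the jobs
triggered after the call, and the RDD and output path below are placeholders):

val data = sc.parallelize(1 to 10)                      // stand-in for my real RDDs

sc.setJobDescription("Write final results")             // hoped for: shown in the Job UI
data.saveAsTextFile("/tmp/results")                     // placeholder output path

sc.setJobGroup("phase-1", "Count and validate input")   // or label a group of related jobs
data.count()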

Thanks!
Rares


Driver ID from spark-submit

2015-04-27 Thread Rares Vernica
Hello,

I am trying to use the default Spark cluster manager in a production
environment. I will be submitting jobs with spark-submit. I wonder if the
following is possible:

1. Get the Driver ID from spark-submit. We will use this ID to keep track
of the job and kill it if necessary.

2. Whether it is possible to run spark-submit in a mode where it exits and
returns control to the user immediately after the job is submitted (see the
sketch below).
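
Roughly the workflow I have in mind (only a sketch; I am assuming standalone
cluster deploy mode, that spark-submit prints a driver/submission ID, and that
--status/--kill options are available, all of which may depend on the Spark
version; the class, jar, and ID below are made up):

# submit the driver to the cluster; spark-submit returns and prints a driver ID
spark-submit --master spark://master:7077 --deploy-mode cluster \
  --class com.example.MyApp /path/to/myapp.jar

# later, check on or kill the driver using that ID
spark-submit --master spark://master:7077 --status driver-20150427-0001
spark-submit --master spark://master:7077 --kill driver-20150427-0001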

Thanks!
Rares


2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hello,

I am using the Spark shell in Scala on the localhost. I am using sc.textFile
to read a directory. The directory looks like this (generated by another
Spark script):

part-0
part-1
_SUCCESS


part-0 has four short lines of text, while part-1 has two. The _SUCCESS
file is empty. When I check the number of
partitions on the RDD I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3


I wonder why the two input files generate three partitions. Does Spark check
the number of lines in each file and try to generate three balanced
partitions?

Thanks!
Rares


Re: 2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hi,

I am not using HDFS, I am using the local file system. Moreover, I did not
modify the defaultParallelism. The Spark instance is the default one
started by Spark Shell.

Thanks!
Rares


On Fri, Mar 27, 2015 at 4:48 PM, java8964 java8...@hotmail.com wrote:

 The files sound too small to be 2 blocks in HDFS.

 Did you set the defaultParallelism to 3 in your Spark configuration?

 Yong

 --
 Subject: Re: 2 input paths generate 3 partitions
 From: zzh...@hortonworks.com
 To: rvern...@gmail.com
 CC: user@spark.apache.org
 Date: Fri, 27 Mar 2015 23:15:38 +


 Hi Rares,

  The number of partitions is controlled by the HDFS input format, and one
 file may have multiple partitions if it consists of multiple blocks. In your
 case, I think there is one file with 2 splits.
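
 For example (a rough sketch, not verified against your directory; if I
 remember correctly, the second argument of textFile, minPartitions, defaults
 to min(defaultParallelism, 2), which is what makes one of your files split
 into two pieces):

 // ask for at least 1 partition instead of the default 2 (the path is a placeholder)
 val foo = sc.textFile("/path/to/output-dir", minPartitions = 1)
 foo.partitions.length   // likely one partition per file now, i.e. 2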

  Thanks.

  Zhan Zhang


Set spark.fileserver.uri on private cluster

2015-03-17 Thread Rares Vernica
Hi,

I have a private cluster with private IPs, 192.168.*.*, and a gateway node
with both a private IP, 192.168.*.*, and a public internet IP.

I set up the Spark master on the gateway node and set SPARK_MASTER_IP to
the private IP. I start Spark workers on the private nodes. It works fine.

The problem is with spark-shell. I start it from the gateway node with
--master and --conf spark.driver.host set to the private IP. The shell
starts fine, but when I try to run a job I get Connection refused errors
from the RDD operations.

I checked the Environment page for the shell and noticed that
spark.fileserver.uri and spark.repl.class.uri are both using the public IP
of the gateway. On the other hand, spark.driver.host is using the private IP
as expected.

Setting spark.fileserver.uri or spark.repl.class.uri with --conf does not
help. It seems these values are not read from the configuration but computed.
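
What I am considering trying next (a sketch only; 192.168.1.10 stands in for
the gateway's private IP, and I am assuming the SPARK_LOCAL_IP environment
variable influences the address these URIs are computed from):

SPARK_LOCAL_IP=192.168.1.10 ./bin/spark-shell \
  --master spark://192.168.1.10:7077 \
  --conf spark.driver.host=192.168.1.10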

Thanks!
Rares


takeSample triggers 2 jobs

2015-03-06 Thread Rares Vernica
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)


and I notice that two jobs are generated on the Spark Jobs page:

Job Id Description
1  takeSample at <console>:13
0  takeSample at <console>:13


Any ideas why the two jobs are needed?
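
My guess at what is happening (just a sketch of what I imagine takeSample does
internally, not the actual implementation; the over-sampling factor is made up):

// job 0: takeSample presumably needs the total number of elements first
val rdd = sc.textFile("README.md")
val total = rdd.count()

// job 1: then sample roughly the requested number of elements and collect a few
val fraction = math.min(1.0, 3.0 * 1.2 / total)   // 1.2 is a made-up over-sampling factor
rdd.sample(false, fraction).take(3)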

Thanks!
Rares