Re: Comprehensive Port Configuration reference?

2014-06-08 Thread Andrew Ash
Hi Jacob,

The port configuration docs that we worked on together are now available
at:
http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security
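As a quick example, pinning a couple of the configurable ports from that page
looks roughly like this (property names as documented for Spark 1.0; the port
values are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("fixed-ports-example")
    .set("spark.driver.port", "51000")   // driver <-> executor communication
    .set("spark.ui.port", "4040")        // application web UI
  val sc = new SparkContext(conf)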

Thanks for the help!

Andrew


On Wed, May 28, 2014 at 3:21 PM, Jacob Eisinger  wrote:

> Howdy Andrew,
>
> This is a standalone cluster.  And, yes, if my understanding of Spark
> terminology is correct, you are correct about the port ownerships.
>
>
> Jacob
>
> Jacob D. Eisinger
> IBM Emerging Technologies
> jeis...@us.ibm.com - (512) 286-6075
>
>
>
> From: Andrew Ash 
> To: user@spark.apache.org
> Date: 05/28/2014 05:18 PM
>
> Subject: Re: Comprehensive Port Configuration reference?
> --
>
>
>
> Hmm, those do look like 4 listening ports to me.  PID 3404 is an executor
> and PID 4762 is a worker?  This is a standalone cluster?
>
>
> On Wed, May 28, 2014 at 8:22 AM, Jacob Eisinger <jeis...@us.ibm.com> wrote:
>
>Howdy Andrew,
>
>Here is what I ran before an application context was created (other
>services have been deleted):
>
>   # netstat -l -t tcp -p --numeric-ports
>   Active Internet connections (only servers)
>   Proto Recv-Q Send-Q Local Address        Foreign Address  State    PID/Program name
>   tcp6       0      0 10.90.17.100:        :::*             LISTEN   4762/java
>   tcp6       0      0 :::8081              :::*             LISTEN   4762/java
>
>And, then while the application context is up:
>   # netstat -l -t tcp -p --numeric-ports
>   Active Internet connections (only servers)
>   Proto Recv-Q Send-Q Local Address        Foreign Address  State    PID/Program name
>   tcp6       0      0 10.90.17.100:        :::*             LISTEN   4762/java
>   tcp6       0      0 :::57286             :::*             LISTEN   3404/java
>   tcp6       0      0 10.90.17.100:38118   :::*             LISTEN   3404/java
>   tcp6       0      0 10.90.17.100:35530   :::*             LISTEN   3404/java
>   tcp6       0      0 :::60235             :::*             LISTEN   3404/java
>   tcp6       0      0 :::8081              :::*             LISTEN   4762/java
>
>My understanding is that this says four ports are open.  Are 57286 and
>60235 not being used?
>
>
>Jacob
>
>Jacob D. Eisinger
>IBM Emerging Technologies
>jeis...@us.ibm.com - (512) 286-6075
>
>
>
>From: Andrew Ash <and...@andrewash.com>
>To: user@spark.apache.org
>Date: 05/25/2014 06:25 PM
>
>Subject: Re: Comprehensive Port Configuration reference?
>--
>
>
>
>Hi Jacob,
>
>The config option spark.history.ui.port is new for 1.0.  The problem
>that History server solves is that in non-Standalone cluster deployment
>modes (Mesos and YARN) there is no long-lived Spark Master that can store
>logs and statistics about an application after it finishes.  History server
>is the UI that renders logged data from applications after they complete.
>
>Read more here: https://issues.apache.org/jira/browse/SPARK-1276 and
>https://github.com/apache/spark/pull/204
>
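A minimal sketch of the application-side configuration that feeds the history
server (directory and values are illustrative, not authoritative):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("logged-app")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///spark-events")   // shared, durable location
  val sc = new SparkContext(conf)
  // The history server is started separately (sbin/start-history-server.sh),
  // reads that same directory, and serves its UI on spark.history.ui.port.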
>
>As far as the two vs four dynamic ports, are those all listening
>ports?  I did observe 4 ports in use, but only two of them were listening.
> The other two were the random ports used for responses on outbound
>connections, the source port of the (srcIP, srcPort, dstIP, dstPort) tuple
>that uniquely identifies a TCP socket.
>
>
>
> http://unix.stackexchange.com/questions/75011/how-does-the-server-find-out-what-client-port-to-send-to
>
> 
>
>Thanks for taking a look throug

RE: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-08 Thread innowireless TaeYun Kim
Without (C), what is the best practice to implement the following scenario?

1. rdd = sc.textFile(FileA)
2. rdd = rdd.map(...)  // actually modifying the rdd
3. rdd.saveAsTextFile(FileA)

Since the rdd transformation is 'lazy', rdd will not materialize until
saveAsTextFile(), so FileA must still exist at that point, yet it must be
deleted before saveAsTextFile().

What I can think of is:

3. rdd.saveAsTextFile(TempFile)
4. delete FileA
5. rename TempFile to FileA
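A minimal sketch of steps 3-5 using the Hadoop FileSystem API (paths are
illustrative; an HDFS-compatible file system and an existing SparkContext sc
are assumed):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fileA = new Path("hdfs:///data/FileA")      // illustrative path
  val temp  = new Path("hdfs:///data/FileA_tmp")  // illustrative path
  val fs    = FileSystem.get(sc.hadoopConfiguration)

  val rdd = sc.textFile(fileA.toString).map(line => line)  // placeholder transformation
  rdd.saveAsTextFile(temp.toString)   // step 3: materialize to a temp location
  fs.delete(fileA, true)              // step 4: delete the original (recursive)
  fs.rename(temp, fileA)              // step 5: move the temp output into place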

This is not very convenient...

Thanks.

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Tuesday, June 03, 2014 11:40 AM
To: user@spark.apache.org
Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing
file

(A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output
format check and overwrite files in the destination directory.
But it won't clobber the directory entirely. I.e. if the directory already
had "part1" "part2" "part3" "part4" and you write a new job outputting only
two files ("part1", "part2") then it would leave the other two files intact,
confusingly.

(B) Semantics in Spark 1.0 and later: Runs the Hadoop OutputFormat check, which
means the directory must not exist already or an exception is thrown.

(C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
Spark will delete/clobber an existing destination directory if it exists,
then fully over-write it with new data.

I'm fine to add a flag that allows (B) for backwards-compatibility reasons,
but my point was I'd prefer not to have (C) even though I see some cases
where it would be useful.

- Patrick

On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen  wrote:
> Is there a third way? Unless I miss something. Hadoop's OutputFormat 
> wants the target dir to not exist no matter what, so it's just a 
> question of whether Spark deletes it for you or errors.
>
> On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell 
wrote:
>> We can just add back a flag to make it backwards compatible - it was 
>> just missed during the original PR.
>>
>> Adding a *third* set of "clobber" semantics, I'm slightly -1 on that 
>> for the following reasons:
>>
>> 1. It's scary to have Spark recursively deleting user files, could 
>> easily lead to users deleting data by mistake if they don't 
>> understand the exact semantics.
>> 2. It would introduce a third set of semantics here for saveAsXX...
>> 3. It's trivial for users to implement this with two lines of code 
>> (if output dir exists, delete it) before calling saveAsHadoopFile.
>>
>> - Patrick
>>



Re: Classpath errors with Breeze

2014-06-08 Thread Xiangrui Meng
Hi Tobias,

Which file system and which encryption are you using?

Best,
Xiangrui

On Sun, Jun 8, 2014 at 10:16 PM, Xiangrui Meng  wrote:
> Hi dlaw,
>
> You are using breeze-0.8.1, but the spark assembly jar depends on
> breeze-0.7. If the spark assembly jar comes first on the classpath
> but the method from DenseMatrix is only available in breeze-0.8.1, you
> get a NoSuchMethodError. So,
>
> a) If you don't need the features in breeze-0.8.1, do not include it
> as a dependency.
>
> or
>
> b) Try an experimental feature by turning on
> spark.files.userClassPathFirst in your Spark configuration.
>
> Best,
> Xiangrui
>
> On Sun, Jun 8, 2014 at 10:08 PM, dlaw  wrote:
>> Thanks for the quick response. No, I actually build my jar via 'sbt package'
>> on EC2 on the master itself.
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Classpath-errors-with-Breeze-tp7220p7225.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Classpath errors with Breeze

2014-06-08 Thread Xiangrui Meng
Hi dlaw,

You are using breeze-0.8.1, but the spark assembly jar depends on
breeze-0.7. If the spark assembly jar comes first on the classpath
but the method from DenseMatrix is only available in breeze-0.8.1, you
get a NoSuchMethodError. So,

a) If you don't need the features in breeze-0.8.1, do not include it
as a dependency.

or

b) Try an experimental feature by turning on
spark.files.userClassPathFirst in your Spark configuration.
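A minimal sketch of option (b); the property name is as given above,
everything else is illustrative (it can also be passed via --conf with
spark-submit):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("breeze-app")
    .set("spark.files.userClassPathFirst", "true")  // prefer user jars (experimental)
  val sc = new SparkContext(conf)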

Best,
Xiangrui

On Sun, Jun 8, 2014 at 10:08 PM, dlaw  wrote:
> Thanks for the quick response. No, I actually build my jar via 'sbt package'
> on EC2 on the master itself.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Classpath-errors-with-Breeze-tp7220p7225.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Classpath errors with Breeze

2014-06-08 Thread dlaw
Thanks for the quick response. No, I actually build my jar via 'sbt package'
on EC2 on the master itself.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Classpath-errors-with-Breeze-tp7220p7225.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks for your reply Wei, will try this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks  a lot Krishna, this works for me.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Classpath errors with Breeze

2014-06-08 Thread Tobias Pfeiffer
Hi,

I had a similar problem; I was using `sbt assembly` to build a jar
containing all my dependencies, but since my file system has a problem
with long file names (due to disk encryption), some class files (which
correspond to functions in Scala) where not included in the jar I
uploaded. Although, thinking about it, that would result in a
ClassNotFound exception, not NoSuchMethod. Have you built your code
against a different version of the library than the jar you use in
EC2?
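For reference, a build definition that keeps the application's breeze version
aligned with the one in the Spark assembly might look roughly like this (an
illustrative sbt sketch; the project name and versions are examples only):

  // build.sbt (sbt 0.13 syntax)
  name := "breeze-app"                 // hypothetical project name

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % "1.0.0" % "provided",
    "org.scalanlp"     %  "breeze_2.10" % "0.7"    // match the assembly's breeze
  )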

Tobias

On Mon, Jun 9, 2014 at 1:52 PM, dlaw  wrote:
> I'm having some trouble getting a basic matrix multiply to work with Breeze.
> I'm pretty sure it's related to my classpath. My setup is a cluster on AWS
> with 8 m3.xlarges. To create the cluster I used the provided ec2 scripts and
> Spark 1.0.0.
>
> I've made a gist with the relevant pieces of my app:
>
> https://gist.github.com/dieterichlawson/e5e3ab158a09429706e0
>
> The app was created as detailed in the quick start guide.
>
> When I run it I get an error that says the method to multiply a dense matrix
> by a dense matrix does not exist:
>
> 14/06/09 04:49:09 WARN scheduler.TaskSetManager: Lost TID 90 (task 0.0:13)
> 14/06/09 04:49:09 INFO scheduler.TaskSetManager: Loss was due to
> java.lang.NoSuchMethodError:
> breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;
> [duplicate 46]
>
> I've tried a bunch of different things, including playing with the CLASSPATH
> and ADD_JARS environment variables, the --jars option on spark-submit, the
> version of breeze and scala, etc...
>
> I've also tried it in the spark-shell. It works there, so I don't really
> know what's going on. Any thoughts?
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Classpath-errors-with-Breeze-tp7220.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to achieve a reasonable performance on Spark Streaming

2014-06-08 Thread onpoq
Dear All,

I recently installed Spark 1.0.0 on a 10-slave dedicated cluster. However,
the max input rate that the system can sustain with stable latency seems
very low. I use a simple word-counting workload over tweets:

theDStream.flatMap(extractWordOnePairs).reduceByKey(sumFunc).count.print

With a 2s batch interval, the 10-slave cluster can only handle ~ 30,000
tweets/s (which translates to ~ 300,000 words/s). To give you a sense of
the speed of a slave machine, a single machine can handle ~ 100,000
tweets/s with a stream processing program written in plain Java.

I've tuned the following parameters without seeing obvious improvement
(a sketch of how they are set appears after this list):
1. Batch interval: 1s, 2s, 5s, 10s
2. Parallelism: 1 x total num of cores, 2x, 3x 
3. StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER
4. Run type: yarn-client, standalone cluster
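For reference, a minimal sketch of how those knobs are wired into this kind
of job (the class name, port, and values are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._   // pair-DStream ops (Spark 1.0)

  object TweetWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("TweetWordCount")                     // master supplied by spark-submit
        .set("spark.default.parallelism", "80")           // 2. e.g. 2x total cores (illustrative)
      val ssc = new StreamingContext(conf, Seconds(2))    // 1. batch interval

      val tweets = ssc.socketTextStream("localhost", 9999,
        StorageLevel.MEMORY_ONLY_SER)                     // 3. storage level under test

      tweets.flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _, 80)                           // 2. explicit reduce parallelism
        .count()
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
  }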

* My first question is: what are the max input rates you have observed on
Spark Streaming? I know it depends on the workload and the hardware. But I
just want to get some sense of the reasonable numbers.

* My second question is: any suggestion on what I can tune to improve the
performance? I've found unexpected delays in "reduce" that I can't explain,
and they may be related to the poor performance. Details are shown below.

= DETAILS =

Below is the CPU utilization plot with 2s batch interval and 40,000
tweets/s. The latency keeps increasing while the CPU, network, disk and
memory are all under utilized.

[image: CPU utilization plot]

I tried to find out which stage is the bottleneck. It seems that the
"reduce" phase for each batch can usually finish in less than 0.5s, but
sometimes (70 out of 545 batches) takes 5s. Below is a snapshot of the web
UI showing the time taken by "reduce" in some batches, where the normal
cases are marked in green and the abnormal case is marked in red:

[image: web UI snapshot of per-batch "reduce" times]

I further looked into all the tasks of a slow "reduce" stage. As shown by
the snapshot below, a small portion of the tasks are stragglers:

[image: task list of a slow "reduce" stage]

Here is the log of some slow "reduce" tasks on an executor, where the start
and end of the tasks are marked in red. They started at 21:55:43 and
completed at 21:55:48. During the 5s, I can only see shuffling at the
beginning and input block activity.

[image: executor log excerpt for the slow "reduce" tasks]

For comparison, here is the log of the normal "reduce" tasks on the same
executor:

[image: executor log excerpt for the normal "reduce" tasks]

Does anybody have any idea about this 5s delay?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-achieve-a-reasonable-performance-on-Spark-Streaming-tp7221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Classpath errors with Breeze

2014-06-08 Thread dlaw
I'm having some trouble getting a basic matrix multiply to work with Breeze.
I'm pretty sure it's related to my classpath. My setup is a cluster on AWS
with 8 m3.xlarges. To create the cluster I used the provided ec2 scripts and
Spark 1.0.0.

I've made a gist with the relevant pieces of my app:

https://gist.github.com/dieterichlawson/e5e3ab158a09429706e0

The app was created as detailed in the quick start guide.

When I run it I get an error that says the method to multiply a dense matrix
by a dense matrix does not exist:

14/06/09 04:49:09 WARN scheduler.TaskSetManager: Lost TID 90 (task 0.0:13)
14/06/09 04:49:09 INFO scheduler.TaskSetManager: Loss was due to
java.lang.NoSuchMethodError:
breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;
[duplicate 46]

I've tried a bunch of different things, including playing with the CLASSPATH
and ADD_JARS environment variables, the --jars option on spark-submit, the
version of breeze and scala, etc...

I've also tried it in the spark-shell. It works there, so I don't really
know what's going on. Any thoughts?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Classpath-errors-with-Breeze-tp7220.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Tobias Pfeiffer
Jeremy,

On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee
 wrote:
>> When you use match, the match must be exhaustive. That is, a match error
>> is thrown if the match fails.
>
> Ahh, right. That makes sense. Scala is applying its "strong typing" rules
> here instead of "no ceremony"... but isn't the idea that type errors should
> get picked up at compile time? I suppose the compiler can't tell there's not
> complete coverage, but it seems strange to throw that at runtime when it is
> literally the 'default case'.

You can use subclasses of "sealed traits" to get a compiler warning
for non-exhaustive matches:
http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait
I don't know if it can be applied for regular expression matching, though...
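For what it's worth, one way to combine the two is to wrap the regex match in
a small parse step that returns a sealed type (an illustrative sketch only,
reusing the regexes from the original code in this thread):

  sealed trait Command
  case class Plus(word: String) extends Command
  case class Minus(word: String) extends Command
  case object NotACommand extends Command

  val cmd_plus  = """[+]([\w]+)""".r
  val cmd_minus = """[-]([\w]+)""".r

  def parse(text: String): Command = text match {
    case cmd_plus(w)  => Plus(w)
    case cmd_minus(w) => Minus(w)
    case _            => NotACommand        // regex misses land here, no MatchError
  }

  // Leaving out a case below now produces a "match may not be exhaustive"
  // compiler warning instead of a runtime scala.MatchError.
  def handle(cmd: Command): Unit = cmd match {
    case Plus(w)      => println("add " + w)
    case Minus(w)     => println("remove " + w)
    case NotACommand  => ()
  }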

Tobias


Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Jeremy Lee
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath 
 wrote:

> When you use match, the match must be exhaustive. That is, a match error
> is thrown if the match fails.


Ahh, right. That makes sense. Scala is applying its "strong typing" rules
here instead of "no ceremony"... but isn't the idea that type errors should
get picked up at compile time? I suppose the compiler can't tell there's
not complete coverage, but it seems strange to throw that at runtime when
it is literally the 'default case'.

I think I need a good "Scala Programming Guide"... any suggestions? I've
read and watch the usual resources and videos, but it feels like a shotgun
approach and I've clearly missed a lot.

On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra 
wrote:
>
> And you probably want to push down that filter into the cluster --
> collecting all of the elements of an RDD only to not use or filter out some
> of them isn't an efficient usage of expensive (at least in terms of
> time/performance) network resources.  There may also be a good opportunity
> to use the partial function form of collect to push even more processing
> into the cluster.
>

I almost certainly do :-) And I am really looking forward to spending time
optimizing the code, but I keep getting caught up on deployment issues,
uberjars, missing /mnt/spark directories, only being able to submit from
the master, and being thoroughly confused about sample code from three
versions ago.

I'm even thinking of learning maven, if it means I never have to use sbt
again. Does it mean that?

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-08 Thread Tobias Pfeiffer
Gaurav,

I am not sure that the "*" expands to what you expect it to.
Normally, bash expands "*" to a space-separated list, not a
colon-separated one. Try specifying all the jars manually, maybe?

Tobias

On Thu, Jun 5, 2014 at 6:45 PM, Gaurav Dasgupta  wrote:
> Hi,
>
> I have written my own custom Spark streaming code which connects to Kafka
> server and fetch data. I have tested the code on local mode and it is
> working fine. But when I am executing the same code on YARN mode, I am
> getting KafkaReceiver class not found exception. I am providing the Spark
> Kafka jar in the classpath and ensured that the path is correct for all the
> nodes in my cluster.
>
> I am using Spark 0.9.1 hadoop pre-built and is deployed on all the nodes (10
> node cluster) in the YARN cluster.
> I am using the following command to run my code on YARN mode:
>
> SPARK_YARN_MODE=true
> SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
> SPARK_YARN_APP_JAR=/usr/local/SparkStreamExample.jar java -cp
> /usr/local/SparkStreamExample.jar:assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar:external/kafka/target/spark-streaming-kafka_2.10-0.9.1.jar:/usr/local/kafka/kafka_2.10-0.8.1.1/libs/*:/usr/lib/hbase/lib/*:/etc/hadoop/conf/:/etc/hbase/conf/
> SparkStreamExample yarn-client 10.10.5.32 myFirstGroup testTopic
> NewTestTable 1
>
> Below is the error message I am getting:
>
> 14/06/05 04:29:12 INFO cluster.YarnClientClusterScheduler: Adding task set
> 2.0 with 1 tasks
> 14/06/05 04:29:12 INFO scheduler.TaskSetManager: Starting task 2.0:0 as TID
> 70 on executor 2: manny6.musigma.com (PROCESS_LOCAL)
> 14/06/05 04:29:12 INFO scheduler.TaskSetManager: Serialized task 2.0:0 as
> 2971 bytes in 2 ms
> 14/06/05 04:29:12 WARN scheduler.TaskSetManager: Lost TID 70 (task 2.0:0)
> 14/06/05 04:29:12 WARN scheduler.TaskSetManager: Loss was due to
> java.lang.ClassNotFoundException
> java.lang.ClassNotFoundException:
> org.apache.spark.streaming.kafka.KafkaReceiver
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1574)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1495)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1731)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1666)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1322)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
> at
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:479)
> at
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:72)
> at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
> at
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:145)
> at
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1791)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(Spa

Re: Spark Worker Core Allocation

2014-06-08 Thread Subacini B
Thanks Sean, let me try to set spark.deploy.spreadOut  as  false.


On Sun, Jun 8, 2014 at 12:44 PM, Sean Owen  wrote:

> Have a look at:
>
> https://spark.apache.org/docs/1.0.0/job-scheduling.html
> https://spark.apache.org/docs/1.0.0/spark-standalone.html
>
> The default is to grab resource on all nodes. In your case you could set
> spark.cores.max to 2 or less to enable running two apps on a cluster of
> 4-core machines simultaneously.
>
> See also spark.deploy.defaultCores
>
> But you may really be after spark.deploy.spreadOut. if you make it false
> it will instead try to take all resource from a few nodes.
>  On Jun 8, 2014 1:55 AM, "Subacini B"  wrote:
>
>> Hi All,
>>
>> My cluster has 5 workers each having 4 cores (So total 20 cores).It is
>> in stand alone mode (not using Mesos or Yarn).I want two programs to run at
>> same time. So I have configured "spark.cores.max=3" , but when i run the
>> program it allocates three cores taking one core from each worker making 3
>> workers to run the program ,
>>
>> How to configure such that it takes 3 cores from 1 worker so that i can
>> use other workers for second program.
>>
>> Thanks in advance
>> Subacini
>>
>


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:

https://issues.apache.org/jira/browse/SPARK-2075

On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown  wrote:
>
> Hi, Patrick --
>
> Java 7 on the development machines:
>
> » java -version
> 1 ↵
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>
>
> And on the deployed boxes:
>
> $ java -version
> java version "1.7.0_55"
> OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
> OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
>
>
> Also, "unzip -l" in place of "jar tvf" gives the same results, so I don't
> think it's an issue with jar not reporting the files.  Also, the classes do
> get correctly packaged into the uberjar:
>
> unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs'
>  1519  06-08-14 12:05
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>  1560  06-08-14 12:05
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> Best.
> -- Paul
>
> —
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
> On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell  wrote:
>>
>> Paul,
>>
>> Could you give the version of Java that you are building with and the
>> version of Java you are running with? Are they the same?
>>
>> Just off the cuff, I wonder if this is related to:
>> https://issues.apache.org/jira/browse/SPARK-1520
>>
>> If it is, it could appear that certain functions are not in the jar
>> because they go beyond the extended zip boundary `jar tvf` won't list
>> them.
>>
>> - Patrick
>>
>> On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
>> > Moving over to the dev list, as this isn't a user-scope issue.
>> >
>> > I just ran into this issue with the missing saveAsTestFile, and here's a
>> > little additional information:
>> >
>> > - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
>> > - Driver built as an uberjar via Maven.
>> > - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
>> > Spark 1.0.0-hadoop1 downloaded from Apache.
>> >
>> > Given that it functions correctly in local mode but not in a standalone
>> > cluster, this suggests to me that the issue is in a difference between
>> > the
>> > Maven version and the hadoop1 version.
>> >
>> > In the spirit of taking the computer at its word, we can just have a
>> > look
>> > in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>> >
>> > jar tvf
>> >
>> > ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>> > | grep 'rdd/RDD' | grep 'saveAs'
>> >   1519 Mon May 26 13:57:58 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>> >   1560 Mon May 26 13:57:58 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>> >
>> >
>> > And here's what's in the hadoop1 distribution:
>> >
>> > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep
>> > 'saveAs'
>> >
>> >
>> > I.e., it's not there.  It is in the hadoop2 distribution:
>> >
>> > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep
>> > 'saveAs'
>> >   1519 Mon May 26 07:29:54 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>> >   1560 Mon May 26 07:29:54 PDT 2014
>> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>> >
>> >
>> > So something's clearly broken with the way that the distribution
>> > assemblies
>> > are created.
>> >
>> > FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2
>> > flavors
>> > of Spark to Maven Central would be as *entirely different* artifacts
>> > (spark-core-h1, spark-core-h2).
>> >
>> > Logged as SPARK-2075 .
>> >
>> > Cheers.
>> > -- Paul
>> >
>> >
>> >
>> > --
>> > p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>> >
>> >
>> > On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>> >
>> >> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
>> >> Im using google compute engine and cloud storage. but saveAsTextFile is
>> >> returning errors while saving in the cloud or saving local. When i
>> >> start a
>> >> job in the cluster i do get an error but after this error it keeps on
>> >> running fine untill the saveAsTextFile. ( I don't know if the two are
>> >> connected)
>> >>
>> >> ---Error at job startup---
>> >>  ERROR metrics.MetricsSystem: Sink class
>> >> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
>> >> java.lang.reflect.InvocationTargetException
>> >> at
>> >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> >> Method)
>> >> at
>> >>
>> >>
>> >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> >> at
>> >>
>> >>
>> >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> >> at
>> >> 

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Paul Brown
Hi, Patrick --

Java 7 on the development machines:

» java -version
   1 ↵
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


And on the deployed boxes:

$ java -version
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)


Also, "unzip -l" in place of "jar tvf" gives the same results, so I don't
think it's an issue with jar not reporting the files.  Also, the classes do
get correctly packaged into the uberjar:

unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs'
 1519  06-08-14 12:05
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
 1560  06-08-14 12:05
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
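For what it's worth, a quick runtime cross-check from a Spark shell on the
cluster might look like this (a sketch using plain reflection; it only
reports what the running classloader can actually see):

  // Where was RDD loaded from, and do the saveAsTextFile pieces resolve?
  val rddClass = Class.forName("org.apache.spark.rdd.RDD")
  println(rddClass.getProtectionDomain.getCodeSource.getLocation)
  println(rddClass.getDeclaredMethods.exists(_.getName == "saveAsTextFile"))
  println(
    try {
      Class.forName("org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1")
      "closure class found"
    } catch {
      case _: ClassNotFoundException => "closure class missing"
    })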


Best.
-- Paul

—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell  wrote:

> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off the cuff, I wonder if this is related to:
> https://issues.apache.org/jira/browse/SPARK-1520
>
> If it is, it could appear that certain functions are not in the jar
> because they go beyond the extended zip boundary `jar tvf` won't list
> them.
>
> - Patrick
>
> On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
> > Moving over to the dev list, as this isn't a user-scope issue.
> >
> > I just ran into this issue with the missing saveAsTestFile, and here's a
> > little additional information:
> >
> > - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
> > - Driver built as an uberjar via Maven.
> > - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
> > Spark 1.0.0-hadoop1 downloaded from Apache.
> >
> > Given that it functions correctly in local mode but not in a standalone
> > cluster, this suggests to me that the issue is in a difference between
> the
> > Maven version and the hadoop1 version.
> >
> > In the spirit of taking the computer at its word, we can just have a look
> > in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
> >
> > jar tvf
> >
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
> > | grep 'rdd/RDD' | grep 'saveAs'
> >   1519 Mon May 26 13:57:58 PDT 2014
> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
> >   1560 Mon May 26 13:57:58 PDT 2014
> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> >
> >
> > And here's what's in the hadoop1 distribution:
> >
> > jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep
> 'saveAs'
> >
> >
> > I.e., it's not there.  It is in the hadoop2 distribution:
> >
> > jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep
> 'saveAs'
> >   1519 Mon May 26 07:29:54 PDT 2014
> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
> >   1560 Mon May 26 07:29:54 PDT 2014
> > org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> >
> >
> > So something's clearly broken with the way that the distribution
> assemblies
> > are created.
> >
> > FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
> > of Spark to Maven Central would be as *entirely different* artifacts
> > (spark-core-h1, spark-core-h2).
> >
> > Logged as SPARK-2075 .
> >
> > Cheers.
> > -- Paul
> >
> >
> >
> > --
> > p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
> >
> >
> > On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
> >
> >> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
> >> Im using google compute engine and cloud storage. but saveAsTextFile is
> >> returning errors while saving in the cloud or saving local. When i
> start a
> >> job in the cluster i do get an error but after this error it keeps on
> >> running fine untill the saveAsTextFile. ( I don't know if the two are
> >> connected)
> >>
> >> ---Error at job startup---
> >>  ERROR metrics.MetricsSystem: Sink class
> >> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
> >> java.lang.reflect.InvocationTargetException
> >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> Method)
> >> at
> >>
> >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> >> at
> >>
> >>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> >> at
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> >> at
> >>
> >>
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
> >> at
> >>
> >>
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
> >> at
> >>
> scala.collectio

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Sean Owen
I suspect Patrick is right about the cause. The Maven artifact that
was released does contain this class (phew)

http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.0.0%7Cjar

As to the hadoop1 / hadoop2 artifact question -- agree that is often
done. Here the working theory seems to be to depend on the one
artifact (whose API should be identical regardless of dependencies)
and then customize the hadoop-client dep. Here, there are not two
versions deployed to Maven at all.


On Sun, Jun 8, 2014 at 4:02 PM, Patrick Wendell  wrote:
> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off the cuff, I wonder if this is related to:
> https://issues.apache.org/jira/browse/SPARK-1520
>
> If it is, it could appear that certain functions are not in the jar
> because they go beyond the extended zip boundary `jar tvf` won't list
> them.


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down!

On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell  wrote:
> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off the cuff, I wonder if this is related to:
> https://issues.apache.org/jira/browse/SPARK-1520
>
> If it is, it could appear that certain functions are not in the jar
> because they go beyond the extended zip boundary `jar tvf` won't list
> them.
>
> - Patrick
>
> On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
>> Moving over to the dev list, as this isn't a user-scope issue.
>>
>> I just ran into this issue with the missing saveAsTestFile, and here's a
>> little additional information:
>>
>> - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
>> - Driver built as an uberjar via Maven.
>> - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
>> Spark 1.0.0-hadoop1 downloaded from Apache.
>>
>> Given that it functions correctly in local mode but not in a standalone
>> cluster, this suggests to me that the issue is in a difference between the
>> Maven version and the hadoop1 version.
>>
>> In the spirit of taking the computer at its word, we can just have a look
>> in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>>
>> jar tvf
>> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>> | grep 'rdd/RDD' | grep 'saveAs'
>>   1519 Mon May 26 13:57:58 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>>   1560 Mon May 26 13:57:58 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>>
>>
>> And here's what's in the hadoop1 distribution:
>>
>> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
>>
>>
>> I.e., it's not there.  It is in the hadoop2 distribution:
>>
>> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>>   1519 Mon May 26 07:29:54 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>>   1560 Mon May 26 07:29:54 PDT 2014
>> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>>
>>
>> So something's clearly broken with the way that the distribution assemblies
>> are created.
>>
>> FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
>> of Spark to Maven Central would be as *entirely different* artifacts
>> (spark-core-h1, spark-core-h2).
>>
>> Logged as SPARK-2075 .
>>
>> Cheers.
>> -- Paul
>>
>>
>>
>> --
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>>
>>> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
>>> Im using google compute engine and cloud storage. but saveAsTextFile is
>>> returning errors while saving in the cloud or saving local. When i start a
>>> job in the cluster i do get an error but after this error it keeps on
>>> running fine untill the saveAsTextFile. ( I don't know if the two are
>>> connected)
>>>
>>> ---Error at job startup---
>>>  ERROR metrics.MetricsSystem: Sink class
>>> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
>>> java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at
>>>
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>>
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
>>> at
>>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>> at
>>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>> at
>>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
>>> at
>>> org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84)
>>> at
>>>
>>> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> at Hello$.main(Hello.scala:101)
>>> at Hello.main(Hello.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Patrick Wendell
Paul,

Could you give the version of Java that you are building with and the
version of Java you are running with? Are they the same?

Just off the cuff, I wonder if this is related to:
https://issues.apache.org/jira/browse/SPARK-1520

If it is, it could appear that certain functions are not in the jar
because they go beyond the extended zip boundary, and `jar tvf` won't list
them.

- Patrick

On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown  wrote:
> Moving over to the dev list, as this isn't a user-scope issue.
>
> I just ran into this issue with the missing saveAsTestFile, and here's a
> little additional information:
>
> - Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
> - Driver built as an uberjar via Maven.
> - Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
> Spark 1.0.0-hadoop1 downloaded from Apache.
>
> Given that it functions correctly in local mode but not in a standalone
> cluster, this suggests to me that the issue is in a difference between the
> Maven version and the hadoop1 version.
>
> In the spirit of taking the computer at its word, we can just have a look
> in the JAR files.  Here's what's in the Maven dep as of 1.0.0:
>
> jar tvf
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
> | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> And here's what's in the hadoop1 distribution:
>
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
>
>
> I.e., it's not there.  It is in the hadoop2 distribution:
>
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014
> org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
>
>
> So something's clearly broken with the way that the distribution assemblies
> are created.
>
> FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
> of Spark to Maven Central would be as *entirely different* artifacts
> (spark-core-h1, spark-core-h2).
>
> Logged as SPARK-2075 .
>
> Cheers.
> -- Paul
>
>
>
> --
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
> On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:
>
>> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
>> Im using google compute engine and cloud storage. but saveAsTextFile is
>> returning errors while saving in the cloud or saving local. When i start a
>> job in the cluster i do get an error but after this error it keeps on
>> running fine untill the saveAsTextFile. ( I don't know if the two are
>> connected)
>>
>> ---Error at job startup---
>>  ERROR metrics.MetricsSystem: Sink class
>> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
>> java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at
>>
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
>> at
>> org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84)
>> at
>>
>> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>> at Hello$.main(Hello.scala:101)
>> at Hello.main(Hello.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Paul Brown
Moving over to the dev list, as this isn't a user-scope issue.

I just ran into this issue with the missing saveAsTextFile, and here's a
little additional information:

- Code ported from 0.9.1 up to 1.0.0; works with local[n] in both cases.
- Driver built as an uberjar via Maven.
- Deployed to smallish EC2 cluster in standalone mode (S3 storage) with
Spark 1.0.0-hadoop1 downloaded from Apache.

Given that it functions correctly in local mode but not in a standalone
cluster, this suggests to me that the issue is in a difference between the
Maven version and the hadoop1 version.

In the spirit of taking the computer at its word, we can just have a look
in the JAR files.  Here's what's in the Maven dep as of 1.0.0:

jar tvf
~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
| grep 'rdd/RDD' | grep 'saveAs'
  1519 Mon May 26 13:57:58 PDT 2014
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
  1560 Mon May 26 13:57:58 PDT 2014
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


And here's what's in the hadoop1 distribution:

jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'


I.e., it's not there.  It is in the hadoop2 distribution:

jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
  1519 Mon May 26 07:29:54 PDT 2014
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
  1560 Mon May 26 07:29:54 PDT 2014
org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class


So something's clearly broken with the way that the distribution assemblies
are created.

FWIW and IMHO, the "right" way to publish the hadoop1 and hadoop2 flavors
of Spark to Maven Central would be as *entirely different* artifacts
(spark-core-h1, spark-core-h2).

Logged as SPARK-2075 .

Cheers.
-- Paul



—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Fri, Jun 6, 2014 at 2:45 AM, HenriV  wrote:

> I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0.
> Im using google compute engine and cloud storage. but saveAsTextFile is
> returning errors while saving in the cloud or saving local. When i start a
> job in the cluster i do get an error but after this error it keeps on
> running fine untill the saveAsTextFile. ( I don't know if the two are
> connected)
>
> ---Error at job startup---
>  ERROR metrics.MetricsSystem: Sink class
> org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
>
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
> at
>
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at
>
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
> at
> org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:84)
> at
>
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
> at org.apache.spark.SparkContext.(SparkContext.scala:202)
> at Hello$.main(Hello.scala:101)
> at Hello.main(Hello.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at sbt.Run.invokeMain(Run.scala:72)
> at sbt.Run.run0(Run.scala:65)
> at sbt.Run.sbt$Run$$execute$1(Run.scala:54)
> at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:58)
> at sbt.Run$$anonfun$run$1.apply(Run.scala:58)
> at sbt.Run$$anonfun$run$1.apply(Run.scala:58)
> at sbt.Logger$$anon$4.apply(Logger.scala:90)
> at sbt.TrapExit$App.run(TrapExit.scala:244)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.NoSuchMethodError:
> com.fasterxml.jackson.core.JsonFactory.requiresPropertyOrdering()Z
> at

Re: Spark Worker Core Allocation

2014-06-08 Thread Sean Owen
Have a look at:

https://spark.apache.org/docs/1.0.0/job-scheduling.html
https://spark.apache.org/docs/1.0.0/spark-standalone.html

The default is to grab resource on all nodes. In your case you could set
spark.cores.max to 2 or less to enable running two apps on a cluster of
4-core machines simultaneously.

See also spark.deploy.defaultCores

But you may really be after spark.deploy.spreadOut. If you make it false, it
will instead try to take all resources from a few nodes.
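A sketch of the application-side setting (names are illustrative;
spark.deploy.spreadOut itself is a master-side property, typically set in the
standalone master's configuration rather than per application):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("first-program")                 // illustrative
    .setMaster("spark://master-host:7077")       // illustrative
    .set("spark.cores.max", "3")                 // cap this app at 3 cores total
  // Master side: spark.deploy.spreadOut=false packs those cores onto as few
  // workers as possible instead of one core per worker.
  val sc = new SparkContext(conf)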
 On Jun 8, 2014 1:55 AM, "Subacini B"  wrote:

> Hi All,
>
> My cluster has 5 workers each having 4 cores (So total 20 cores).It is  in
> stand alone mode (not using Mesos or Yarn).I want two programs to run at
> same time. So I have configured "spark.cores.max=3" , but when i run the
> program it allocates three cores taking one core from each worker making 3
> workers to run the program ,
>
> How to configure such that it takes 3 cores from 1 worker so that i can
> use other workers for second program.
>
> Thanks in advance
> Subacini
>


Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Nicholas Chammas
In PySpark you can also do help(my_rdd) and get a nice help page of methods
available.

On Sunday, June 8, 2014, Carter wrote:

> Thank you very much Gerard.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Spark Streaming union expected behaviour?

2014-06-08 Thread Shrikar archak
Hi All,

I was writing a simple streaming job to get a better understanding of Spark
Streaming.
I don't understand the union behaviour in this particular case:

*WORKS:*
val lines = ssc.socketTextStream("localhost", ,
StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
wordCounts.saveAsTextFiles("all")

This works as expected, and the streams are stored as files.


*DOESN'T WORK*
val lines = ssc.socketTextStream("localhost", ,
StorageLevel.MEMORY_AND_DISK_SER)
val lines1 = ssc.socketTextStream("localhost", 1,
StorageLevel.MEMORY_AND_DISK_SER)
   * val words = lines.union(lines1).flatMap(_.split(" "))*


val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
wordCounts.saveAsTextFiles("all")

In the above case neither the messages are printed nor the files are saved.
Am I doing something wrong here?

Thanks,
Shrikar


Re: Spark Worker Core Allocation

2014-06-08 Thread Subacini B
Hi,

I am stuck here; my cluster is not efficiently utilized. I would appreciate
any input on this.

Thanks
Subacini


On Sat, Jun 7, 2014 at 10:54 PM, Subacini B  wrote:

> Hi All,
>
> My cluster has 5 workers each having 4 cores (So total 20 cores).It is  in
> stand alone mode (not using Mesos or Yarn).I want two programs to run at
> same time. So I have configured "spark.cores.max=3" , but when i run the
> program it allocates three cores taking one core from each worker making 3
> workers to run the program ,
>
> How to configure such that it takes 3 cores from 1 worker so that i can
> use other workers for second program.
>
> Thanks in advance
> Subacini
>


Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Mark Hamstra
>
> The solution is either to add a default case which does nothing, or
> probably better to add a .filter such that you filter out anything that's
> not a command before matching.
>

And you probably want to push down that filter into the cluster --
collecting all of the elements of an RDD only to not use or filter out some
of them isn't an efficient usage of expensive (at least in terms of
time/performance) network resources.  There may also be a good opportunity
to use the partial function form of collect to push even more processing
into the cluster.
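A sketch of what that could look like for the code under discussion (the
regexes and the (userId, text) / superusers pairing are taken from the
original snippet; the value type of superusers is assumed):

  import org.apache.spark.SparkContext._   // pair-RDD ops (Spark 1.0)

  val cmd_plus  = """[+]([\w]+)""".r
  val cmd_minus = """[-]([\w]+)""".r

  stream.map(status => (status.getUser().getId(), status.getText()))
    .foreachRDD { rdd =>
      // RDD.collect(PartialFunction) filters and maps on the executors,
      // so only matched commands ever come back to the driver.
      val commands = rdd.join(superusers)
        .map { case (_, (text, _)) => text }
        .collect {
          case cmd_plus(w)  => ("+", w)
          case cmd_minus(w) => ("-", w)
        }
      commands.collect().foreach { case (op, w) => println(op + " " + w) }
    }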



On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath 
wrote:

> When you use match, the match must be exhaustive. That is, a match error
> is thrown if the match fails.
>
> That's why you usually handle the default case using "case _ => ..."
>
> Here it looks like your taking the text of all statuses - which means not
> all of them will be commands... Which means your match will not be
> exhaustive.
>
> The solution is either to add a default case which does nothing, or
> probably better to add a .filter such that you filter out anything that's
> not a command before matching.
>
> Just looking at it again it could also be that you take x => x._2._1 ...
> What type is that? Should it not be a Seq if you're joining, in which case
> the match will also fail...
>
> Hope this helps.
> —
> Sent from Mailbox 
>
>
> On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee  > wrote:
>
>>
>> I shut down my first (working) cluster and brought up a fresh one... and
>> It's been a bit of a horror and I need to sleep now. Should I be worried
>> about these errors? Or did I just have the old log4j.config tuned so I
>> didn't see them?
>>
>> I
>>
>>  14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job
>> streaming job 1402245172000 ms.2
>> scala.MatchError: 0101-01-10 (of class java.lang.String)
>>  at
>> SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
>> at
>> SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
>> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
>> at
>> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>> at scala.util.Try$.apply(Try.scala:161)
>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>>
>> The error comes from this code, which seemed like a sensible way to match
>> things:
>> (The "case cmd_plus(w)" statement is generating the error,)
>>
>>   val cmd_plus = """[+]([\w]+)""".r
>>  val cmd_minus = """[-]([\w]+)""".r
>>  // find command user tweets
>>  val commands = stream.map(
>>  status => ( status.getUser().getId(), status.getText() )
>>  ).foreachRDD(rdd => {
>>  rdd.join(superusers).map(
>>  x => x._2._1
>>  ).collect().foreach{ cmd => {
>>  218:  cmd match {
>>  case cmd_plus(w) => {
>>  ...
>> } case cmd_minus(w) => { ... } } }} })
>>
>>  It seems a bit excessive for scala to throw exceptions because a regex
>> didn't match. Something feels wrong.
>>
>
>


Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Nick Pentreath
When you use match, the match must be exhaustive. That is, a match error is 
thrown if the match fails. 


That's why you usually handle the default case using "case _ => ..."




Here it looks like you're taking the text of all statuses - which means not all
of them will be commands... which means your match will not be exhaustive.




The solution is either to add a default case which does nothing, or probably 
better to add a .filter such that you filter out anything that's not a command 
before matching.
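
A minimal sketch of the filter-before-matching approach (reusing the
cmd_plus/cmd_minus regexes and the rdd.join(superusers) pipeline from the
original post; the case bodies are placeholders):

    rdd.join(superusers)
      .map(x => x._2._1)                    // the tweet text
      .filter(_.matches("""[+-]\w+"""))     // keep only strings that look like commands
      .collect()
      .foreach {
        case cmd_plus(w)  => println(s"add $w")
        case cmd_minus(w) => println(s"remove $w")
        case _            => // defensive default: does nothing
      }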




Just looking at it again it could also be that you take x => x._2._1 ... What 
type is that? Should it not be a Seq if you're joining, in which case the match 
will also fail...




Hope this helps.
—
Sent from Mailbox

On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee 
wrote:

> I shut down my first (working) cluster and brought up a fresh one... and
> It's been a bit of a horror and I need to sleep now. Should I be worried
> about these errors? Or did I just have the old log4j.config tuned so I
> didn't see them?
> I
> 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
> job 1402245172000 ms.2
> scala.MatchError: 0101-01-10 (of class java.lang.String)
> at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
> at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> The error comes from this code, which seemed like a sensible way to match
> things:
> (The "case cmd_plus(w)" statement is generating the error,)
> val cmd_plus = """[+]([\w]+)""".r
> val cmd_minus = """[-]([\w]+)""".r
> // find command user tweets
> val commands = stream.map(
> status => ( status.getUser().getId(), status.getText() )
> ).foreachRDD(rdd => {
> rdd.join(superusers).map(
> x => x._2._1
> ).collect().foreach{ cmd => {
> 218:  cmd match {
> case cmd_plus(w) => {
> ...
> } case cmd_minus(w) => { ... } } }} })
> It seems a bit excessive for scala to throw exceptions because a regex
> didn't match. Something feels wrong.

Re: Are "scala.MatchError" messages a problem?

2014-06-08 Thread Sean Owen
A match clause needs to cover all the possibilities, and not matching
any regex is a distinct possibility. It's not really like 'switch'
because it requires exhaustiveness, and I think that has benefits, like
being able to interpret a match as an expression with a type. I think
it's all in order, but it's more of a Scala thing than a Spark thing.

You just need a "case _ => ..." to cover anything else.

(You can avoid two extra levels of scope with .foreach(_ match { ... }) BTW)
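
A hedged sketch of both suggestions applied to the code from the original post
(the regexes and the surrounding pipeline are the poster's; the case bodies are
placeholders):

    rdd.join(superusers)
      .map(x => x._2._1)
      .collect()
      .foreach(_ match {                 // no extra { cmd => { ... } } nesting
        case cmd_plus(w)  => println(s"add $w")
        case cmd_minus(w) => println(s"remove $w")
        case _            => // anything that isn't a command: do nothing
      })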

On Sun, Jun 8, 2014 at 12:44 PM, Jeremy Lee
 wrote:
>
> I shut down my first (working) cluster and brought up a fresh one... and
> It's been a bit of a horror and I need to sleep now. Should I be worried
> about these errors? Or did I just have the old log4j.config tuned so I
> didn't see them?
>
> I
>
> 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
> job 1402245172000 ms.2
> scala.MatchError: 0101-01-10 (of class java.lang.String)
> at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
> at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
> at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
>
> The error comes from this code, which seemed like a sensible way to match
> things:
> (The "case cmd_plus(w)" statement is generating the error,)
>
> val cmd_plus = """[+]([\w]+)""".r
> val cmd_minus = """[-]([\w]+)""".r
> // find command user tweets
> val commands = stream.map(
> status => ( status.getUser().getId(), status.getText() )
> ).foreachRDD(rdd => {
> rdd.join(superusers).map(
> x => x._2._1
> ).collect().foreach{ cmd => {
> 218: cmd match {
> case cmd_plus(w) => {
> ...
> } case cmd_minus(w) => { ... } } }} })
>
> It seems a bit excessive for scala to throw exceptions because a regex
> didn't match. Something feels wrong.


Are "scala.MatchError" messages a problem?

2014-06-08 Thread Jeremy Lee
I shut down my first (working) cluster and brought up a fresh one... and
It's been a bit of a horror and I need to sleep now. Should I be worried
about these errors? Or did I just have the old log4j.config tuned so I
didn't see them?

I

14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
job 1402245172000 ms.2
scala.MatchError: 0101-01-10 (of class java.lang.String)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


The error comes from this code, which seemed like a sensible way to match
things:
(The "case cmd_plus(w)" statement is generating the error,)

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r
// find command user tweets
val commands = stream.map(
status => ( status.getUser().getId(), status.getText() )
).foreachRDD(rdd => {
rdd.join(superusers).map(
x => x._2._1
).collect().foreach{ cmd => {
218:  cmd match {
case cmd_plus(w) => {
...
} case cmd_minus(w) => { ... } } }} })

It seems a bit excessive for scala to throw exceptions because a regex
didn't match. Something feels wrong.


Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Wei Tan
This will make the compilation pass, but you may not be able to run it
correctly.

I used Maven, adding these two jars (I use Hadoop 1); Maven pulled in their
dependent jars (a lot of them) for me.


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.2.1</version>
</dependency>




Best regards,
Wei 

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Krishna Sankar 
To: user@spark.apache.org, 
Date:   06/08/2014 11:19 AM
Subject:Re: How to compile a Spark project in Scala IDE for 
Eclipse?



Project->Properties->Java Build Path->Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers



On Sun, Jun 8, 2014 at 8:06 AM, Carter  wrote:
Hi All,

I just downloaded the Scala IDE for Eclipse. After I created a Spark 
project
and clicked "Run" there was an error on this line of code "import
org.apache.spark.SparkContext": "object apache is not a member of package
org". I guess I need to import the Spark dependency into Scala IDE for
Eclipse, can anyone tell me how to do it? Thanks a lot.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Krishna Sankar
Project->Properties->Java Build Path->Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers
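
Once the assembly jar is on the build path, a minimal smoke test like this
(hypothetical object name, local master) is a quick way to confirm that
org.apache.spark.SparkContext resolves and runs:

    import org.apache.spark.{SparkConf, SparkContext}

    object BuildCheck {
      def main(args: Array[String]): Unit = {
        // If this compiles and prints 10, the Spark classes are on the classpath.
        val sc = new SparkContext(new SparkConf().setAppName("BuildCheck").setMaster("local"))
        println(sc.parallelize(1 to 10).count())
        sc.stop()
      }
    }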



On Sun, Jun 8, 2014 at 8:06 AM, Carter  wrote:

> Hi All,
>
> I just downloaded the Scala IDE for Eclipse. After I created a Spark
> project
> and clicked "Run" there was an error on this line of code "import
> org.apache.spark.SparkContext": "object apache is not a member of package
> org". I guess I need to import the Spark dependency into Scala IDE for
> Eclipse, can anyone tell me how to do it? Thanks a lot.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Hi All,

I just downloaded the Scala IDE for Eclipse. After I created a Spark project
and clicked "Run" there was an error on this line of code "import
org.apache.spark.SparkContext": "object apache is not a member of package
org". I guess I need to import the Spark dependency into Scala IDE for
Eclipse, can anyone tell me how to do it? Thanks a lot.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Best practise for 'Streaming' dumps?

2014-06-08 Thread Gino Bustelo
Yeah... I have not tried it, but if you set the slide duration equal to the
window duration, that should prevent overlaps.

Gino B.
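
A hedged sketch of a non-overlapping dump along those lines (the DStream name,
window length, and output path are placeholders; both durations have to be
multiples of the stream's batch interval):

    import org.apache.spark.streaming.Minutes

    // With the slide interval equal to the window length, the windows don't
    // overlap: each element lands in exactly one "superbatch", written out once.
    tweets.window(Minutes(5), Minutes(5))
      .foreachRDD { (rdd, time) =>
        rdd.saveAsTextFile(s"hdfs:///archive/tweets-${time.milliseconds}")
      }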

> On Jun 8, 2014, at 8:25 AM, Jeremy Lee  wrote:
> 
> I read it more carefully, and window() might actually work for some other 
> stuff like logs. (assuming I can have multiple windows with entirely 
> different attributes on a single stream..) 
> 
> Thanks for that!
> 
> 
>> On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee  
>> wrote:
>> Yes.. but from what I understand that's a "sliding window" so for a window 
>> of (60) over (1) second DStreams, that would save the entire last minute of 
>> data once per second. That's more than I need.
>> 
>> I think what I'm after is probably updateStateByKey... I want to mutate data 
>> structures (probably even graphs) as the stream comes in, but I also want 
>> that state to be persistent across restarts of the application, (Or parallel 
>> version of the app, if possible) So I'd have to save that structure 
>> occasionally and reload it as the "primer" on the next run.
>> 
>> I was almost going to use HBase or Hive, but they seem to have been 
>> deprecated in 1.0.0? Or just late to the party?
>> 
>> Also, I've been having trouble deleting hadoop directories.. the old "two 
>> line" examples don't seem to work anymore. I actually managed to fill up the 
>> worker instances (I gave them tiny EBS) and I think I crashed them.
>> 
>> 
>> 
>>> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:
>>> Have you thought of using window?
>>> 
>>> Gino B.
>>> 
>>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee  
>>> > wrote:
>>> >
>>> >
>>> > It's going well enough that this is a "how should I in 1.0.0" rather than 
>>> > "how do i" question.
>>> >
>>> > So I've got data coming in via Streaming (twitters) and I want to 
>>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file 
>>> > for each DStream, but also I want to guard against data loss from crashes,
>>> >
>>> > I suppose what I want is to let things build up into "superbatches" over 
>>> > a few minutes, and then serialize those to parquet files, or similar? Or 
>>> > do i?
>>> >
>>> > Do I count-down the number of DStreams, or does Spark have a preferred 
>>> > way of scheduling cron events?
>>> >
>>> > What's the best practise for keeping persistent data for a streaming app? 
>>> > (Across restarts) And to clean up on termination?
>>> >
>>> >
>>> > --
>>> > Jeremy Lee  BCompSci(Hons)
>>> >   The Unorthodox Engineers
>> 
>> 
>> 
>> -- 
>> Jeremy Lee  BCompSci(Hons)
>>   The Unorthodox Engineers
> 
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Re: Best practise for 'Streaming' dumps?

2014-06-08 Thread Jeremy Lee
I read it more carefully, and window() might actually work for some other
stuff like logs. (assuming I can have multiple windows with entirely
different attributes on a single stream..)

Thanks for that!


On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee 
wrote:

> Yes.. but from what I understand that's a "sliding window" so for a window
> of (60) over (1) second DStreams, that would save the entire last minute of
> data once per second. That's more than I need.
>
> I think what I'm after is probably updateStateByKey... I want to mutate
> data structures (probably even graphs) as the stream comes in, but I also
> want that state to be persistent across restarts of the application, (Or
> parallel version of the app, if possible) So I'd have to save that
> structure occasionally and reload it as the "primer" on the next run.
>
> I was almost going to use HBase or Hive, but they seem to have been
> deprecated in 1.0.0? Or just late to the party?
>
> Also, I've been having trouble deleting hadoop directories.. the old "two
> line" examples don't seem to work anymore. I actually managed to fill up
> the worker instances (I gave them tiny EBS) and I think I crashed them.
>
>
>
> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:
>
>> Have you thought of using window?
>>
>> Gino B.
>>
>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee 
>> wrote:
>> >
>> >
>> > It's going well enough that this is a "how should I in 1.0.0" rather
>> than "how do i" question.
>> >
>> > So I've got data coming in via Streaming (twitters) and I want to
>> archive/log it all. It seems a bit wasteful to generate a new HDFS file for
>> each DStream, but also I want to guard against data loss from crashes,
>> >
>> > I suppose what I want is to let things build up into "superbatches"
>> over a few minutes, and then serialize those to parquet files, or similar?
>> Or do i?
>> >
>> > Do I count-down the number of DStreams, or does Spark have a preferred
>> way of scheduling cron events?
>> >
>> > What's the best practise for keeping persistent data for a streaming
>> app? (Across restarts) And to clean up on termination?
>> >
>> >
>> > --
>> > Jeremy Lee  BCompSci(Hons)
>> >   The Unorthodox Engineers
>>
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Best practise for 'Streaming' dumps?

2014-06-08 Thread Jeremy Lee
Yes.. but from what I understand that's a "sliding window" so for a window
of (60) over (1) second DStreams, that would save the entire last minute of
data once per second. That's more than I need.

I think what I'm after is probably updateStateByKey... I want to mutate
data structures (probably even graphs) as the stream comes in, but I also
want that state to be persistent across restarts of the application (or a
parallel version of the app, if possible). So I'd have to save that
structure occasionally and reload it as the "primer" on the next run.
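
A minimal sketch of that pattern (assuming a (userId, text) pair DStream named
pairs and a placeholder checkpoint path; updateStateByKey requires checkpointing
to be enabled, which is also what makes the state recoverable):

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations
    import org.apache.spark.streaming.Minutes

    ssc.checkpoint("hdfs:///checkpoints/myapp")   // placeholder path

    // Keep a running value per key across batches; here just a count.
    val state = pairs.updateStateByKey[Long] { (newValues: Seq[String], prev: Option[Long]) =>
      Some(prev.getOrElse(0L) + newValues.size)
    }
    state.checkpoint(Minutes(10))   // how often the state RDDs themselves are checkpointed

If I recall correctly, StreamingContext.getOrCreate is the piece that rebuilds
the context from that checkpoint directory after a restart, rather than
constructing a fresh one.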

I was almost going to use HBase or Hive, but they seem to have been
deprecated in 1.0.0? Or just late to the party?

Also, I've been having trouble deleting Hadoop directories... the old "two
line" examples don't seem to work anymore. I actually managed to fill up
the worker instances (I gave them tiny EBS volumes) and I think I crashed them.



On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:

> Have you thought of using window?
>
> Gino B.
>
> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee 
> wrote:
> >
> >
> > It's going well enough that this is a "how should I in 1.0.0" rather
> than "how do i" question.
> >
> > So I've got data coming in via Streaming (twitters) and I want to
> archive/log it all. It seems a bit wasteful to generate a new HDFS file for
> each DStream, but also I want to guard against data loss from crashes,
> >
> > I suppose what I want is to let things build up into "superbatches" over
> a few minutes, and then serialize those to parquet files, or similar? Or do
> i?
> >
> > Do I count-down the number of DStreams, or does Spark have a preferred
> way of scheduling cron events?
> >
> > What's the best practise for keeping persistent data for a streaming
> app? (Across restarts) And to clean up on termination?
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Thank you very much Gerard.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Gerard Maas
You can consult the docs at :
https://spark.apache.org/docs/latest/api/scala/index.html#package

In particular, the RDD docs contain the explanation of each method:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

Kr, Gerard
On Jun 8, 2014 1:00 PM, "Carter"  wrote:

> Hi All,
>
> I am new to Spark.
>
> In the Spark shell, how can I get the help or explanation for those
> functions that I can use for a variable or RDD? For example, after I input
> an
> RDD's name with a dot (.) at the end, if I press the Tab key, a list of
> functions that I can use for this RDD will be displayed, but I don't know
> how
> to use these functions.
>
> Your help is greatly appreciated.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Hi All,

I am new to Spark. 

In the Spark shell, how can I get the help or explanation for those
functions that I can use for a variable or RDD? For example, after I input an
RDD's name with a dot (.) at the end, if I press the Tab key, a list of
functions that I can use for this RDD will be displayed, but I don't know how
to use these functions.

Your help is greatly appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Gradient Descent with MLBase

2014-06-08 Thread Aslan Bekirov
Hi DB,

Thanks a lot.
Appreciated.

BR,
Aslan


On Sun, Jun 8, 2014 at 2:52 AM, DB Tsai  wrote:

> Hi Aslan,
>
> You can check out the unittest code of GradientDescent.runMiniBatchSGD
>
>
> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sat, Jun 7, 2014 at 6:24 AM, Aslan Bekirov 
> wrote:
>
>> Hi All,
>>
>> I have to create a model using SGD in MLBase. I examined MLBase a bit and
>> ran some samples of classification, collaborative filtering, etc., but I
>> could not run gradient descent. I have to run
>>
>>  "val model =  GradientDescent.runMiniBatchSGD(params)"
>>
>> of course, the params must be computed first. I tried but could not manage
>> to give the parameters correctly.
>>
>> Can anyone explain parameters a bit and give an example of code?
>>
>> BR,
>> Aslan
>>
>>
>
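
For reference, a hedged sketch of what a call can look like (the parameter
values are arbitrary, and the signature is the Spark 1.0 MLlib one as I recall
it, so it is worth double-checking against the GradientDescentSuite linked
above; sc is an existing SparkContext):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}

    // data: RDD[(label, features)] -- labels are 0.0/1.0 for the logistic gradient
    val data = sc.parallelize(Seq(
      (1.0, Vectors.dense(1.0, 2.0)),
      (0.0, Vectors.dense(-1.0, -2.0))
    ))
    val initialWeights = Vectors.dense(0.0, 0.0)

    val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
      data,
      new LogisticGradient(),   // how the gradient of the loss is computed
      new SquaredL2Updater(),   // how the weights are updated / regularized
      1.0,                      // stepSize
      100,                      // numIterations
      0.1,                      // regParam
      1.0,                      // miniBatchFraction (1.0 = use the full batch each iteration)
      initialWeights
    )
    // weights is the fitted weight vector; lossHistory is the stochastic loss per iteration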