Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
Hello

I am trying to reduce the number of Java threads (about 80 on my system) to as 
few as possible.
Which settings can be changed in spark-1.1.0/conf/spark-env.sh (or elsewhere) to 
achieve this?
I am also using Hadoop for storing data on HDFS.

Thank you,
Wanda

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
I am trying to get a software trace, and I need the number of active threads to 
be as low as possible in order to inspect the active part of the workload.



 From: Prashant Sharma scrapco...@gmail.com
To: Wanda Hawk wanda_haw...@yahoo.com 
Cc: user@spark.apache.org user@spark.apache.org 
Sent: Tuesday, October 28, 2014 11:17 AM
Subject: Re: Spark SQL reduce number of java threads
 


What is the motivation behind this ? 

You can start with the master set to local[NO_OF_THREADS]. Reducing the threads at all 
other places can have unexpected results. Take a look at this: 
http://spark.apache.org/docs/latest/configuration.html
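As an illustration of that suggestion (not from the original thread; the object and 
app names below are only placeholders), a minimal driver that keeps Spark's own 
task execution on a single thread by using a local[1] master:

import org.apache.spark.{SparkConf, SparkContext}

object SingleThreadApp {
  def main(args: Array[String]): Unit = {
    // local[1] gives Spark one task-execution thread; JVM service threads
    // (GC, JIT, networking) will still exist on top of this.
    val conf = new SparkConf().setAppName("SingleThreadApp").setMaster("local[1]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}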



Prashant Sharma





On Tue, Oct 28, 2014 at 2:08 PM, Wanda Hawk wanda_haw...@yahoo.com.invalid 
wrote:

Hello


I am trying to reduce the number of Java threads (about 80 on my system) to as 
few as possible.
Which settings can be changed in spark-1.1.0/conf/spark-env.sh (or elsewhere) to 
achieve this?
I am also using Hadoop for storing data on HDFS.


Thank you,
Wanda

Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread Wanda Hawk
Is this what are you looking for ?

In Shark, the default reducer number is 1 and is controlled by the property 
mapred.reduce.tasks. Spark SQL deprecates this property in favor of 
spark.sql.shuffle.partitions, whose default value is 200. Users may customize 
this property via SET:

SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c
FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
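A minimal sketch of doing the same thing programmatically in Spark 1.1 (assuming 
an existing SparkContext sc, and that a table such as logs_last_month_cached is 
already registered and cached as in the quoted example):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.sql("SET spark.sql.shuffle.partitions=10")   // applies to subsequent shuffles
val top = sqlContext.sql(
  "SELECT page, count(*) c FROM logs_last_month_cached GROUP BY page ORDER BY c DESC LIMIT 10")
top.collect().foreach(println)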

(Source: Spark SQL Programming Guide - Spark 1.1.0 Documentation, spark.apache.org)


 From: shahab shahab.mok...@gmail.com
To: user@spark.apache.org 
Sent: Tuesday, October 28, 2014 3:20 PM
Subject: How can number of partitions be set in spark-env.sh?
 


I am running a standalone Spark cluster with 2 workers, each with 2 cores.
Apparently, I am loading and processing a relatively large chunk of data, and I 
receive task failures. From some posts and discussions on the mailing list, the 
failures could be related to the large amount of data being processed in each 
partition, and if I have understood correctly I should have smaller partitions 
(but many of them)?!

Is there any way that I can set the number of partitions dynamically, either in 
spark-env.sh or in the submitted Spark application?
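As an aside, the partition count can also be set per-RDD in application code 
rather than in spark-env.sh; a minimal sketch (the HDFS path below is only a 
placeholder):

val raw = sc.textFile("hdfs:///path/to/input", 200)   // ask for at least 200 input partitions
val smallerPartitions = raw.repartition(400)          // raise the partition count before a heavy stage
println(smallerPartitions.partitions.length)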


best,
/Shahab

Re: KMeans code is rubbish

2014-07-14 Thread Wanda Hawk
The problem is that I get the same results every time.


On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar atalwal...@gmail.com wrote:
 


Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer, even 
for seemingly simple toy examples. A common heuristic to deal with this issue 
is to run kmeans multiple times and choose the best answer.  You can do this by 
changing the runs parameter from the default value (1) to something larger (say 
10).
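A minimal sketch of that approach with the MLlib API (the input path and parsing 
below are only placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = new KMeans()
  .setK(2)
  .setRuns(10)            // run k-means 10 times and keep the best clustering
  .setMaxIterations(20)
  .run(data)
model.clusterCenters.foreach(println)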

-Ameet



On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

I also took a look at 
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.


There is an issue here:
    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }



If I use initMode=KMeans.RANDOM everything is OK.
If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result, and I do not know 
why. The proposed example is a really simple one that should not admit multiple 
solutions and should always converge to the correct one.


Now, what can be altered in the original SparkKMeans.scala (the seed, or 
something else?) to get the correct result every single time?
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
 dataset as well, I got the expected answer. And I believe that even though
 initialization is done using sampling, the example actually sets the seed to
 a constant 42, so the result should always be the same no matter how many
 times you run it. So I am not really sure whats going on here.

 Can you tell us more about which version of Spark you are running? Which
 Java version?


 ==

 [tdas @ Xion spark2] cat input
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info
 from
 SCDynamicStore
 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/10 02:45:07 WARN LoadSnappy:
 Snappy native library not loaded
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Finished iteration (delta = 3.0)
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(5.0, 2.0)
 DenseVector(2.0, 2.0)



 On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks
 out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14

 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector

Re: KMeans code is rubbish

2014-07-11 Thread Wanda Hawk
I also took a look at 
spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala
 and ran the code in a shell.

There is an issue here:
    val initMode = params.initializationMode match {
      case Random => KMeans.RANDOM
      case Parallel => KMeans.K_MEANS_PARALLEL
    }


If I use initMode=KMeans.RANDOM everything is OK.
If I use initMode=KMeans.K_MEANS_PARALLEL I get a wrong result, and I do not know 
why. The proposed example is a really simple one that should not admit multiple 
solutions and should always converge to the correct one.

Now, what can be altered in the original SparkKMeans.scala (the seed, or 
something else?) to get the correct result every single time?
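One way to experiment with this via the MLlib API (a sketch, not from the thread; 
vectors stands for an RDD[Vector] built as in the other messages) is to select 
the initialization mode explicitly:

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM)   // instead of the default KMeans.K_MEANS_PARALLEL
  .setRuns(5)                             // several runs make a bad random start less likely
  .run(vectors)
model.clusterCenters.foreach(println)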
On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui


On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your
 dataset as well, I got the expected answer. And I believe that even though
 initialization is done using sampling, the example actually sets the seed to
 a constant 42, so the result should always be the same no matter how many
 times you run it. So I am not really sure whats going on here.

 Can you tell us more about which version of Spark you are running? Which
 Java version?


 ==

 [tdas @ Xion spark2] cat input
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from
 SCDynamicStore
 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/07/10 02:45:07 WARN LoadSnappy:
 Snappy native library not loaded
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Finished iteration (delta = 3.0)
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(5.0, 2.0)
 DenseVector(2.0, 2.0)



 On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 so this is what I am running:
 ./bin/run-example SparkKMeans
 ~/Documents/2dim2.txt 2 0.001

 And this is the input file:
 ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
 └───#!cat ~/Documents/2dim2.txt
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3
 

 This is the final output from spark:
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks
 out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Getting 2 non-empty blocks out of 2 blocks
 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
 Started 0 remote fetches in 0 ms
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 14

 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on
 localhost (progress: 1/2)
 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
 14/07/10 20:05:12 INFO Executor: Finished task ID 15
 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on
 localhost (progress: 2/2)
 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at
 SparkKMeans.scala:75) finished in 0.008 s
 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose
 tasks
 have all completed, from pool
 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at
 SparkKMeans.scala:75, took 0.02472681 s
 Finished iteration (delta = 0.0)
 Final centers:
 DenseVector(2.8571428571428568, 2.0)
 DenseVector(5.6005, 2.0)
 




 On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com
 wrote:


 A picture is worth a thousand... Well, a picture with this dataset, what

 you are expecting and what you get, would help answering your initial
 question.

 Bertrand


 On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com
 wrote:

 Can someone please run the standard kMeans code on this input with 2
 centers ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result

KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)

Thanks, 
Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
so this is what I am running: 
./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

And this is the input file:
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


This is the final output from spark:
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 14
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost 
(progress: 1/2)
14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 15
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost 
(progress: 2/2)
14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at 
SparkKMeans.scala:75) finished in 0.008 s
14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have 
all completed, from pool
14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at 
SparkKMeans.scala:75, took 0.02472681 s
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)





On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com 
wrote:
 


A picture is worth a thousand... Well, a picture with this dataset, what you 
are expecting and what you get, would help answering your initial question.


Bertrand


On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)


Thanks, 
Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with ./bin/run-example SparkKMeans file.txt 2 0.001
I get this response:
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)


The starting point is not random. It uses the first K points from the given set.

On Thursday, July 10, 2014 11:57 AM, Sean Owen so...@cloudera.com wrote:
 


I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = 
sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)).map(p
=> Vectors.dense(Array[Double](p._1, p._2))))

val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)

model.clusterCenters

res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids.
It's possible that your run picked one that leads to a suboptimal
clustering. This is all the easier on a toy example like this and you
can find examples on the internet. That said, I never saw any other
answer.

The standard approach is to run many times. Call kmeans.setRuns(10) or
something to try 10 times instead of once.


On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Can someone please run the standard kMeans code on this input with 2 centers
 ?:
 2 1
 1 2
 3 2
 2 3
 4 1
 5 1
 6 1
 4 2
 6 2
 4 3
 5 3
 6 3

 The obvious result should be (2,2) and (5,2) ... (you can draw them if you
 don't believe me ...)

 Thanks,
 Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with java 1.8

java version 1.8.0_05
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

which spark-shell
~/bench/spark-1.0.0/bin/spark-shell

which scala
~/bench/scala-2.10.4/bin/scala


On Thursday, July 10, 2014 12:46 PM, Tathagata Das 
tathagata.das1...@gmail.com wrote:
 


I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your 
dataset as well, I got the expected answer. And I believe that even though 
initialization is done using sampling, the example actually sets the seed to a 
constant 42, so the result should always be the same no matter how many times 
you run it. So I am not really sure what's going on here.

Can you tell us more about which version of Spark you are running? Which Java 
version? 


==

[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from 
SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
Finished iteration (delta = 3.0)
Finished iteration (delta = 0.0)
Final centers:
DenseVector(5.0, 2.0)
DenseVector(2.0, 2.0)




On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

so this is what I am running: 
./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001


And this is the input file:
┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3



This is the final output from spark:
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
Getting 2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 0 ms
14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 14
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost 
(progress: 1/2)
14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 15
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost 
(progress: 2/2)
14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at 
SparkKMeans.scala:75) finished in 0.008 s
14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks 
have all completed, from pool
14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at 
SparkKMeans.scala:75, took 0.02472681 s
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)








On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com 
wrote:
 


A picture is worth a thousand... Well, a picture with this dataset, what you 
are expecting and what you get, would help answering your initial question.


Bertrand


On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers ?:
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3


The obvious result should be (2,2) and (5,2) ... (you can draw them if you 
don't believe me ...)


Thanks, 
Wanda




Re: java options for spark-1.0.0

2014-07-03 Thread Wanda Hawk
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line 
export _JAVA_OPTIONS=...)

/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms512m-Xmx512morg.apache.spark.deploy.SparkSubmit--classSparkKMeans--verbose--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


This is the cmdline from /proc/#pid with spark-0.8.0 when launching KMeans 
with scala -J-Xms16g -J-Xmx16g. The export line from bashrc is ignored 
here as well (if I launch without specifying the java options after the scala 
command, the heap gets the default value) - the results below are from 
launching it with the java options specified after the scala command:

/usr/java/jdk1.7.0_51/bin/java-Xmx256M-Xms32M-Xms16g-Xmx16g-Xbootclasspath/a:/home/spark2013/scala-2.9.3/lib/jline.jar:/home/spark2013/scala-2.9.3/lib/scalacheck.jar:/home/spark2013/scala-2.9.3/lib/scala-compiler.jar:/home/spark2013/scala-2.9.3/lib/scala-dbc.jar:/home/spark2013/scala-2.9.3/lib/scala-library.jar:/home/spark2013/scala-2.9.3/lib/scala-partest.jar:/home/spark2013/scala-2.9.3/lib/scalap.jar:/home/spark2013/scala-2.9.3/lib/scala-swing.jar-Dscala.usejavacp=true-Dscala.home=/home/spark2013/scala-2.9.3-Denv.emacs=scala.tools.nsc.MainGenericRunner-J-Xms16g-J-Xmx16g-cp/home/spark2013/Runs/KMeans/GC/classesSparkKMeanslocal[24]/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


Launching spark-1.0.0 with spark-submit and --driver-memory 10g gets picked up, 
but the results of the execution are the same: a lot of allocation failures.
/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001


Adding --executor-memory 11g will not change the outcome:
cat /proc/13286/cmdline
/usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--executor-memory11g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001

So the Xmx and Xms values can be altered, but the resulting performance is poor 
compared to spark-0.8.0. How can I improve it?
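For reference, a hedged sketch of passing the JVM options through spark-submit 
itself in 1.0 rather than through _JAVA_OPTIONS (sizes and paths are only 
placeholders taken from the command lines above):

./bin/spark-submit --class SparkKMeans --master local[24] \
  --driver-memory 15g \
  --driver-java-options "-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails" \
  target/scala-2.10/sparkkmeans_2.10-1.0.jar fisier_16mil_30D_R10k.txt 1024 0.001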


Thanks
On Wednesday, July 2, 2014 9:34 PM, Matei Zaharia matei.zaha...@gmail.com 
wrote:
 


Try looking at the running processes with “ps” to see their full command line 
and see whether any options are different. It seems like in both cases, your 
young generation is quite large (11 GB), which doesn’t make a lot of sense with a 
heap of 15 GB. But maybe I’m misreading something.

Matei

On Jul 2, 2014, at 4:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

I ran SparkKMeans with a big file (~7 GB of data) for one iteration with 
spark-0.8.0 with this line in bash.rc: export _JAVA_OPTIONS="-Xmx15g -Xms15g 
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails". It finished in a 
decent time, ~50 seconds, and I had only a few Full GC messages from 
Java (a maximum of 4-5).


Now, using the same export in bash.rc but with spark-1.0.0  (and running it 
with spark-submit) the first loop never finishes and  I get a lot of:
18.537: [GC (Allocation Failure) --[PSYoungGen: 
11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311 
secs] [Times: user=5.81 sys=2.12, real=2.85 secs]

or 


 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)] 
[ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace: 
37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, 
real=2.31 secs]
 
I tried passing different parameters for the JVM through spark-submit, but the 
results are the same
This happens with java 1.7 and also with java 1.8.
I do not know what the Ergonomics stands for ...


How can I get a decent performance from spark-1.0.0 considering

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-03 Thread Wanda Hawk
I have given this a try in a spark-shell and I still get many Allocation 
Failures.


On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote:
 


SparkKMeans is just example code showing a bare-bones
implementation of k-means. To run k-means on big datasets, please use
the KMeans implementation in MLlib directly:
http://spark.apache.org/docs/latest/mllib-clustering.html

-Xiangrui


On Wed, Jul 2, 2014 at 9:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 I can run it now with the suggested method. However, I have encountered a
 new problem that I have not faced before (sent another email with that one
 but here it goes again ...)

 I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with
 spark-0.8.0 with this line in bash.rc  export _JAVA_OPTIONS=-Xmx15g
 -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It
 finished in a decent time, ~50 seconds, and I had only a few Full GC
 messages from Java. (a max of 4-5)

 Now, using the same export in bash.rc but with spark-1.0.0  (and running it
 with spark-submit) the first loop never finishes and  I get a lot of:
 18.537: [GC (Allocation Failure) --[PSYoungGen:
 11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311
 secs] [Times: user=5.81 sys=2.12, real=2.85 secs]
 
 or

  31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)]
 [ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace:
 37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11,
 real=2.31 secs]
 real=2.31 secs]

 I tried passing different parameters for the JVM through spark-submit, but
 the results are the same
 This happens with java 1.7 and also with java 1.8.
 I do not know what the Ergonomics stands for ...

 How can I get decent performance from spark-1.0.0, considering that
 spark-0.8.0 did not need any fine-tuning of the garbage collection method
 (the default worked well)?

 Thank you


 On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:


 The scripts that Xiangrui mentions set up the classpath...Can you run
 ./run-example for the provided example sucessfully?

 What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call
 run-example -- that will show you the exact java command used to run
 the example at the start of execution. Assuming you can run examples
 succesfully, you should be able to just copy that and add your jar to
 the front of the classpath. If that works you can start removing extra
 jars (run-examples put all the example jars in the cp, which you won't
 need)

 As you said the error you see is indicative of the class not being
 available/seen at runtime but it's hard to tell why.

 On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 I want to make some minor modifications in the SparkMeans.scala so running
 the basic example won't do.
 I have also packed my code under a jar file with sbt. It completes
 successfully but when I try to run it : java -jar myjar.jar I get the
 same
 error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 

 If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does
 it succeeds in compiling and does not give the same error ?
 The error itself NoClassDefFoundError means that the files are available
 at compile time, but for some reason I cannot figure out they are not
 available at run time. Does anyone know why ?

 Thank you


 On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:


 You can use either bin/run-example or bin/spark-submit to run example
 code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark
 classpath. There are examples in the official doc:
 http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
 -Xiangrui

 On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added


 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples
 created
 a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-02 Thread Wanda Hawk
I want to make some minor modifications to SparkKMeans.scala, so running the 
basic example won't do. 
I have also packaged my code into a jar file with sbt. It builds 
successfully, but when I try to run it with java -jar myjar.jar I get the same 
error:
Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)


If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it 
succeed in compiling without giving the same error? 
The error itself, NoClassDefFoundError, means that the classes were available at 
compile time, but for some reason I cannot figure out they are not available at 
run time. Does anyone know why?

Thank you


On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:
 


You can use either bin/run-example or bin/spark-submit to run example
code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui
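For a user jar, the spark-submit form would look roughly like this (a sketch; the 
jar name, master, and arguments below are only placeholders):

./bin/spark-submit --class SparkKMeans --master local[4] myjar.jar ~/Documents/2dim2.txt 2 0.001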


On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added
 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
         at java.lang.Class.getDeclaredMethods0(Native Method)
         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
         at java.lang.Class.getMethod0(Class.java:2774)
         at java.lang.Class.getMethod(Class.java:1663)
         at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
         at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
         ... 6 more
 
 The jar under
 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 contains the breeze/linalg/Vector* path. I even tried to unpack it and put
 it in CLASSPATH, but it does not seem to get picked up.


 I am currently running java 1.8
 java version 1.8.0_05
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

 What am I doing wrong?


Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-02 Thread Wanda Hawk
Got it ! Ran the jar with spark-submit. Thanks !


On Wednesday, July 2, 2014 9:16 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 


I want to make some minor modifications in the SparkMeans.scala so running the 
basic example won't do. 
I have also packed my code under a jar file with sbt. It completes 
successfully but when I try to run it : java -jar myjar.jar I get the same 
error:
Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)


If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it 
succeeds in compiling and does not give the same error ? 
The error itself NoClassDefFoundError means that the files are available at 
compile time, but for some reason I cannot figure out they are not available at 
run time. Does anyone know why ?

Thank you


On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:
 


You can use either bin/run-example or bin/spark-submit to run example
code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui


On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added

 
/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:

 breeze/linalg/Vector
         at java.lang.Class.getDeclaredMethods0(Native Method)
         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
         at java.lang.Class.getMethod0(Class.java:2774)
         at java.lang.Class.getMethod(Class.java:1663)
         at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
         at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
         ... 6 more
 
 The jar under
 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 contains the breeze/linalg/Vector* path, I even tried to unpack it and put
 it in CLASSPATH to it does not seem to pick it up


 I am currently running java 1.8
 java version 1.8.0_05
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

 What I am doing wrong ?


java options for spark-1.0.0

2014-07-02 Thread Wanda Hawk
I ran SparkKMeans with a big file (~7 GB of data) for one iteration with 
spark-0.8.0 with this line in bash.rc: export _JAVA_OPTIONS="-Xmx15g -Xms15g 
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails". It finished in a 
decent time, ~50 seconds, and I had only a few Full GC messages from 
Java (a maximum of 4-5).

Now, using the same export in bash.rc but with spark-1.0.0  (and running it 
with spark-submit) the first loop never finishes and  I get a lot of:
18.537: [GC (Allocation Failure) --[PSYoungGen: 
11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311 
secs] [Times: user=5.81 sys=2.12, real=2.85 secs]

or 

 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)] 
[ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace: 
37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, 
real=2.31 secs]
 
I tried passing different parameters for the JVM through spark-submit, but the 
results are the same
This happens with java 1.7 and also with java 1.8.
I do not know what the Ergonomics stands for ...

How can I get decent performance from spark-1.0.0, considering that 
spark-0.8.0 did not need any fine-tuning of the garbage collection method (the 
default worked well)?

Thank you

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-02 Thread Wanda Hawk
I can run it now with the suggested method. However, I have encountered a new 
problem that I have not faced before (sent another email with that one but here 
it goes again ...)

I ran SparkKMeans with a big file (~7 GB of data) for one iteration with 
spark-0.8.0 with this line in bash.rc: export _JAVA_OPTIONS="-Xmx15g -Xms15g 
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails". It finished in a 
decent time, ~50 seconds, and I had only a few Full GC messages from 
Java (a maximum of 4-5).

Now, using the same export in bash.rc but with spark-1.0.0  (and running it 
with spark-submit) the first loop never finishes and  I get a lot of:
18.537: [GC (Allocation Failure) --[PSYoungGen: 
11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311 
secs] [Times: user=5.81 sys=2.12, real=2.85 secs]

or 

 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)] 
[ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace: 
37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, 
real=2.31 secs]
 
I tried passing different parameters for the JVM through spark-submit, but the 
results are the same
This happens with java 1.7 and also with java 1.8.
I do not know what the Ergonomics stands for ...

How can I get decent performance from spark-1.0.0, considering that 
spark-0.8.0 did not need any fine-tuning of the garbage collection method (the 
default worked well)?

Thank you


On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com 
wrote:
 


The scripts that Xiangrui mentions set up the classpath... Can you run
./run-example for the provided example successfully?

What you can try is setting SPARK_PRINT_LAUNCH_COMMAND=1 and then calling
run-example -- that will show you the exact java command used to run
the example at the start of execution. Assuming you can run the examples
successfully, you should be able to just copy that and add your jar to
the front of the classpath. If that works you can start removing extra
jars (run-example puts all the example jars in the cp, which you won't
need).

As you said the error you see is indicative of the class not being
available/seen at runtime but it's hard to tell why.
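As an illustration of that tip (a sketch; the input path is only a placeholder):

SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

This prints the full java command, including the classpath, before the example starts.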


On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 I want to make some minor modifications in the SparkMeans.scala so running
 the basic example won't do.
 I have also packed my code under a jar file with sbt. It completes
 successfully but when I try to run it : java -jar myjar.jar I get the same
 error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
         at java.lang.Class.getDeclaredMethods0(Native Method)
         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
         at java.lang.Class.getMethod0(Class.java:2774)
         at java.lang.Class.getMethod(Class.java:1663)
         at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
         at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 

 If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does
 it succeeds in compiling and does not give the same error ?
 The error itself NoClassDefFoundError means that the files are available
 at compile time, but for some reason I cannot figure out they are not
 available at run time. Does anyone know why ?

 Thank you


 On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:


 You can use either bin/run-example or bin/spark-submit to run example
 code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark
 classpath. There are examples in the official doc:
 http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
 -Xiangrui

 On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added

 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created
 a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
        at java.net.URLClassLoader$1.run