Spark SQL reduce number of java threads
Hello,

I am trying to reduce the number of Java threads (about 80 on my system) to as few as possible. What settings can be changed in spark-1.1.0/conf/spark-env.sh (or in other places as well)? I am also using Hadoop for storing data on HDFS.

Thank you,
Wanda
Re: Spark SQL reduce number of java threads
I am trying to get a software trace, and I need to get the number of active threads as low as I can in order to inspect the active part of the workload.

From: Prashant Sharma scrapco...@gmail.com
To: Wanda Hawk wanda_haw...@yahoo.com
Cc: user@spark.apache.org
Sent: Tuesday, October 28, 2014 11:17 AM
Subject: Re: Spark SQL reduce number of java threads

What is the motivation behind this? You can start with the master set to local[NO_OF_THREADS]. Reducing the threads at all other places can have unexpected results. Take a look at http://spark.apache.org/docs/latest/configuration.html.

Prashant Sharma

On Tue, Oct 28, 2014 at 2:08 PM, Wanda Hawk wanda_haw...@yahoo.com.invalid wrote:
Hello, I am trying to reduce the number of Java threads (about 80 on my system) to as few as possible. What settings can be changed in spark-1.1.0/conf/spark-env.sh (or in other places as well)? I am also using Hadoop for storing data on HDFS. Thank you, Wanda
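A minimal sketch of Prashant's suggestion, assuming a self-contained driver program (the application name and HDFS path below are placeholders only): pinning the master to local[1] limits Spark to a single task-execution thread, although Spark and the JVM still keep their own internal daemon threads (block manager, UI, GC, and so on), so the total thread count will not drop to one.

import org.apache.spark.{SparkConf, SparkContext}

object SingleThreadedApp {
  def main(args: Array[String]): Unit = {
    // local[1]: a single worker thread for task execution inside the driver JVM
    val conf = new SparkConf()
      .setAppName("SingleThreadedApp")
      .setMaster("local[1]")
    val sc = new SparkContext(conf)

    // hypothetical input path, just so there is something to run
    val lineCount = sc.textFile("hdfs:///tmp/input.txt").count()
    println("lines: " + lineCount)
    sc.stop()
  }
}

The same effect can be had without recompiling by passing --master local[1] to spark-submit.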
Re: How can number of partitions be set in spark-env.sh?
Is this what you are looking for?

"In Shark, default reducer number is 1 and is controlled by the property mapred.reduce.tasks. Spark SQL deprecates this property in favor of spark.sql.shuffle.partitions, whose default value is 200. Users may customize this property via SET:

SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c FROM logs_last_month_cached GROUP BY page ORDER BY c DESC LIMIT 10;"

(From the Spark SQL Programming Guide, Spark 1.1.0 documentation, on spark.apache.org.)

From: shahab shahab.mok...@gmail.com
To: user@spark.apache.org
Sent: Tuesday, October 28, 2014 3:20 PM
Subject: How can number of partitions be set in spark-env.sh?

I am running a standalone Spark cluster, 2 workers, each with 2 cores. Apparently I am loading and processing a relatively large chunk of data, so I receive task failures. As I read from some posts and discussions on the mailing list, the failures could be related to the large size of the data being processed in each partition; if I have understood correctly, I should use smaller partitions (but more of them)?!

Is there any way to set the number of partitions dynamically in spark-env.sh or in the submitted Spark application?

best,
/Shahab
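To illustrate both points in application code rather than in spark-env.sh, here is a rough sketch (paths and partition counts are placeholders, and it assumes the Spark 1.1 APIs): partition counts can be requested when reading, changed with repartition, and the post-shuffle partition count for Spark SQL can be set per session.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionTuning"))

    // Ask for at least 200 partitions when reading the file ...
    val lines = sc.textFile("hdfs:///tmp/big-input.txt", 200)
    // ... or explicitly reshuffle an existing RDD into more, smaller partitions.
    val smaller = lines.repartition(400)
    println("partitions: " + smaller.partitions.length)

    // For Spark SQL (1.1.x), control the number of post-shuffle partitions.
    val sqlContext = new SQLContext(sc)
    sqlContext.sql("SET spark.sql.shuffle.partitions=10")

    sc.stop()
  }
}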
Re: KMeans code is rubbish
The problem is that I get the same results every time.

On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar atalwal...@gmail.com wrote:

Hi Wanda,

As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run kmeans multiple times and choose the best answer. You can do this by changing the runs parameter from the default value (1) to something larger (say 10).

-Ameet

On Fri, Jul 11, 2014 at 1:20 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

I also took a look at spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala and ran the code in a shell. There is an issue here:

val initMode = params.initializationMode match {
  case Random => KMeans.RANDOM
  case Parallel => KMeans.K_MEANS_PARALLEL
}

If I use initMode = KMeans.RANDOM everything is ok. If I use initMode = KMeans.K_MEANS_PARALLEL I get a wrong result, and I do not know why. The example proposed is a really simple one that should not accept multiple solutions and should always converge to the correct one. Now, what can be altered in the original SparkKMeans.scala (the seed or something else?) to get the correct results each and every single time?

On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:

SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434

-Xiangrui

On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times you run it. So I am not really sure what's going on here. Can you tell us more about which version of Spark you are running? Which Java version?

==
[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Finished iteration (delta = 3.0) Finished iteration (delta = 0.0) Final centers: DenseVector(5.0, 2.0) DenseVector(2.0, 2.0) On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: so this is what I am running: ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001 And this is the input file: ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$ └───#!cat ~/Documents/2dim2.txt 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 This is the final output from spark: 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 14 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2) 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 15 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2) 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s Finished iteration (delta = 0.0) Final centers: DenseVector
Re: KMeans code is rubbish
I also took a look at spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala and ran the code in a shell. There is an issue here:

val initMode = params.initializationMode match {
  case Random => KMeans.RANDOM
  case Parallel => KMeans.K_MEANS_PARALLEL
}

If I use initMode = KMeans.RANDOM everything is ok. If I use initMode = KMeans.K_MEANS_PARALLEL I get a wrong result, and I do not know why. The example proposed is a really simple one that should not accept multiple solutions and should always converge to the correct one. Now, what can be altered in the original SparkKMeans.scala (the seed or something else?) to get the correct results each and every single time?

On Thursday, July 10, 2014 7:58 PM, Xiangrui Meng men...@gmail.com wrote:

SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434

-Xiangrui

On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times you run it. So I am not really sure what's going on here. Can you tell us more about which version of Spark you are running? Which Java version?

==
[tdas @ Xion spark2] cat input
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3
[tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001
2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from SCDynamicStore
14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Finished iteration (delta = 3.0) Finished iteration (delta = 0.0) Final centers: DenseVector(5.0, 2.0) DenseVector(2.0, 2.0) On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: so this is what I am running: ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001 And this is the input file: ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$ └───#!cat ~/Documents/2dim2.txt 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 This is the final output from spark: 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 14 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2) 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 15 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2) 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s Finished iteration (delta = 0.0) Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6005, 2.0) On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com wrote: A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Can someone please run the standard kMeans code on this input with 2 centers ?: 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 The obvious result
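As a point of comparison with the DenseKMeans example discussed above, here is a rough sketch (placeholder input path, Spark 1.0-era MLlib API) of selecting the initialization mode explicitly when calling mllib.clustering.KMeans directly; KMeans.RANDOM and KMeans.K_MEANS_PARALLEL are the two constants that the example's match expression maps to.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object InitModeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InitModeDemo"))

    // hypothetical whitespace-separated input, one point per line
    val data = sc.textFile("hdfs:///tmp/points.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    val kmeans = new KMeans()
      .setK(2)
      .setMaxIterations(20)
      .setInitializationMode(KMeans.RANDOM) // or KMeans.K_MEANS_PARALLEL (k-means||)

    val model = kmeans.run(data)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}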
KMeans code is rubbish
Can someone please run the standard kMeans code on this input with 2 centers?

2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...)

Thanks,
Wanda
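For the record, the expected centers follow from averaging the two visually obvious groups of points; a tiny self-contained check (the grouping below is the one Wanda implies, not anything computed by Spark):

object ExpectedCenters {
  def mean(ps: Seq[(Double, Double)]): (Double, Double) =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  def main(args: Array[String]): Unit = {
    val left  = Seq((2.0, 1.0), (1.0, 2.0), (3.0, 2.0), (2.0, 3.0))
    val right = Seq((4.0, 1.0), (5.0, 1.0), (6.0, 1.0), (4.0, 2.0),
                    (6.0, 2.0), (4.0, 3.0), (5.0, 3.0), (6.0, 3.0))
    println(mean(left))  // (2.0,2.0)
    println(mean(right)) // (5.0,2.0)
  }
}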
Re: KMeans code is rubbish
so this is what I am running:

./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001

And this is the input file:

┌───[spark2013@SparkOne]──[~/spark-1.0.0].$
└───#!cat ~/Documents/2dim2.txt
2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

This is the final output from spark:

14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 14
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2)
14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433
14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver
14/07/10 20:05:12 INFO Executor: Finished task ID 15
14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1)
14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2)
14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s
14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s
Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)

On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com wrote:

A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question.

Bertrand

On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers?

2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...)

Thanks,
Wanda
Re: KMeans code is rubbish
I ran the example with ./bin/run-example SparkKMeans file.txt 2 0.001 and I get this response:

Finished iteration (delta = 0.0)
Final centers:
DenseVector(2.8571428571428568, 2.0)
DenseVector(5.6005, 2.0)

The start point is not random. It uses the first K points from the given set.

On Thursday, July 10, 2014 11:57 AM, Sean Owen so...@cloudera.com wrote:

I ran it, and your answer is exactly what I got.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.clustering._

val vectors = sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)).map(p => Vectors.dense(Array[Double](p._1, p._2))))
val kmeans = new KMeans()
kmeans.setK(2)
val model = kmeans.run(vectors)
model.clusterCenters

res10: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,2.0], [2.0,2.0])

You may be aware that k-means starts from a random set of centroids. It's possible that your run picked one that leads to a suboptimal clustering. This is all the easier on a toy example like this, and you can find examples on the internet. That said, I never saw any other answer. The standard approach is to run many times; call kmeans.setRuns(10) or something to try 10 times instead of once.

On Thu, Jul 10, 2014 at 9:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Can someone please run the standard kMeans code on this input with 2 centers?

2 1
1 2
3 2
2 3
4 1
5 1
6 1
4 2
6 2
4 3
5 3
6 3

The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...)

Thanks,
Wanda
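To make the multiple-runs heuristic concrete, here is a small sketch along the lines of Sean's snippet (same toy data; the ordering of the returned centers is not guaranteed): with runs set above 1, MLlib performs several independent initializations and keeps the clustering with the lowest cost.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansMultipleRuns {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KMeansMultipleRuns").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val points = Array((2, 1), (1, 2), (3, 2), (2, 3), (4, 1), (5, 1),
                       (6, 1), (4, 2), (6, 2), (4, 3), (5, 3), (6, 3))
    val vectors = sc.parallelize(points.map(p => Vectors.dense(p._1.toDouble, p._2.toDouble))).cache()

    val model = new KMeans()
      .setK(2)
      .setMaxIterations(20)
      .setRuns(10) // 10 independent initializations; the best (lowest-cost) clustering is kept
      .run(vectors)

    model.clusterCenters.foreach(println) // expected: roughly [2.0,2.0] and [5.0,2.0]
    sc.stop()
  }
}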
Re: KMeans code is rubbish
I am running spark-1.0.0 with java 1.8 java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) which spark-shell ~/bench/spark-1.0.0/bin/spark-shell which scala ~/bench/scala-2.10.4/bin/scala On Thursday, July 10, 2014 12:46 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how many times you run it. So I am not really sure whats going on here. Can you tell us more about which version of Spark you are running? Which Java version? == [tdas @ Xion spark2] cat input 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 [tdas @ Xion spark2] ./bin/run-example SparkKMeans input 2 0.001 2014-07-10 02:45:06.764 java[45244:d17] Unable to load realm info from SCDynamicStore 14/07/10 02:45:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/10 02:45:07 WARN LoadSnappy: Snappy native library not loaded 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/07/10 02:45:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Finished iteration (delta = 3.0) Finished iteration (delta = 0.0) Final centers: DenseVector(5.0, 2.0) DenseVector(2.0, 2.0) On Thu, Jul 10, 2014 at 2:17 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: so this is what I am running: ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001 And this is the input file: ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$ └───#!cat ~/Documents/2dim2.txt 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 This is the final output from spark: 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 14/07/10 20:05:12 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms 14/07/10 20:05:12 INFO Executor: Serialized size of result for 14 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 14 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 14 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 0) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 14 in 5 ms on localhost (progress: 1/2) 14/07/10 20:05:12 INFO Executor: Serialized size of result for 15 is 1433 14/07/10 20:05:12 INFO Executor: Sending result for 15 directly to driver 14/07/10 20:05:12 INFO Executor: Finished task ID 15 14/07/10 20:05:12 INFO DAGScheduler: Completed ResultTask(6, 1) 14/07/10 20:05:12 INFO TaskSetManager: Finished TID 15 in 7 ms on localhost (progress: 2/2) 14/07/10 20:05:12 INFO DAGScheduler: Stage 6 (collectAsMap at SparkKMeans.scala:75) finished in 0.008 s 14/07/10 20:05:12 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool 14/07/10 20:05:12 INFO SparkContext: Job finished: collectAsMap at SparkKMeans.scala:75, took 0.02472681 s Finished 
iteration (delta = 0.0) Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6005, 2.0) On Thursday, July 10, 2014 12:02 PM, Bertrand Dechoux decho...@gmail.com wrote: A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Can someone please run the standard kMeans code on this input with 2 centers ?: 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...) Thanks, Wanda
Re: java options for spark-1.0.0
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line export _JAVA_OPTIONS=...) /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms512m-Xmx512morg.apache.spark.deploy.SparkSubmit--classSparkKMeans--verbose--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 This is the cmdline from /proc/#pid with spark-0.8.0 and launching KMeans with scala -J-Xms16g -J-Xms16g . The export line from bashrc is ignored here also (If I do launch without specifying the java options after the scala command , the heap will have the default value) - the results below are from launching it with the java options specified after the scala command: /usr/java/jdk1.7.0_51/bin/java-Xmx256M-Xms32M-Xms16g-Xmx16g-Xbootclasspath/a:/home/spark2013/scala-2.9.3/lib/jline.jar:/home/spark2013/scala-2.9.3/lib/scalacheck.jar:/home/spark2013/scala-2.9.3/lib/scala-compiler.jar:/home/spark2013/scala-2.9.3/lib/scala-dbc.jar:/home/spark2013/scala-2.9.3/lib/scala-library.jar:/home/spark2013/scala-2.9.3/lib/scala-partest.jar:/home/spark2013/scala-2.9.3/lib/scalap.jar:/home/spark2013/scala-2.9.3/lib/scala-swing.jar-Dscala.usejavacp=true-Dscala.home=/home/spark2013/scala-2.9.3-Denv.emacs=scala.tools.nsc.MainGenericRunner-J-Xms16g-J-Xmx16g-cp/home/spark2013/Runs/KMeans/GC/classesSparkKMeanslocal[24]/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 Launching spark-1.0.0 with spark-submit and --driver-memory-10g gets picked up, but the results in the execution are the same, a lot of alocation failures /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 Adding --executor-memory 11g will not change the outcome: cat /proc/13286/cmdline /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--executor-memory11g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 So the Xmx and Xms can be altered, but the execution is rubbish in performance compared to spark 0.8.0. How can I improve it ? 
Thanks On Wednesday, July 2, 2014 9:34 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try looking at the running processes with “ps” to see their full command line and see whether any options are different. It seems like in both cases, your young generation is quite large (11 GB), which doesn’t make lot of sense with a heap of 15 GB. But maybe I’m misreading something. Matei On Jul 2, 2014, at 4:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with spark-0.8.0 with this line in bash.rc export _JAVA_OPTIONS=-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java. (a max of 4-5) Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit) the first loop never finishes and I get a lot of: 18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K-11796992K(13762560K)] 11797442K-11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs] or 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K-3177967K(13762560K)] [ParOldGen: 505K-505K(512K)] 11797497K-3178473K(13763072K), [Metaspace: 37646K-37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs] I tried passing different parameters for the JVM through spark-submit, but the results are the same This happens with java 1.7 and also with java 1.8. I do not know what the Ergonomics stands for ... How can I get a decent performance from spark-1.0.0 considering
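A hedged follow-up to the cmdline dumps above (the flags are the ones documented for spark-submit around Spark 1.0; the jar, class, and input arguments are the ones already used in this thread): instead of relying on _JAVA_OPTIONS, the heap size and GC logging can be passed to the driver JVM explicitly, which is the JVM that matters when running local[24].

./bin/spark-submit \
  --class SparkKMeans \
  --master local[24] \
  --driver-memory 15g \
  --driver-java-options "-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails" \
  /home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar \
  /home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt 1024 0.001

If the sizing shown in the logs is the real problem (a roughly 11 GB young generation next to a 512K old generation), trying a different collector, for example adding -XX:+UseConcMarkSweepGC to the same --driver-java-options string, is one thing to experiment with; that is a guess to try, not a known fix.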
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
I have given this a try in a spark-shell and I still get many Allocation Failures On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote: The SparkKMeans is just an example code showing a barebone implementation of k-means. To run k-means on big datasets, please use the KMeans implemented in MLlib directly: http://spark.apache.org/docs/latest/mllib-clustering.html -Xiangrui On Wed, Jul 2, 2014 at 9:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I can run it now with the suggested method. However, I have encountered a new problem that I have not faced before (sent another email with that one but here it goes again ...) I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with spark-0.8.0 with this line in bash.rc export _JAVA_OPTIONS=-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java. (a max of 4-5) Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit) the first loop never finishes and I get a lot of: 18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K-11796992K(13762560K)] 11797442K-11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs] or 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K-3177967K(13762560K)] [ParOldGen: 505K-505K(512K)] 11797497K-3178473K(13763072K), [Metaspace: 37646K-37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs] I tried passing different parameters for the JVM through spark-submit, but the results are the same This happens with java 1.7 and also with java 1.8. I do not know what the Ergonomics stands for ... How can I get a decent performance from spark-1.0.0 considering that spark-0.8.0 did not need any fine tuning on the gargage collection method (the default worked well) ? Thank you On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: The scripts that Xiangrui mentions set up the classpath...Can you run ./run-example for the provided example sucessfully? What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call run-example -- that will show you the exact java command used to run the example at the start of execution. Assuming you can run examples succesfully, you should be able to just copy that and add your jar to the front of the classpath. If that works you can start removing extra jars (run-examples put all the example jars in the cp, which you won't need) As you said the error you see is indicative of the class not being available/seen at runtime but it's hard to tell why. On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I want to make some minor modifications in the SparkMeans.scala so running the basic example won't do. I have also packed my code under a jar file with sbt. It completes successfully but when I try to run it : java -jar myjar.jar I get the same error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it succeeds in compiling and does not give the same error ? 
The error itself NoClassDefFoundError means that the files are available at compile time, but for some reason I cannot figure out they are not available at run time. Does anyone know why ? Thank you On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote: You can use either bin/run-example or bin/spark-summit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Hello, I have installed spark-1.0.0 with scala2.10.3. I have built spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went here ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a new directory classes and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala Then I navigated to classes (I commented this line in the scala file : package org.apache.spark.examples ) and tried to run it with java -cp . SparkKMeans and I get the following error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
I want to make some minor modifications to SparkKMeans.scala, so running the basic example won't do. I have also packed my code into a jar file with sbt. It completes successfully, but when I try to run it with java -jar myjar.jar I get the same error:

Exception in thread "main" java.lang.NoClassDefFoundError: breeze/linalg/Vector
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
    at java.lang.Class.getMethod0(Class.java:2774)
    at java.lang.Class.getMethod(Class.java:1663)
    at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)

If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it succeed in compiling and not give the same error? The error itself, NoClassDefFoundError, means that the class was available at compile time but, for some reason I cannot figure out, is not available at run time. Does anyone know why?

Thank you

On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote:

You can use either bin/run-example or bin/spark-submit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here

-Xiangrui

On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:

Hello,

I have installed spark-1.0.0 with scala 2.10.3. I have built Spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went to ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples, created a new directory classes, and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala. Then I navigated to classes (I commented out this line in the scala file: package org.apache.spark.examples) and tried to run it with java -cp . SparkKMeans, and I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: breeze/linalg/Vector
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
    at java.lang.Class.getMethod0(Class.java:2774)
    at java.lang.Class.getMethod(Class.java:1663)
    at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 6 more

The jar under /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar contains the breeze/linalg/Vector* path. I even tried to unpack it and put it in CLASSPATH, but it does not seem to get picked up. I am currently running java 1.8:

java version 1.8.0_05
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

What am I doing wrong?
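For reference, the failure mode here is that java -jar puts only myjar.jar itself on the classpath (the CLASSPATH variable and -cp are ignored when -jar is used), so the Spark assembly that bundles breeze is never visible at run time. What eventually resolves this thread is letting spark-submit assemble the classpath; a rough sketch with the names from the question (myjar.jar, the SparkKMeans main class, and the argument placeholders are the poster's own or illustrative):

./bin/spark-submit --class SparkKMeans --master local[2] myjar.jar <input-file> <k> <convergence>

Running with plain java would instead require putting both jars on the classpath explicitly, e.g. java -cp myjar.jar:/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar SparkKMeans <args>, which is exactly the step that java -jar skips.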
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
Got it ! Ran the jar with spark-submit. Thanks ! On Wednesday, July 2, 2014 9:16 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I want to make some minor modifications in the SparkMeans.scala so running the basic example won't do. I have also packed my code under a jar file with sbt. It completes successfully but when I try to run it : java -jar myjar.jar I get the same error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it succeeds in compiling and does not give the same error ? The error itself NoClassDefFoundError means that the files are available at compile time, but for some reason I cannot figure out they are not available at run time. Does anyone know why ? Thank you On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote: You can use either bin/run-example or bin/spark-summit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Hello, I have installed spark-1.0.0 with scala2.10.3. I have built spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went here ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a new directory classes and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala Then I navigated to classes (I commented this line in the scala file : package org.apache.spark.examples ) and tried to run it with java -cp . SparkKMeans and I get the following error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 6 more The jar under /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar contains the breeze/linalg/Vector* path, I even tried to unpack it and put it in CLASSPATH to it does not seem to pick it up I am currently running java 1.8 java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) What I am doing wrong ?
java options for spark-1.0.0
I ran SparkKMeans with a big file (~7 GB of data) for one iteration with spark-0.8.0, with this line in bash.rc:

export _JAVA_OPTIONS="-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"

It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java (a max of 4-5). Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit), the first loop never finishes and I get a lot of:

18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K->11796992K(13762560K)] 11797442K->11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs]

or

31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K->3177967K(13762560K)] [ParOldGen: 505K->505K(512K)] 11797497K->3178473K(13763072K), [Metaspace: 37646K->37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs]

I tried passing different parameters for the JVM through spark-submit, but the results are the same. This happens with java 1.7 and also with java 1.8. I do not know what Ergonomics stands for ...

How can I get decent performance from spark-1.0.0, considering that spark-0.8.0 did not need any fine tuning of the garbage collection method (the default worked well)?

Thank you
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
I can run it now with the suggested method. However, I have encountered a new problem that I have not faced before (sent another email with that one but here it goes again ...) I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with spark-0.8.0 with this line in bash.rc export _JAVA_OPTIONS=-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java. (a max of 4-5) Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit) the first loop never finishes and I get a lot of: 18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K-11796992K(13762560K)] 11797442K-11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs] or 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K-3177967K(13762560K)] [ParOldGen: 505K-505K(512K)] 11797497K-3178473K(13763072K), [Metaspace: 37646K-37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs] I tried passing different parameters for the JVM through spark-submit, but the results are the same This happens with java 1.7 and also with java 1.8. I do not know what the Ergonomics stands for ... How can I get a decent performance from spark-1.0.0 considering that spark-0.8.0 did not need any fine tuning on the gargage collection method (the default worked well) ? Thank you On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: The scripts that Xiangrui mentions set up the classpath...Can you run ./run-example for the provided example sucessfully? What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call run-example -- that will show you the exact java command used to run the example at the start of execution. Assuming you can run examples succesfully, you should be able to just copy that and add your jar to the front of the classpath. If that works you can start removing extra jars (run-examples put all the example jars in the cp, which you won't need) As you said the error you see is indicative of the class not being available/seen at runtime but it's hard to tell why. On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I want to make some minor modifications in the SparkMeans.scala so running the basic example won't do. I have also packed my code under a jar file with sbt. It completes successfully but when I try to run it : java -jar myjar.jar I get the same error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it succeeds in compiling and does not give the same error ? The error itself NoClassDefFoundError means that the files are available at compile time, but for some reason I cannot figure out they are not available at run time. Does anyone know why ? Thank you On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote: You can use either bin/run-example or bin/spark-summit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark classpath. 
There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Hello, I have installed spark-1.0.0 with scala2.10.3. I have built spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went here ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a new directory classes and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala Then I navigated to classes (I commented this line in the scala file : package org.apache.spark.examples ) and tried to run it with java -cp . SparkKMeans and I get the following error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector at java.net.URLClassLoader$1.run