Re: Only master is really busy at KMeans training
Right now, I have issues even at a far earlier point. I'm fetching data from a registered table via

    var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because it's used again later
    var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80) // 80 = 2 * num_cpu
    var count = dict.count.toInt

As far as I can see, it's the repartitioning that is causing the problems. However, without it, I have only one partition for the further RDD operations on dict, so it seems to be necessary. The errors given are

    14/08/26 10:43:52 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1 (TID 2300, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
            java.util.Arrays.copyOf(Arrays.java:3230)
            java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
            java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
            java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
            ...

Then the RDD operations start again, but later I get

    14/08/26 10:47:14 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.2 (TID 2655, idp41.foo.bar): java.lang.NullPointerException:
            $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:26)
            $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:26)
            scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
            org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)

and another java.lang.OutOfMemoryError.
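A minimal sketch of one alternative worth trying (untested, same identifiers as above): repartition the much smaller texts RDD before the flatMap, so the shuffle moves whole tweets rather than individual tokens, and dict then inherits those partitions without needing its own shuffle.

    var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000")
      .map(_.head.toString)
      .repartition(80) // 80 = 2 * num_cpu, as above; the shuffle now happens on ~2000 rows
      .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
    var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())) // inherits the 80 partitions
    var count = dict.count.toInt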
Re: Only master is really busy at KMeans training
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD:

    var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
    var broadcastVector = sc.broadcast(vectors2)

What is the total memory of your cluster? Does the dataset fit into memory? If not, you can try turning on `spark.rdd.compress`. The whole dataset is not small.

-Xiangrui

On Mon, Aug 25, 2014 at 11:46 PM, durin wrote:
> With a lower number of partitions, I keep losing executors during
> collect at KMeans.scala:283.
> The error message is "ExecutorLostFailure (executor lost)".
> The program recovers by automatically repartitioning the whole dataset
> (126G), which takes very long and seems to only delay the inevitable
> failure.
>
> Is there a recommended solution to this issue?
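A minimal sketch of the distinction Xiangrui is drawing (assuming the vectors fit in driver memory; vectors2 as in the code above):

    // Broadcasting an RDD only ships the RDD handle, not its data.
    // sc.broadcast is meant for local values, e.g. a collected array:
    val localVectors = vectors2.collect()             // pulls all vectors to the driver
    val broadcastVectors = sc.broadcast(localVectors) // workers read broadcastVectors.value
    // If the data does not fit on the driver, skip the broadcast entirely
    // and pass the RDD itself to the training code.

As for `spark.rdd.compress`: it is set in spark-defaults.conf (spark.rdd.compress true) and compresses serialized persisted partitions such as MEMORY_AND_DISK_SER, trading some CPU time for memory.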
Re: Only master is really busy at KMeans training
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283. The error message is "ExecutorLostFailure (executor lost)".

The program recovers by automatically repartitioning the whole dataset (126G), which takes very long and seems to only delay the inevitable failure.

Is there a recommended solution to this issue?
Re: Only master is really busy at KMeans training
There are only 5 worker nodes, so please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition.

-Xiangrui

On Tue, Aug 19, 2014 at 1:41 PM, durin wrote:
> When trying to use KMeans.train with some large data and 5 worker nodes, it
> would fail due to BlockManagers shutting down because of timeout. I was able
> to prevent that by adding
>
>     spark.storage.blockManagerSlaveTimeoutMs 300
>
> to the spark-defaults.conf.
>
> However, with 1 million feature vectors, the stage takeSample at
> KMeans.scala:263 runs for about 50 minutes. In this time, about half of the
> tasks are done, then I lose the executors and Spark starts a new
> repartitioning stage.
>
> I also noticed that in the takeSample stage, a task would run for about
> 2.5 minutes until suddenly it was finished and its duration (previously
> those 2.5 min) changed to 2s, with 0.9s GC time.
>
> The training data is supplied in this form:
>
>     var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
>     var broadcastVector = sc.broadcast(vectors2)
>
> The 1000 partitions are something that could probably be optimized, but too
> few will cause OOM errors.
>
> Using Ganglia, I can see that the master node is the only one that is
> properly busy regarding CPU, and that most (600-700 of 800 total percent
> CPU) is used by the master. The workers on each node only use one core,
> i.e. 100% CPU.
>
> What would be the most likely cause for such an inefficient use of the
> cluster, and how can I prevent it? The number of partitions, the way of
> caching, ...?
>
> I'm trying to find out myself with tests, but ideas from someone with more
> experience are very welcome.
>
> Best regards,
> simn
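A minimal sketch of this suggestion, using the identifiers from the quoted code (the per-worker core count and the KMeans parameters are hypothetical placeholders; substitute the real values for your cluster):

    // Hypothetical: 5 workers with 16 cores each.
    val numCores = 5 * 16
    var vectors2 = vectors
      .repartition(numCores) // far fewer result blocks for the driver to collect than 1000
      .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
    // Pass the RDD directly to training; no broadcast of the RDD needed.
    val model = org.apache.spark.mllib.clustering.KMeans.train(vectors2, 10, 20) // k = 10, maxIterations = 20, both placeholders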