Right now, I have issues even at a far earlier point. I'm fetching data from a registered table via
var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 20000000")
  .map(_.head.toString)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because it's used again later

var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80) // 80 = 2 * num_cpu

var count = dict.count.toInt

As far as I can see, it's the repartitioning that is causing the problems. However, without it, I have only one partition for all further RDD operations on dict, so it seems to be necessary. The errors given are

14/08/26 10:43:52 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1 (TID 2300, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        java.util.Arrays.copyOf(Arrays.java:3230)
        java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        ...

The RDD operations then start again, but later I get

14/08/26 10:47:14 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.2 (TID 2655, idp41.foo.bar): java.lang.NullPointerException
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:26)
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:26)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
        ...

and another java.lang.OutOfMemoryError.
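
To verify the one-partition theory before blaming the shuffle itself, I'd also check how the data is actually distributed. A quick diagnostic I plan to run, using only the standard RDD API on the same texts RDD as above (nothing here is specific to my setup):

// How many partitions are there, and how skewed are they?
println("texts partitions: " + texts.partitions.length)
texts.mapPartitionsWithIndex { (i, rows) =>
  Iterator((i, rows.size)) // count rows in each partition
}.collect().foreach { case (i, n) =>
  println("partition " + i + " -> " + n + " rows")
}

If this prints a single partition holding all the rows, that would explain why every stage funnels through one task until the repartition.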
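
One thing I haven't tried yet, just a sketch of what I'm considering: moving the repartition(80) up to texts, directly after the SQL query, so that the flatMap already runs across 80 tasks and the shuffle never has to serialize one giant partition inside a single task. Same identifiers as above; untested on my cluster:

// Untested sketch: repartition before the narrow transformations instead
// of after them, so no single task holds all the rows at shuffle time.
var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 20000000")
  .map(_.head.toString)
  .repartition(80) // 80 = 2 * num_cpu, as before
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)

var dict = texts.flatMap(_.split(" ").map(_.toLowerCase()))
var count = dict.count.toInt

If the OutOfMemoryError really comes from one task buffering the whole dataset during the shuffle, spreading the data out earlier should at least change where things fail.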