Right now, I have issues even at a far earlier point.

I'm fetching data from a registered table via

var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 20000000")
  .map(_.head.toString)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
// persisted because it's used again later

var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80)
// 80 = 2 * num_cpu

var count = dict.count.toInt


As far as I can see, it's the repartitioning that is causing the problems.
However, without it I end up with only one partition for all further RDD
operations on dict, so it seems to be necessary.
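Just for illustration (the variable name is made up), the variant without the
shuffle would look like this -- flatMap is a narrow transformation, so dict
simply inherits the single partition from texts:

var dictNoShuffle = texts.flatMap(_.split(" ").map(_.toLowerCase()))
// no repartition(80): dict keeps texts' single partition, so dict.count
// and all later stages on it run as a single task on one core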

The errors given are

14/08/26 10:43:52 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1
(TID 2300, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
        java.util.Arrays.copyOf(Arrays.java:3230)
        java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
       
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        ...


The RDD operations then start again, but later on I get

14/08/26 10:47:14 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.2
(TID 2655, idp41.foo.bar): java.lang.NullPointerException:
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:26)
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:26)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
        ....

and another java.lang.OutOfMemoryError.


