I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The input RDD is 3.1MB and it's all in one partition. I am
running on a single node with 15G of memory, giving the driver 1G and the
executor 9G. This is a single-node Hadoop setup.
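
For reference, here is a minimal sketch of roughly what I am running (the
app name, the random fill, and the 0.1 threshold are placeholders for my
actual data and settings; the job is submitted with --driver-memory 1g and
--executor-memory 9g):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

import scala.util.Random

object ColSimRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ColSimRepro")
    val sc = new SparkContext(conf)

    // 30 rows x 1000 columns of DenseVectors, all in a single partition
    val rows = sc.parallelize(
      Seq.fill(30)(Vectors.dense(Array.fill(1000)(Random.nextDouble()))),
      numSlices = 1)

    val mat = new RowMatrix(rows)

    // columnSimilarities goes through the DIMSUM sampling path
    // (columnSimilaritiesDIMSUM), which is where the OOM shows up
    val sims = mat.columnSimilarities(0.1)
    sims.entries.count()

    sc.stop()
  }
}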
On the first attempt the BlockManager doesn't respond within the heartbeat
interval. On the second attempt I am seeing a GC overhead limit exceeded
error, and it is almost always in RowMatrix.columnSimilaritiesDIMSUM ->
mapPartitionsWithIndex (line 570):
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:570)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:528)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
It really does seem to be running out of memory; I am seeing the following
in the attempt log:
Heap
 PSYoungGen      total 2752512K, used 2359296K
  eden space 2359296K, 100% used
  from space  393216K,   0% used
  to   space  393216K,   0% used
 ParOldGen       total 6291456K, used 6291376K [0x0000000580000000, 0x0000000700000000, 0x0000000700000000)
  object space 6291456K, 99% used
 Metaspace       used 39225K, capacity 39558K, committed 39904K, reserved 1083392K
  class space    used 5736K, capacity 5794K, committed 5888K, reserved 1048576K
What could be going wrong?
Regards
Sab