[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

Sean Owen (JIRA) Mon, 13 Jun 2016 07:42:14 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327510#comment-15327510
 ]


Sean Owen commented on SPARK-15904:
-----------------------------------

Yes, that just means "out of memory". The question is whether this is unusual 
or not. You might try storing the serialized representation in memory rather 
than the 'raw' object form, which is often bigger. You almost certainly need 
more partitions in the source data; I expect it has just 1 or 2 partitions 
according to the block size, and you probably want the problem broken down 
into smaller chunks rather than processing big chunks at once in memory. The 
minimum number of partitions is the second arg to textFile.
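
A minimal sketch of the two suggestions above, not a definitive fix. It 
assumes a whitespace-separated text file of numeric features and a target of 
roughly 32 MB per partition; both are illustrative assumptions, as are the 
helper names suggest_min_partitions and load_for_kmeans. The only Spark calls 
used are textFile(path, minPartitions), map and persist(storageLevel).

```python
import math


def suggest_min_partitions(dataset_bytes, target_partition_bytes=32 * 1024 * 1024):
    """Hypothetical helper: how many partitions keep each chunk near the target size.

    The 32 MB default is an assumption for illustration, not a Spark default.
    """
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


def load_for_kmeans(sc, path, dataset_bytes, storage_level):
    """Read a text file into more, smaller partitions and persist it.

    sc            -- an existing SparkContext
    path          -- input file; assumed to hold whitespace-separated numbers
    dataset_bytes -- approximate input size, used only to pick a partition count
    storage_level -- the level to persist with, chosen by the caller
    """
    min_partitions = suggest_min_partitions(dataset_bytes)
    # The second argument to textFile is the minimum number of partitions.
    lines = sc.textFile(path, min_partitions)
    # Parse each line into a list of floats (assumed input format).
    points = lines.map(lambda line: [float(v) for v in line.split()])
    return points.persist(storage_level)
```

For a ~400 MB input, suggest_min_partitions(400 * 1024 * 1024) gives 13 under 
the 32 MB assumption. With the Scala or Java API the serialized level would be 
StorageLevel.MEMORY_AND_DISK_SER; in PySpark, the programming guide notes that 
stored objects are always pickled, so there the partition count is the part 
that matters.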

Finally, you may get better results with 2.0, or by using the ML + Dataset 
APIs. Those are bigger changes, though.

> High Memory Pressure using MLlib K-means
> ----------------------------------------
>
>                 Key: SPARK-15904
>                 URL: https://issues.apache.org/jira/browse/SPARK-15904
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>         Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>            Reporter: Alessio
>            Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time are reported, 
> there's a "Removing RDD <idx> from persistence list" stage. During this 
> stage there's high memory pressure. Weird, since the RDDs are about to be 
> removed. Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

