[ https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327272#comment-15327272 ]

Alessio edited comment on SPARK-15904 at 6/13/16 12:44 PM:
-----------------------------------------------------------

Dear Sean,
I must certainly agree with you on k << number of points. However, I have to test 
this dataset with several values of K, some of them rather large, as you can see.
The "verbose gc" option doesn't add anything new to the log.

Some more info, as you requested: I am running this experiment on PySpark. I have 
a .py script which essentially calls the "KMeans.train" function for several K 
values. Nothing too fancy, I reckon.

Inside a for-loop that scans several K candidates, there's this one line:

clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1,
    epsilon=0.0,
    initialModel=KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1, :]))

where the variables "K" and "k_tmp" are updated by the for-loop: the former is the 
actual K value, the latter is the position of K inside a vector of candidates.
The "KMeansModel" initial model is there because I have selected the initial seeds 
in a supervised fashion. "initCentroids" is a former Matlab cell array which 
contains the K point IDs that must be selected as seeds. These IDs are extracted 
from "datatmp", which is a local copy of the dataset (stored as a Python numpy 
array).
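
To make the structure clearer, the loop looks roughly like this (the candidate 
list "K_candidates" is only a placeholder for my actual vector of K values, and 
"parsedDataNOID", "datatmp" and "initCentroids" are assumed to be already loaded):

from pyspark.mllib.clustering import KMeans, KMeansModel

K_candidates = [10, 100, 1000]  # placeholder for the real vector of K candidates
for k_tmp, K in enumerate(K_candidates):
    # the Matlab point IDs are 1-based, hence the -1 when indexing the numpy copy
    seed_ids = initCentroids['initSeedsA'][0][k_tmp][0] - 1
    initial_model = KMeansModel(datatmp[seed_ids, :])
    clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1,
                            epsilon=0.0, initialModel=initial_model)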



> High Memory Pressure using MLlib K-means
> ----------------------------------------
>
>                 Key: SPARK-15904
>                 URL: https://issues.apache.org/jira/browse/SPARK-15904
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.1
>         Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>            Reporter: Alessio
>            Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the log lines reporting 
> the number of iterations, the cost function value and the running time, there's a 
> "Removing RDD <idx> from persistence list" stage. However, during this stage 
> there's high memory pressure, which is odd, since the RDDs are about to be 
> removed. Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with the Spark context master set to 
> local[*]. My machine has an i5 hyper-threaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G.
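
A minimal sketch of such a launch command, assuming the driver script is named 
kmeans_script.py (the script name is hypothetical; the master and memory settings 
are those reported above):

spark-submit --master "local[*]" --driver-memory 9G kmeans_script.py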



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
