[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332185#comment-15332185 ]
Miao Wang commented on SPARK-15784:
-----------------------------------

[~josephkb] [~mengxr] [~yanboliang]

I am trying to add PIC to spark.ml and I have some questions regarding model.predict and saveImpl.

The basic PIC algorithm has the following steps:

Input: A row-normalized affinity matrix W and the number of clusters k
Output: Clusters C1, C2, …, Ck

  Pick an initial vector v0
  Repeat
    Set vt+1 ← Wvt
    Set δt+1 ← |vt+1 – vt|
    Increment t
  Stop when |δt – δt-1| ≈ 0
  Use k-means to cluster points on vt and return clusters C1, C2, …, Ck

In the last step, k-means takes the pseudo-eigenvector `v` generated by PIC and clusters its entries. Therefore, model.predict should use the trained k-means to do the prediction. However, to obtain `v` for the data to be predicted, PIC has to be run again on that data. So there is no trained model that can be applied to a new data set: model.predict is actually training again using the PIC.fit method. In this case, PIC.fit and PIC.predict would call the same run method in the MLlib implementation.

Since we have to train on the data anyway, saving the model is not useful, as there is no model to be saved. In the MLlib implementation, the save function saves the assignment results for the current data set, which can't be used for clustering new data. The only use of the saved result is that, when the same data is given again, we don't have to train again. However, we have no way of knowing whether the given data is the training data behind the saved model.

Please correct me if I misunderstand anything.

Thanks!

Miao

> Add Power Iteration Clustering to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model parity for spark.ml. The review JIRA for clustering is SPARK-14380.
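
For reference, the PIC steps listed in the comment above can be sketched in plain Python. This is only an illustration, not the MLlib code: the function names (`pic_embedding`, `kmeans_1d`, `pic_cluster`), the degree-based choice of v0, the L1 rescaling after each multiplication, the max-norm used for δ, and the tiny 1-D k-means are all assumptions made for the example.

```python
def pic_embedding(affinity, max_iter=100, tol=1e-5):
    """Power iteration on the row-normalized affinity matrix (hypothetical helper)."""
    degrees = [sum(row) for row in affinity]
    # W: row-normalized affinity matrix, as in the "Input" line of the comment.
    W = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    # Assumption: v0 is the normalized degree vector; the comment only says
    # "pick an initial vector".
    total = sum(degrees)
    v = [d / total for d in degrees]
    prev_delta = float("inf")
    for _ in range(max_iter):
        wv = [sum(w * x for w, x in zip(row, v)) for row in W]
        # Assumption: rescale to unit L1 norm so the vector does not drift in scale.
        norm = sum(abs(x) for x in wv)
        nxt = [x / norm for x in wv]
        # delta_{t+1} = |v_{t+1} - v_t| (max-norm is an assumption here).
        delta = max(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        # Stop when |delta_t - delta_{t-1}| is approximately 0.
        if abs(delta - prev_delta) < tol:
            break
        prev_delta = delta
    return v


def kmeans_1d(values, k, iters=20):
    """Very small 1-D Lloyd's k-means with deterministic initial centers."""
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(i * (n - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def pic_cluster(affinity, k):
    """Run PIC, then k-means on the resulting pseudo-eigenvector."""
    return kmeans_1d(pic_embedding(affinity), k)


# Toy data: points 0-2 are strongly similar to each other, as are points 3-4;
# similarities across the two groups are weak.
s, w = 1.0, 0.01
example_affinity = [
    [0, s, s, w, w],
    [s, 0, s, w, w],
    [s, s, 0, w, w],
    [w, w, w, 0, s],
    [w, w, w, s, 0],
]
labels = pic_cluster(example_affinity, 2)
```

Note that the only outputs are the per-point assignments for the affinity matrix that was passed in. Nothing in the sketch yields a mapping that could be applied to unseen points without rerunning the iteration, which is the concern raised about model.predict and save.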