Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-69134341 Hi @yu-iskw and @rnowling , I've spent time reviewing the code and using it in both Python and Scala. Overall great work, terrific to see my little gist turned into something so refined and performant! =) I left lots of comments, most minor, though documenting the caching behavior seems quite important. The one significant addition I'd suggest is exposing another model output: a list of the centers at all nodes in the learned tree. This would be in addition to just the centers of the leaves, which is currently returned by `getCenters` (or `clusterCenters` in Python). Maybe call it `getTreeCenters`. It's basically given by `model.clusterTree.toSeq().map(_.center)`. But we should make sure it's sorted so that it can be indexed using the values from the merge list. In other words, if `Z` is the merge list, and row i indicates that `Z[i,0]` and `Z[i,1]` were merged, we want to be able to get the centers associated with those nodes by calling, for example, `model.treeCenters[Z[i,0]]` and `model.treeCenters[Z[i,1]]`. What do you think?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org