Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/1964#issuecomment-54953348
  
    @mengxr, @erikerlandson, @rnowling  Sorry for having been kept waiting. I 
redesigned the interface of distance functions. Here is the full file change.
    https://github.com/apache/spark/pull/1964/files
    
    I agree to keep MLlib's linear algebra lightweight. And we would had better 
to use Breeze in terms of the performance. However, I think the standardized 
distance function API helps us to expand MLlib efficiently.  And also, by 
contributing another distance functions to Breeze, we can use them in Spark.
    
    I am a bit worried about using Breeze directly in Spark. One reason is that 
we can't absolutely control the release of Breeze. If we want to use a new 
distance function in Spark, we contribute it to Breeze as I did. Because there 
is a time difference between being merged and release. We have to wait. That's 
why we should have a wrapper API for Breeze linear algebra. And the wrapper API 
helps us to extend another distance function easily.
    
    I think we need two the standardized distance function API. One is for 
Spark user, the other is for MLlib developer. `(Vector, Vector) => Double` is 
for Spark user.  `(BV[Double], BV[Double]) => Double ` is for MLlib developer. 
And Breeze should be hidden for Spark user.  At least, we should have a mapping 
function from `String` to a distance function because of PySpark API.
    
    We have to use the implementation in a wrapper API as much as we can. That 
is, you know, I contributed some distance functions to Breeze. However,they 
have not  been released on Maven repository. That's why we can't use them in 
Spark yet. If they will be released, I will modify the implementation, such as 
the  `apply` method of `EuclideanDistanceMetric` class.
    
    ## Diff Summary
    
    - `(BV[Double], BV[Double]) => Double` is for Spark user, such as 
`EuclideanDistanceMetric` class
    - Companion Object of the distance Function is for MLlib developer, such as 
`EuclideanDistanceMetric` object
    - Extract WeightedDistanceMetric and its sub class from DistanceMetric
    - Extract WeightedDistanceMeasure and its sub class from DistanceMeasure
    - Add a mapping function in 
`mllib/src/main/scala/org/apache/spark/mllib/linalg/distance/DistanceMeasureFactory.scala`
    
    thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to