[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482297#comment-14482297 ]

Sai Nishanth Parepally edited comment on SPARK-3219 at 4/7/15 6:10 PM:
---

[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering going to be merged into MLlib? I would like to use Jaccard distance as the distance metric for k-means clustering. Should I add this distance metric to derrickburns's repository, or make the current MLlib implementation of k-means accept a method that computes the distance between any two points?

was (Author: nishanthps):
[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering going to be merged into MLlib? I would like to use Jaccard distance as the distance metric for k-means clustering.

K-Means clusterer should support Bregman distance functions
---
Key: SPARK-3219
URL: https://issues.apache.org/jira/browse/SPARK-3219
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
Labels: clustering

The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously.

I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
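For concreteness, the Jaccard distance mentioned in the comment above can be sketched as follows over sets of feature indices. This is a hypothetical standalone example, not an MLlib API; the object and method names are invented for illustration:

```scala
// Hypothetical sketch: Jaccard distance between two sets of feature indices.
// Distance is 1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical.
object JaccardDistance {
  def distance(a: Set[Int], b: Set[Int]): Double = {
    if (a.isEmpty && b.isEmpty) 0.0
    else {
      val intersection = (a intersect b).size.toDouble
      val union = (a union b).size.toDouble
      1.0 - intersection / union
    }
  }
}
```

Note that Jaccard distance is a metric but not a Bregman divergence, which is part of why plugging it into a k-means implementation is a design question rather than a drop-in change.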
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:14 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
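As a rough illustration of the PointOps idea, a squared-Euclidean instantiation might look like the sketch below. FastPoint, FastCenter, and SquaredEuclideanOps are hypothetical stand-ins invented for this example, not types from the proposal; the pre-computed squared norm is the kind of cached value the trait is meant to enable:

```scala
// Hypothetical, self-contained sketch of the PointOps idea for squared
// Euclidean distance. Points and centers cache their squared norm so that
// |p - c|^2 can be computed as |p|^2 + |c|^2 - 2 p.c with a single dot product.
case class FastPoint(raw: Array[Double], norm2: Double)
case class FastCenter(raw: Array[Double], norm2: Double)

object SquaredEuclideanOps {
  private def norm2(v: Array[Double]): Double = v.map(x => x * x).sum

  def userToPoint(v: Array[Double]): FastPoint = FastPoint(v, norm2(v))
  def pointToCenter(p: FastPoint): FastCenter = FastCenter(p.raw, p.norm2)

  def distance(p: FastPoint, c: FastCenter): Double = {
    val dot = p.raw.zip(c.raw).map { case (a, b) => a * b }.sum
    // Clamp at zero to guard against tiny negative values from rounding.
    math.max(0.0, p.norm2 + c.norm2 - 2.0 * dot)
  }
}
```

Because the cached norms live inside the point and center types, the generic clustering loop never needs to know they exist.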
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
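The Bregman divergences referenced in the issue description all have the form D_F(x, y) = F(x) - F(y) - ⟨∇F(y), x - y⟩ for a convex function F. A generic sketch of that formula follows; the Bregman object and its divergence helper are hypothetical names for illustration, not part of the proposal:

```scala
// Hypothetical sketch of a Bregman divergence built from a convex function f
// and its gradient gradF, per Banerjee et al. (linked in the issue description):
//   D_f(x, y) = f(x) - f(y) - <gradF(y), x - y>
object Bregman {
  def divergence(f: Array[Double] => Double,
                 gradF: Array[Double] => Array[Double])
                (x: Array[Double], y: Array[Double]): Double = {
    val g = gradF(y)
    val inner = g.zip(x.zip(y)).map { case (gi, (xi, yi)) => gi * (xi - yi) }.sum
    f(x) - f(y) - inner
  }
}
```

Choosing f(x) = |x|² with gradient 2y recovers squared Euclidean distance, which is why the existing clusterer is a special case of the Bregman family.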
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:16 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:15 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done with the Fast Euclidean distance, which pre-computes magnitudes. (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done with the Fast Euclidean distance, which pre-computes magnitudes. (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
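The Kullback-Leibler pre-computation described above can be sketched as follows; KLPoint and the KL object are hypothetical names invented for this example, and it assumes strictly positive components:

```scala
// Hypothetical sketch of caching log(p) once per point so the inner
// distance loop avoids repeated math.log calls.
case class KLPoint(p: Array[Double], logP: Array[Double])

object KL {
  def prepare(p: Array[Double]): KLPoint = KLPoint(p, p.map(math.log))

  // KL(p || q) = sum_i p_i * (log p_i - log q_i), assuming all p_i, q_i > 0.
  def divergence(x: KLPoint, y: KLPoint): Double =
    x.p.zip(x.logP.zip(y.logP)).map { case (pi, (lp, lq)) => pi * (lp - lq) }.sum
}
```

Since centers are compared against many points per iteration, caching the logs turns an O(d) transcendental cost per comparison into plain multiplications and subtractions.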
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done with the Fast Euclidean distance, which pre-computes magnitudes. (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Point (P), Center (C), and Centroid. One can then implement a distance function trait (called PointOps below) in a way that allows the implementer to pre-compute values for Point and Center, as is hard-coded for the fast squared Euclidean distance function in the 1.0.2 K-Means implementation. Since the representation of Point and Center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation. Additionally, one can abstract the Distance type (Float or Double) and the user data point type T.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:23 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation, which pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done with the Fast Euclidean distance, which pre-computes magnitudes. (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134339#comment-14134339 ]

Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:24 PM:
---

The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation, which pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points, which are too expensive to re-compute in the distance calculation!) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}

was (Author: derrickburns):
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (e.g. Float or Double), T (the input data type of a point), P (a point as represented by the distance function), C (a cluster center as represented by the distance function), and Centroid.

By separating the user type T from the types P (point) and C (center), one can pre-compute values (as is done with the Fast Euclidean distance in the 1.0.2 implementation, which pre-computes magnitudes). (With more complex distance functions such as the Kullback-Leibler divergence, one can pre-compute the logs of the points.) Further, since the representation of point and center is abstracted, the implementer of the trait can use JBlas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134339#comment-14134339 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---
The key abstractions that need to be added to the K-Means implementation to support interesting distance functions are: Distance (the type used to represent a distance, such as Float or Double), T (the data type used for a point by the K-Means client), P (the data type used for a point by the distance function), C (the data type used for a cluster center by the distance function), and Centroid. By separating the user type T from the types P (point) and C (center), one can pre-compute values, as is done with the fast Euclidean distance in the 1.0.2 implementation, which pre-computes vector magnitudes. (With more complex distance functions such as Kullback-Leibler divergence, one can pre-compute the logs of the points, which are too expensive to re-compute in every distance calculation!) Further, since the representations of point and center are abstracted, the implementer of the trait can use jblas, Breeze, or whatever math library is preferred, again without touching the generic K-Means implementation.

{code}
type Distance = Double

trait FP[T] extends Serializable {
  val weight: Distance
  val index: Option[T]
  val raw: Array[Distance]
}

// Note: P and C take subtype bounds (<:) rather than context bounds (:),
// since both must themselves be FP[T] representations.
trait PointOps[P <: FP[T], C <: FP[T], T] {
  def distance(p: P, c: C, upperBound: Distance): Distance
  def userToPoint(v: Array[Double], index: Option[T]): P
  def centerToPoint(v: C): P
  def pointToCenter(v: P): C
  def centroidToCenter(v: Centroid): C
  def centroidToPoint(v: Centroid): P
  def centerMoved(v: P, w: C): Boolean
}
{code}
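The abstractions above can be exercised end to end. Below is a minimal, self-contained sketch, not the actual repository code: the names EuclideanPoint and SquaredEuclideanOps are hypothetical, and Centroid (undefined in the comment) is stubbed as a mutable weighted sum, just to show how the pieces fit together.

```scala
// Hypothetical sketch of the proposed K-Means abstractions. Centroid,
// EuclideanPoint, and SquaredEuclideanOps are stand-in names for illustration.
object KMeansAbstractionSketch {
  type Distance = Double

  // Mutable running weighted sum used to accumulate a cluster's members.
  final case class Centroid(raw: Array[Distance], var weight: Distance) {
    def add(p: Array[Distance], w: Distance): Unit = {
      var i = 0
      while (i < raw.length) { raw(i) += p(i) * w; i += 1 }
      weight += w
    }
  }

  trait FP[T] extends Serializable {
    val weight: Distance
    val index: Option[T]
    val raw: Array[Distance]
  }

  final case class EuclideanPoint[T](raw: Array[Distance],
                                     weight: Distance,
                                     index: Option[T]) extends FP[T]

  // One concrete PointOps-style implementation where P = C = EuclideanPoint[T].
  class SquaredEuclideanOps[T] {
    type P = EuclideanPoint[T]
    type C = EuclideanPoint[T]

    // Early-exit distance: once the partial sum passes upperBound, the caller
    // already knows this center cannot be the closest, so stop summing.
    def distance(p: P, c: C, upperBound: Distance): Distance = {
      var sum = 0.0; var i = 0
      while (i < p.raw.length && sum < upperBound) {
        val d = p.raw(i) - c.raw(i); sum += d * d; i += 1
      }
      sum
    }
    def userToPoint(v: Array[Double], index: Option[T]): P =
      EuclideanPoint(v, 1.0, index)
    def centroidToCenter(cen: Centroid): C =
      EuclideanPoint(cen.raw.map(_ / cen.weight), cen.weight, None)
    def centerMoved(v: P, w: C): Boolean =
      distance(v, w, Double.MaxValue) > 1e-8
  }
}
```

The upperBound parameter is what makes pruning optimizations (e.g. skipping centers that cannot beat the current best) possible without the generic K-Means loop knowing anything about the distance function.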
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325 ] Derrick Burns edited comment on SPARK-3219 at 9/15/14 7:26 PM:
---
Great! You can find my work here: https://github.com/derrickburns/generalized-kmeans-clustering.git. I should warn you that I rewrote much of the original Spark clusterer, because the original is too tightly coupled to the Euclidean norm and does not allow one to identify efficiently which points belong to which clusters. I have tested this version extensively.

You will notice a package called com.rincaro.clusterer.metrics. Please take a look at the two files EuOps.scala and FastEuclideanOps.scala. They both implement the Euclidean norm; however, one is much faster than the other because it uses the same algebraic transformations that the Spark version uses. This demonstrates that it is possible to be efficient without being tightly coupled. One could easily re-implement FastEuclideanOps using Breeze or BLAS without affecting the core K-Means implementation.

Not included in this project is another distance function that I have implemented: the Kullback-Leibler distance function, a.k.a. relative entropy. In my implementation, I also perform algebraic transformations to expedite the computation, resulting in a distance computation that is even faster than the fast Euclidean norm.

Let me know if this is useful to you.
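The algebraic transformations referred to above can be sketched as follows (hypothetical names, not the repository's actual code). For the Euclidean case, ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, so the squared norms can be cached per point and per center, reducing each distance to one dot product. For Kullback-Leibler, KL(p||q) = sum_i p_i log p_i - sum_i p_i log q_i, so the entropy term can be cached per point and log q_i per center.

```scala
// Hypothetical sketch of the cached-precomputation trick for fast distances.
object FastDistanceSketch {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Euclidean: cache ||v||^2 once when the point or center is constructed.
  final case class PrecomputedPoint(raw: Array[Double], normSq: Double)
  def precompute(v: Array[Double]): PrecomputedPoint =
    PrecomputedPoint(v, dot(v, v))

  def fastDistSq(p: PrecomputedPoint, c: PrecomputedPoint): Double =
    // Clamp tiny negative values caused by floating-point round-off.
    math.max(0.0, p.normSq - 2.0 * dot(p.raw, c.raw) + c.normSq)

  // Kullback-Leibler: cache sum_i p_i log p_i per point (zero terms dropped)
  // and log q_i per center, so each distance is again one dot product.
  final case class KLPoint(raw: Array[Double], entropyTerm: Double)
  def precomputeKL(p: Array[Double]): KLPoint =
    KLPoint(p, p.filter(_ > 0).map(x => x * math.log(x)).sum)

  final case class KLCenter(logRaw: Array[Double])
  def precomputeKLCenter(q: Array[Double]): KLCenter = KLCenter(q.map(math.log))

  def kl(p: KLPoint, c: KLCenter): Double = p.entropyTerm - dot(p.raw, c.logRaw)
}
```

Because the caches live in the point and center representations (the P and C types of the earlier comment), the generic K-Means loop never needs to know which transformation was applied.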