[jira] [Created] (SPARK-10654) Add columnSimilarities to IndexedRowMatrix
Reza Zadeh created SPARK-10654: -- Summary: Add columnSimilarities to IndexedRowMatrix Key: SPARK-10654 URL: https://issues.apache.org/jira/browse/SPARK-10654 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Add columnSimilarities to IndexedRowMatrix. rowSimilarities for IndexedRowMatrix is being added in a separate JIRA, tracked by SPARK-4823.
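A minimal sketch of one way to do this, assuming nothing beyond the existing MLlib API: cosine similarities between columns do not depend on row indices, so an IndexedRowMatrix can drop its indices and delegate to the RowMatrix method.
{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix}

// Sketch: column similarities ignore row indices entirely, so delegating to
// the existing RowMatrix implementation is sufficient.
def columnSimilarities(mat: IndexedRowMatrix): CoordinateMatrix =
  mat.toRowMatrix().columnSimilarities()
{code}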
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511621#comment-14511621 ] Reza Zadeh commented on SPARK-4590: --- I am resolving this ticket as it has served its purpose. We can continue the discussion in two places: 1) SPARK-6932, a prototype parameter server; 2) SPARK-6567, to avoid using parameter servers for linear models and use joins instead. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Resolved] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh resolved SPARK-4590. --- Resolution: Fixed Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
[ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504583#comment-14504583 ] Reza Zadeh commented on SPARK-6567: --- Hi Hucheng, Yes, this is what I have in mind. Do you have a sample implementation that is public? Thanks, Reza Large linear model parallelism via a join and reduceByKey - Key: SPARK-6567 URL: https://issues.apache.org/jira/browse/SPARK-6567 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh Attachments: model-parallelism.pptx To train a linear model, each training point in the training set needs its dot product computed against the model, per iteration. If the model is large (too large to fit in memory on a single machine), then SPARK-4590 proposes using a parameter server. There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products. This obviates the need for a parameter server, at least for linear models. However, it's unclear how it compares performance-wise to parameter servers.
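A minimal sketch of the join + reduceByKey idea, under assumed representations (the ticket does not pin down the block layout or the model slicing): data blocks are keyed by (blockRow, blockCol), the model is an RDD of per-column-block weight slices, and each block only needs the slice matching its columns.
{code}
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Sketch, assuming these representations:
//   blocks: ((blockRow, blockCol), localMatrix) pieces of the data matrix
//   model:  (blockCol, weightSlice) pieces of the model vector
def blockDotProducts(
    blocks: RDD[((Int, Int), DenseMatrix[Double])],
    model: RDD[(Int, DenseVector[Double])]): RDD[(Int, DenseVector[Double])] = {
  blocks
    .map { case ((br, bc), m) => (bc, (br, m)) }   // key each block by its column-block id
    .join(model)                                   // attach the matching model slice
    .map { case (_, ((br, m), w)) => (br, m * w) } // partial dot products for this block
    .reduceByKey(_ + _)                            // sum partials across column blocks
}
{code}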
[jira] [Comment Edited] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
[ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504583#comment-14504583 ] Reza Zadeh edited comment on SPARK-6567 at 4/21/15 8:32 AM: Hi Hucheng, Yes, this is what I have in mind. Do you have a sample implementation that is public? Are you interested in working on this? Thanks, Reza was (Author: rezazadeh): Hi Hucheng, Yes this is what I have in mind. Do you have a sample implementation that is public? Thanks, Reza Large linear model parallelism via a join and reduceByKey - Key: SPARK-6567 URL: https://issues.apache.org/jira/browse/SPARK-6567 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh Attachments: model-parallelism.pptx To train a linear model, each training point in the training set needs its dot product computed against the model, per iteration. If the model is large (too large to fit in memory on a single machine), then SPARK-4590 proposes using a parameter server. There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products. This obviates the need for a parameter server, at least for linear models. However, it's unclear how it compares performance-wise to parameter servers.
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500850#comment-14500850 ] Reza Zadeh commented on SPARK-6932: --- Hi Qiping, Thank you for making this very well-designed proposal. Because this is such a large change, I think it makes sense to release some kind of separate package first that we can test thoroughly before any code is committed to master. Also, we've decided to first try implementing linear models without parameter servers, as in SPARK-6567. However, we still need a key-value store in Spark, so this proposal is helpful for that. If you start building this proposal in a separate package, we can work concurrently. Your proposal generalizes to a key-value store with limited fault-tolerance (i.e. just checkpointing). This is definitely a useful abstraction. Do you think you could turn it into a package first? Best, Reza A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib, Spark Core Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf] paper, we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User-friendly interface* Most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], and [paracel|https://github.com/douban/paracel]) were carefully investigated, and a user-friendly interface was designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexedRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for the distributed array, because the number of key updates per second it can sustain is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort went into avoiding changes to Spark core. Tasks that need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server through a client object, over *akka* or *netty*. With all these concerns, we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in an RDD and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes the updated parameters back to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server.
*`PSClient` provides the following interface for workers to use:*
{code}
// get the parameter indexed by `key` from the parameter server
def get[T](key: String): T

// get multiple parameters from the parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add `delta` to the parameter indexed by `key`;
// if there are multiple `delta`s to apply to the same parameter,
// use `reduceFunc` to reduce these `delta`s first.
def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit

// update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit

// advance the clock to indicate that the current iteration is finished.
def clock(): Unit

// block until all workers have reached this line of code.
def sync(): Unit
{code}
*`PSContext` provides the following functions to use on the driver:*
{code}
// load parameters from an existing RDD.
def loadPSModel[T](model: RDD[(String, T)])

// fetch parameters from the parameter server to construct a model.
def fetchPSModel[T](keys: Array[String]): Array[T]
{code}
*A new function has been added to `RDD` to run parameter server tasks:*
{code}
// run the provided `func`
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497511#comment-14497511 ] Reza Zadeh commented on SPARK-4590: --- I agree IndexedRDD is not the best way forward, as it won't have the desired throughput. It is worth mentioning that, at least for linear models, we have figured out how to train without the need for a parameter server; see SPARK-6567. We are currently leaning towards doing option (1) with some default implementation, but first we should evaluate how far we can get without parameter servers. Most of our needs could be satisfied with SPARK-6567, at a fraction of the infrastructure-building cost. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497520#comment-14497520 ] Reza Zadeh commented on SPARK-6932: --- Hi Qiping, thank you for this proposal. I agree that using IndexedRDD won't have the required throughput. However, we should evaluate SPARK-6567 concurrently, as it may satisfy many of our large-model needs without requiring a full-blown parameter server. I will evaluate this particular interface and will update again on Friday. A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf] paper, we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User-friendly interface* Most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], and [paracel|https://github.com/douban/paracel]) were carefully investigated, and a user-friendly interface was designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexedRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for the distributed array, because the number of key updates per second it can sustain is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort went into avoiding changes to Spark core. Tasks that need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server through a client object, over *akka* or *netty*. With all these concerns, we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in an RDD and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes the updated parameters back to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:*
{code}
// get the parameter indexed by `key` from the parameter server
def get[T](key: String): T

// get multiple parameters from the parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add `delta` to the parameter indexed by `key`;
// if there are multiple `delta`s to apply to the same parameter,
// use `reduceFunc` to reduce these `delta`s first.
def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit

// update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit

// advance the clock to indicate that the current iteration is finished.
def clock(): Unit

// block until all workers have reached this line of code.
def sync(): Unit
{code}
*`PSContext` provides the following functions to use on the driver:*
{code}
// load parameters from an existing RDD.
def loadPSModel[T](model: RDD[(String, T)])

// fetch parameters from the parameter server to construct a model.
def fetchPSModel[T](keys: Array[String]): Array[T]
{code}
*A new function has been added to `RDD` to run parameter server tasks:*
{code}
// run the provided `func` on each partition of this RDD.
// This function can use the data of this partition (the first argument)
// and a parameter server client (the second argument).
// See the following Logistic Regression for an example.
def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]
{code}
h2. Example Here is an example of using our prototype to implement logistic regression:
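The logistic regression example itself is truncated in this digest. As an illustration only, here is a hypothetical sketch of how the `runWithPS` interface quoted above might be used; the key name "w", the gradient code, and `addArrays` are assumptions, not part of the proposal.
{code}
// Hypothetical sketch only; the proposal's own example is cut off above.
// Assumes the PSClient/runWithPS signatures quoted in the description.
def addArrays(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }

for (iter <- 1 to numIterations) {
  trainingData.runWithPS { (points: Array[LabeledPoint], ps: PSClient) =>
    val w = ps.get[Array[Double]]("w")                 // pull the current model
    val grad = new Array[Double](w.length)
    for (p <- points) {                                // local logistic gradient
      val margin = p.features.toArray.zip(w).map { case (x, wi) => x * wi }.sum
      val mult = 1.0 / (1.0 + math.exp(-margin)) - p.label
      for (i <- w.indices) grad(i) += mult * p.features(i)
    }
    ps.update("w", grad, addArrays)                    // push the delta; server reduces
    ps.clock()                                         // mark this iteration finished
  }
}
{code}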
[jira] [Updated] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill
[ https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-6713: -- Description: We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This matters for a large, dense column: with iterators, Spark can spill the pairs onto disk instead of building all the pairs in memory before handing them to Spark. (was: We should use Iterators in columnSimilarities to allow flatMap to spill to disk. This could happen in a dense and large column - this way Spark can spill the pairs onto disk instead of building all the pairs before handing them to Spark.) Iterators in columnSimilarities to allow flatMap spill -- Key: SPARK-6713 URL: https://issues.apache.org/jira/browse/SPARK-6713 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This matters for a large, dense column: with iterators, Spark can spill the pairs onto disk instead of building all the pairs in memory before handing them to Spark.
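A sketch of the change in spirit (simplified, and omitting DIMSUM's sampling): produce each row's pair contributions as a lazy Iterator, so mapPartitionsWithIndex hands Spark something it can spill rather than a fully materialized buffer of all pairs for a dense row.
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Simplified, unsampled sketch: yield ((i, j), a_i * a_j) contributions
// lazily instead of collecting them into an in-memory buffer per row.
def rawColumnProducts(rows: RDD[Vector]): RDD[((Int, Int), Double)] =
  rows.mapPartitionsWithIndex { (_, iter) =>
    iter.flatMap { row =>
      val a = row.toArray
      for {
        i <- a.indices.iterator            // iterators keep the pair stream lazy
        j <- a.indices.iterator if j > i
        if a(i) != 0.0 && a(j) != 0.0
      } yield ((i, j), a(i) * a(j))
    }
  }
{code}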
[jira] [Created] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill
Reza Zadeh created SPARK-6713: - Summary: Iterators in columnSimilarities to allow flatMap spill Key: SPARK-6713 URL: https://issues.apache.org/jira/browse/SPARK-6713 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh We should use Iterators in columnSimilarities to allow flatMap to spill to disk. This matters for a large, dense column: with iterators, Spark can spill the pairs onto disk instead of building all the pairs in memory before handing them to Spark.
[jira] [Created] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
Reza Zadeh created SPARK-6567: - Summary: Large linear model parallelism via a join and reduceByKey Key: SPARK-6567 URL: https://issues.apache.org/jira/browse/SPARK-6567 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh To train a linear model, each training point in the training set needs its dot product computed against the model, per iteration. If the model is large (too large to fit in memory on a single machine), then SPARK-4590 proposes using a parameter server. There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products. This obviates the need for a parameter server, at least for linear models. However, it's unclear how it compares performance-wise to parameter servers.
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385032#comment-14385032 ] Reza Zadeh commented on SPARK-4590: --- The umbrella JIRA for IndexedRDD is at SPARK-2365. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365581#comment-14365581 ] Reza Zadeh commented on SPARK-4590: --- Hi Qiping, We are waiting for IndexedRDD to be optimized and merged into Spark master. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Updated] (SPARK-6346) Use faster converging optimization method in MLlib
[ https://issues.apache.org/jira/browse/SPARK-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-6346: -- Description: According to experiments in SPARK-1503, the LBFGS algorithm converges much faster than our current proximal gradient, which is used throughout MLlib. This ticket is to track replacing slower-converging algorithms with faster ones, e.g. LBFGS. This needs unification of the Optimization interface. For example, the LBFGS implementation should not know about RDDs. was: According to experiments in SPARK-1503, the LBFGS algorithm converges much faster than our current proximal gradient, which is used throughout MLlib. This ticket is to track replacing slower-converging algorithms, with faster components e.g. LBFGS Use faster converging optimization method in MLlib -- Key: SPARK-6346 URL: https://issues.apache.org/jira/browse/SPARK-6346 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh According to experiments in SPARK-1503, the LBFGS algorithm converges much faster than our current proximal gradient, which is used throughout MLlib. This ticket is to track replacing slower-converging algorithms with faster ones, e.g. LBFGS. This needs unification of the Optimization interface. For example, the LBFGS implementation should not know about RDDs.
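A hypothetical sketch of what "the LBFGS implementation should not know about RDDs" could look like: the optimizer sees only a (value, gradient) oracle, and the RDD-backed aggregation hides behind that function. This is illustrative, not MLlib's actual Optimizer trait.
{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical interface sketch: data access (RDD or local) is hidden behind
// the objective function, so the optimizer itself is storage-agnostic.
trait ConvexOptimizer {
  /** Minimize f given a (value, gradient) oracle and a starting point. */
  def minimize(objective: Vector => (Double, Vector), initial: Vector): Vector
}
{code}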
[jira] [Created] (SPARK-6346) Use faster converging optimization method in MLlib
Reza Zadeh created SPARK-6346: - Summary: Use faster converging optimization method in MLlib Key: SPARK-6346 URL: https://issues.apache.org/jira/browse/SPARK-6346 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Reza Zadeh According to experiments in SPARK-1503, the LBFGS algorithm converges much faster than our current proximal gradient, which is used throughout MLlib. This ticket is to track replacing slower-converging algorithms with faster ones, e.g. LBFGS.
[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14300848#comment-14300848 ] Reza Zadeh commented on SPARK-4981: --- Another option: see slide 31 to solve the problem using IndexedRDDs, thanks to Ankur's nice slides and work on IndexedRDD: https://issues.apache.org/jira/secure/attachment/12656374/2014-07-07-IndexedRDD-design-review.pdf Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: Fast low-rank modifications of the thin singular value decomposition by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
[jira] [Issue Comment Deleted] (SPARK-4981) Add a streaming singular value decomposition
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-4981: -- Comment: was deleted (was: Another option: see slide 31 to solve the problem using IndexedRDDs, thanks to Ankur's nice slides and work on IndexedRDD: https://issues.apache.org/jira/secure/attachment/12656374/2014-07-07-IndexedRDD-design-review.pdf) Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: Fast low-rank modifications of the thin singular value decomposition by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299666#comment-14299666 ] Reza Zadeh commented on SPARK-4981: --- To be model-parallel, we can simply warm-start the current ALS implementation in org.apache.spark.mllib.recommendation. The work involved would be to expose a warm-start option in ALS, and then redo training with, say, 2 iterations instead of 10 on each batch of RDDs. The stream would be over batches of Ratings. This should be the simplest option. Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: Fast low-rank modifications of the thin singular value decomposition by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
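A hypothetical sketch of that loop. MLlib's ALS had no warm-start argument at the time, so `trainWithInitialModel` below is the assumed new option this comment proposes to expose; everything else uses the existing streaming and ALS APIs.
{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.streaming.dstream.DStream

// Sketch: full training on the first batch, then cheap warm-started passes.
// trainWithInitialModel is the assumed (not yet existing) warm-start hook.
def streamingALS(ratings: DStream[Rating], rank: Int, lambda: Double): Unit = {
  var model: Option[MatrixFactorizationModel] = None
  ratings.foreachRDD { batch =>
    model = Some(model match {
      case None    => ALS.train(batch, rank, 10, lambda)                   // cold start
      case Some(m) => ALS.trainWithInitialModel(batch, rank, 2, lambda, m) // warm start
    })
  }
}
{code}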
[jira] [Created] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
Reza Zadeh created SPARK-5301: - Summary: Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh
[jira] [Updated] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-5301: -- Description: 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added)
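A minimal sketch of both utilities, assuming only the existing distributed-matrix classes. Transposing a CoordinateMatrix is just an index swap on its entries, which is why the description calls it cheap; the conversion flattens each indexed row into (i, j, value) entries.
{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// 1) Transpose: swap the row and column index of every entry.
def transpose(mat: CoordinateMatrix): CoordinateMatrix =
  new CoordinateMatrix(
    mat.entries.map(e => MatrixEntry(e.j, e.i, e.value)),
    mat.numCols(), mat.numRows())

// 2) Conversion: flatten each indexed row into its nonzero entries.
def toCoordinateMatrix(mat: IndexedRowMatrix): CoordinateMatrix =
  new CoordinateMatrix(
    mat.rows.flatMap { row =>
      row.vector.toArray.zipWithIndex.collect {
        case (v, j) if v != 0.0 => MatrixEntry(row.index, j, v)
      }
    },
    mat.numRows(), mat.numCols())
{code}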
[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259789#comment-14259789 ] Reza Zadeh commented on SPARK-4981: --- We could do matrix completion (least squares objective, regularized; note that this is not SVD) in a streaming fashion using Stochastic Gradient Descent. See the update equations in Algorithm 1: http://stanford.edu/~rezab/papers/factorbird.pdf The stream is over individual entries (as opposed to a whole row/column). We should probably do streaming matrix completion before streaming SVD. Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: Fast low-rank modifications of the thin singular value decomposition by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
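For concreteness, the per-entry update being referenced is the textbook SGD step for regularized least-squares matrix completion (the form used in Algorithm 1 of the Factorbird paper linked above). This sketch is illustrative; eta is the learning rate and lambda the regularization weight.
{code}
import breeze.linalg.DenseVector

// One SGD step for an observed entry (i, j) with value `rating`:
// u is the row factor, v the column factor.
def sgdStep(u: DenseVector[Double], v: DenseVector[Double],
            rating: Double, eta: Double, lambda: Double)
    : (DenseVector[Double], DenseVector[Double]) = {
  val err = rating - (u dot v)                 // prediction error on this entry
  val uNew = u + (v * err - u * lambda) * eta  // gradient step for the row factor
  val vNew = v + (u * err - v * lambda) * eta  // uses the *old* u, as in SGD
  (uNew, vNew)
}
{code}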
[jira] [Comment Edited] (SPARK-4981) Add a streaming singular value decomposition
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259789#comment-14259789 ] Reza Zadeh edited comment on SPARK-4981 at 12/29/14 2:04 AM: - We could do matrix completion (least squares objective, regularized; note that this is not SVD) in a streaming fashion using Stochastic Gradient Descent. See the update equations in Algorithm 1: http://stanford.edu/~rezab/papers/factorbird.pdf The stream is over individual entries (as opposed to a whole row/column). We should probably do streaming matrix completion before streaming SVD. was (Author: rezazadeh): We could do matrix completion (least squares objective, reqularized, note that this is not SVD) in a streaming fashion using Stochastic Gradient Descent. See the update equations in Algorithm 1: http://stanford.edu/~rezab/papers/factorbird.pdf The stream is over individual entries (as opposed a whole row/column). We should probably do streaming matrix completion before streaming SVD. Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: Fast low-rank modifications of the thin singular value decomposition by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255308#comment-14255308 ] Reza Zadeh commented on SPARK-4590: --- Here is an initial review of current systems and two ways to go forward: https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243133#comment-14243133 ] Reza Zadeh commented on SPARK-4823: --- Given that we're talking about RowMatrices, computing rowSimilarities the same way as columnSimilarities would require transposing the matrix, which is dangerous when the original matrix has many rows. RowMatrix assumes that a single row fits in memory on a single machine, and this might no longer hold after transposing a RowMatrix. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12.
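Written out, the naive reduction this comment warns against looks as follows: row similarities of A are column similarities of A^T, so one could transpose and reuse columnSimilarities. The transpose is the dangerous step, since a dense column of A becomes a row of A^T that must fit in memory on one machine. `toCoordinateMatrix` and `transpose` here are the SPARK-5301 utilities sketched above, not existing API.
{code}
// Naive approach (what the comment warns against):
// rowSimilarities(A) == columnSimilarities(A^T). The intermediate transposed
// matrix may have rows too large to fit in memory on a single machine.
val rowSims = transpose(toCoordinateMatrix(indexedMat)) // sketched under SPARK-5301 above
  .toIndexedRowMatrix()
  .toRowMatrix()
  .columnSimilarities()
{code}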
[jira] [Created] (SPARK-4823) rowSimilarities
Reza Zadeh created SPARK-4823: - Summary: rowSimilarities Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12.
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227023#comment-14227023 ] Reza Zadeh commented on SPARK-1503: --- Thanks Aaron. From an implementation perspective, it's probably easier to implement a constant step size first. From there you can see if there is any finicky behavior and compare against the unaccelerated proximal gradient already in Spark. If it works well enough, we should commit the first PR without backtracking and then experiment with backtracking; otherwise, if you see strange behavior, you can decide whether backtracking would solve it. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent, but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives.
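For reference, the constant-step-size variant being suggested is the textbook form of Nesterov's accelerated gradient (stated here from standard sources, not from TFOCS):
{code}
% Nesterov's accelerated gradient with constant step size t:
y_k     = x_k + \tfrac{k-1}{k+2}\,(x_k - x_{k-1})   % momentum/extrapolation step
x_{k+1} = y_k - t\,\nabla f(y_k)                    % gradient step at y_k
% Plain (unaccelerated) descent drops the momentum term:
x_{k+1} = x_k - t\,\nabla f(x_k)
{code}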
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224117#comment-14224117 ] Reza Zadeh commented on SPARK-4590: --- Some starting points: - http://stanford.edu/~rezab/papers/factorbird.pdf - http://parameterserver.org/ More detailed comparisons coming. Early investigation of parameter server --- Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithm, infrastructure, and dependencies.
[jira] [Commented] (SPARK-3254) Streaming K-Means
[ https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178825#comment-14178825 ] Reza Zadeh commented on SPARK-3254: --- Will this make it to 1.2? Streaming K-Means - Key: SPARK-3254 URL: https://issues.apache.org/jira/browse/SPARK-3254 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Xiangrui Meng Assignee: Jeremy Freeman Streaming K-Means with proper decay settings.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175557#comment-14175557 ] Reza Zadeh commented on SPARK-3434: --- Thanks Shivaram! As discussed over the phone, we will use your design and build upon it, so that you can focus on the linear algebraic operations such as TSQR. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Shivaram Venkataraman This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or many RDDs, each containing only one sub-matrix?
[jira] [Created] (SPARK-3974) Block matrix abstractions and partitioners
Reza Zadeh created SPARK-3974: - Summary: Block matrix abstractions and partitioners Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required.
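A minimal sketch of the kind of partitioner being asked for, assuming blocks are keyed by (blockRow, blockCol) pairs; the layout policy here (row-major flattening folded onto a fixed partition count) is illustrative only.
{code}
import org.apache.spark.Partitioner

// Illustrative grid partitioner over (blockRow, blockCol) keys.
class GridPartitioner(numColBlocks: Int, parts: Int) extends Partitioner {
  require(numColBlocks > 0 && parts > 0)
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key match {
    case (i: Int, j: Int) =>
      (i * numColBlocks + j) % parts  // flatten the 2-D block coordinate
    case _ =>
      throw new IllegalArgumentException(s"Unrecognized block key: $key")
  }
}
{code}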
[jira] [Created] (SPARK-3975) Block Matrix addition and multiplication
Reza Zadeh created SPARK-3975: - Summary: Block Matrix addition and multiplication Key: SPARK-3975 URL: https://issues.apache.org/jira/browse/SPARK-3975 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Block matrix addition and multiplication, for the case when partitioning schemes match.
[jira] [Created] (SPARK-3976) Detect block matrix partitioning schemes
Reza Zadeh created SPARK-3976: - Summary: Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that a matrix can be repartitioned for add/multiply with a non-identically partitioned matrix.
[jira] [Created] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
Reza Zadeh created SPARK-3977: - Summary: Conversions between {Row, Coordinate}Matrix <-> BlockMatrix Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3974: -- Component/s: MLlib Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required.
[jira] [Updated] (SPARK-3976) Detect block matrix partitioning schemes
[ https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3976: -- Component/s: MLlib Detect block matrix partitioning schemes Key: SPARK-3976 URL: https://issues.apache.org/jira/browse/SPARK-3976 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Provide repartitioning methods for block matrices, so that a matrix can be repartitioned for add/multiply with a non-identically partitioned matrix.
[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3977: -- Component/s: MLlib Conversions between {Row, Coordinate}Matrix <-> BlockMatrix --- Key: SPARK-3977 URL: https://issues.apache.org/jira/browse/SPARK-3977 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix
[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162814#comment-14162814 ] Reza Zadeh commented on SPARK-3820: --- I will do an experiment to see if the random number generation is adding significant overhead, and if it is, add a flag to avoid it when a threshold of zero is given. Specialize columnSimilarity() without any threshold --- Key: SPARK-3820 URL: https://issues.apache.org/jira/browse/SPARK-3820 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to compute the exact cosine similarities. It still requires sampling, which is unnecessary for this case. We should have a specialized version for it, in order to have a fair comparison with DIMSUM.
[jira] [Issue Comment Deleted] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3820: -- Comment: was deleted (was: See previous comment on resolution.) Specialize columnSimilarity() without any threshold --- Key: SPARK-3820 URL: https://issues.apache.org/jira/browse/SPARK-3820 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to compute the exact cosine similarities. It still requires sampling, which is unnecessary for this case. We should have a specialized version for it, in order to have a fair comparison with DIMSUM.
[jira] [Resolved] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh resolved SPARK-3820. --- Resolution: Fixed See previous comment on resolution. Specialize columnSimilarity() without any threshold --- Key: SPARK-3820 URL: https://issues.apache.org/jira/browse/SPARK-3820 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to compute the exact cosine similarities. It still requires sampling, which is unnecessary for this case. We should have a specialized version for it, in order to have a fair comparison with DIMSUM.
[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162899#comment-14162899 ] Reza Zadeh commented on SPARK-3820: --- I ran columnSimilarities(0.0) with the random number generation commented out and then uncommented, and didn't observe any difference in the completion time of the mapPartitionsWithIndex stage. Specialize columnSimilarity() without any threshold --- Key: SPARK-3820 URL: https://issues.apache.org/jira/browse/SPARK-3820 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to compute the exact cosine similarities. It still requires sampling, which is unnecessary for this case. We should have a specialized version for it, in order to have a fair comparison with DIMSUM.
[jira] [Created] (SPARK-3790) CosineSimilarity via DIMSUM example
Reza Zadeh created SPARK-3790: - Summary: CosineSimilarity via DIMSUM example Key: SPARK-3790 URL: https://issues.apache.org/jira/browse/SPARK-3790 Project: Spark Issue Type: Improvement Reporter: Reza Zadeh Create an example that reports the approximation error of DIMSUM on an arbitrary RowMatrix given via the command line. PR tracking this: https://github.com/apache/spark/pull/2622
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158949#comment-14158949 ] Reza Zadeh commented on SPARK-3434: --- Any updates, Shivaram? Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or many RDDs, each containing only one sub-matrix?
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152275#comment-14152275 ] Reza Zadeh commented on SPARK-3434: --- It looks like Shivaram Venkataraman from the AMPlab has started work on this. I will be meeting with him to see if we can reuse some of his work. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or many RDDs, each containing only one sub-matrix?
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118427#comment-14118427 ] Reza Zadeh commented on SPARK-2885: --- Hi Clive, Can you please post the code you used to generate the error? I am having trouble reproducing this. There is a test for sparse vectors in the PR, which is not catching this. Reza All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically. For example, for a million vectors, it is not feasible to check all of the roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy. Raising gamma from 1 toward its largest magnitude moves the tradeoff from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - The algorithm is used in large-scale production at Twitter, e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf
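Usage of the method the PR adds to RowMatrix: the gamma knob surfaces as a similarity threshold, where calling the no-argument version computes exact cosine similarities and larger thresholds trade accuracy for less computation. `sc` is an existing SparkContext.
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0)))
val mat = new RowMatrix(rows)

val exact  = mat.columnSimilarities()     // brute force, no sampling
val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling, threshold 0.1
{code}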
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118445#comment-14118445 ] Reza Zadeh commented on SPARK-2885: --- I don't see your code. Looking at the exception, however: when generating the random sparse vectors, did you remember to sort them by index? This is a requirement of Breeze, and it looks like your example violates it: http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For a correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf
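To make the fix concrete, a small sketch of building MLlib sparse vectors with sorted indices (the attached Java test itself is not reproduced here; this uses the standard mllib Vectors factory, and the assumption that the Seq-based factory sorts its entries is noted in the comments):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors

// Wrong: indices out of order. Breeze-backed operations on such a vector
// can throw, because breeze.linalg.SparseVector assumes ascending indices.
// val bad = Vectors.sparse(5, Array(3, 1), Array(0.5, 2.0))

// Fix 1: sort the (index, value) pairs before constructing the vector.
val pairs  = Array((3, 0.5), (1, 2.0)).sortBy(_._1)
val sorted = Vectors.sparse(5, pairs.map(_._1), pairs.map(_._2))

// Fix 2: the Seq[(Int, Double)] factory sorts the entries for you.
val alsoSorted = Vectors.sparse(5, Seq((3, 0.5), (1, 2.0)))
{code}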
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464 ] Reza Zadeh commented on SPARK-2885: --- Yes, looking at your code, it looks like you need to sort the indices.

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For a correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf
[jira] [Comment Edited] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464 ] Reza Zadeh edited comment on SPARK-2885 at 9/2/14 6:20 PM: --- Yes, looking at your code, it looks like you need to sort by indices.

was (Author: rezazadeh): Yes, looking at your code, it looks like you need to sort the indices.

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For a correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-2885: -- Description: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For a correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf

was: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible.
[jira] [Created] (SPARK-2885) All-pairs similarity via DIMSUM
Reza Zadeh created SPARK-2885: - Summary: All-pairs similarity via DIMSUM Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. Current PR for this is WIP: https://github.com/apache/spark/pull/1778
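To make the quadratic blow-up concrete, a quick back-of-the-envelope count; the per-core throughput figure below is an illustrative assumption, not a benchmark:

{code:scala}
// Unordered pairs among n vectors: n * (n - 1) / 2.
val n = 1000000L
val pairs = n * (n - 1) / 2            // 499,999,500,000 pairs
// At an assumed 10 million dot products per second on a single core,
// the brute-force scan alone takes pairs / 1e7 seconds:
val hours = pairs / 1e7 / 3600         // roughly 13.9 hours, before any I/O
println(f"$pairs%,d pairs, ~$hours%.1f core-hours")
{code}

DIMSUM's sampling avoids materializing most of those pairs when only similarities above a threshold are of interest.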
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-2885: -- Description: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. Current PR: https://github.com/apache/spark/pull/1778

was: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. Current PR for this is WIP: https://github.com/apache/spark/pull/1778

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. Current PR: https://github.com/apache/spark/pull/1778
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-2885: -- Description: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method

was: Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. Current PR: https://github.com/apache/spark/pull/1778

All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh

Build an all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vectors, the all-pairs similarity problem is to find all pairs of vectors that are similar according to a similarity function, such as cosine similarity, and a given similarity score threshold. This problem is sometimes called a "similarity join". The brute-force approach of considering all pairs quickly breaks down, since it scales quadratically: for a million vectors, it is not feasible to check all of the roughly one trillion pairs to see whether they are above the similarity threshold. Having said that, there exist clever sampling techniques that focus the computational effort on the pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to trade off computation time against accuracy: setting gamma anywhere from 1 up to a very large value moves the tradeoff from low computation to high accuracy, and for a sufficiently large gamma all cosine similarities are computed exactly, with no sampling. Current PR: https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825)
- The algorithm is used in large-scale production at Twitter; see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in Scalding: https://github.com/twitter/scalding/pull/833
- When the gamma parameter is set high, the algorithm computes the normalized Gramian matrix, which is useful in RowMatrix alongside the existing computeGramianMatrix method
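Since the last bullet ties DIMSUM to the Gramian, here is a small sketch checking that relationship numerically: the cosine similarity of columns i and j is the Gramian entry G(i, j) divided by the two column norms, and the diagonal of G supplies the squared norms. This assumes the exact (no-argument) columnSimilarities variant; treat it as an illustration, not a test from the PR.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc = new SparkContext(new SparkConf().setAppName("gramCheck").setMaster("local[2]"))

val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 0.0),
  Vectors.dense(0.0, 1.0, 3.0))))

// G = A^T A; its diagonal holds the squared column norms.
val gram  = mat.computeGramianMatrix()
val norms = (0 until gram.numCols).map(i => math.sqrt(gram(i, i)))

// Exact column similarities should equal the normalized Gramian entries.
mat.columnSimilarities().entries.collect().foreach { e =>
  val expected = gram(e.i.toInt, e.j.toInt) / (norms(e.i.toInt) * norms(e.j.toInt))
  assert(math.abs(e.value - expected) < 1e-9)
}
{code}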