[jira] [Created] (SPARK-10654) Add columnSimilarities to IndexedRowMatrix

2015-09-16 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-10654:
--

 Summary: Add columnSimilarities to IndexedRowMatrix
 Key: SPARK-10654
 URL: https://issues.apache.org/jira/browse/SPARK-10654
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh


Add columnSimilarities to IndexedRowMatrix.

A related JIRA, SPARK-4823, tracks adding rowSimilarities to IndexedRowMatrix.
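
Until that is added, a minimal workaround sketch (reusing the existing RowMatrix implementation, which does not need row indices for column similarities):

{code}
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

// Hedged sketch: drop the row indices and reuse RowMatrix.columnSimilarities,
// since column similarities do not depend on row ordering or indices.
def columnSimilarities(mat: IndexedRowMatrix) =
  mat.toRowMatrix().columnSimilarities()
{code}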






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-04-24 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511621#comment-14511621
 ] 

Reza Zadeh commented on SPARK-4590:
---

I am resolving this ticket as it has served its purpose. We can continue the 
discussion in two places:

1) SPARK-6932, a prototype parameter server
2) SPARK-6567, which avoids parameter servers for linear models by using joins 
instead

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Resolved] (SPARK-4590) Early investigation of parameter server

2015-04-24 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh resolved SPARK-4590.
---
Resolution: Fixed

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2015-04-21 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504583#comment-14504583
 ] 

Reza Zadeh commented on SPARK-6567:
---

Hi Hucheng,
Yes this is what I have in mind. Do you have a sample implementation that is 
public?
Thanks,
Reza

 Large linear model parallelism via a join and reduceByKey
 -

 Key: SPARK-6567
 URL: https://issues.apache.org/jira/browse/SPARK-6567
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh
 Attachments: model-parallelism.pptx


 To train a linear model, each training point in the training set needs its 
 dot product computed against the model, per iteration. If the model is large 
 (too large to fit in memory on a single machine) then SPARK-4590 proposes 
 using a parameter server.
 There is an easier way to achieve this without parameter servers. In 
 particular, if the data is held as a BlockMatrix and the model as an RDD, 
 then each block can be joined with the relevant part of the model, followed 
 by a reduceByKey to compute the dot products.
 This obviates the need for a parameter server, at least for linear models. 
 However, it's unclear how it compares performance-wise to parameter servers.






[jira] [Comment Edited] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2015-04-21 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504583#comment-14504583
 ] 

Reza Zadeh edited comment on SPARK-6567 at 4/21/15 8:32 AM:


Hi Hucheng,
Yes this is what I have in mind. Do you have a sample implementation that is 
public? Are you interested in working on this?
Thanks,
Reza


was (Author: rezazadeh):
Hi Hucheng,
Yes this is what I have in mind. Do you have a sample implementation that is 
public?
Thanks,
Reza

 Large linear model parallelism via a join and reduceByKey
 -

 Key: SPARK-6567
 URL: https://issues.apache.org/jira/browse/SPARK-6567
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh
 Attachments: model-parallelism.pptx


 To train a linear model, each training point in the training set needs its 
 dot product computed against the model, per iteration. If the model is large 
 (too large to fit in memory on a single machine) then SPARK-4590 proposes 
 using a parameter server.
 There is an easier way to achieve this without parameter servers. In 
 particular, if the data is held as a BlockMatrix and the model as an RDD, 
 then each block can be joined with the relevant part of the model, followed 
 by a reduceByKey to compute the dot products.
 This obviates the need for a parameter server, at least for linear models. 
 However, it's unclear how it compares performance-wise to parameter servers.






[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-04-17 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500850#comment-14500850
 ] 

Reza Zadeh commented on SPARK-6932:
---

Hi Qiping,
Thank you for making this very well-designed proposal. 

Because this is such a large change, I think it makes sense to release some 
kind of separate package first that we can test thoroughly before any code is 
committed to master.

Also, we've decided to first try implementing linear models without parameter 
servers as in SPARK-6567. However, we still need a key-value store in Spark, so 
this proposal is helpful for that. If you start building this proposal in a 
separate package, we can work concurrently. 

Your proposal generalizes to a key-value store with limited fault-tolerance 
(i.e. just checkpointing). This is definitely a useful abstraction. Do you 
think you could turn it into a package first?

Best,
Reza

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, Spark Core
Reporter: Qiping Li

  h2. Introduction
 As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] 
 and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with 
 several key design concerns:
 * *User-friendly interface*
   We carefully investigated most existing parameter server systems (including 
 [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], 
 [paracel|https://github.com/douban/paracel]) and designed a user-friendly 
 interface by absorbing the essence of all these systems. 
 * *Prototype of a distributed array*
   IndexedRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) 
 doesn't seem to be a good option for a distributed array, because in most cases 
 the number of key updates per second is not very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended for better performance.
 
 * *Minimal code change*
   Quite a lot of effort goes into avoiding changes to Spark core. Tasks 
 that need the parameter server are still created and scheduled by Spark's 
 scheduler, and they communicate with the parameter server through a client 
 object, over *akka* or *netty*.
 With all these concerns, we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. 
 Finally, each worker pushes the updated parameters back to the parameter 
 server. Workers communicate with the parameter server through a parameter 
 server client, which is initialized in the `TaskContext` of each worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We referred to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 // get the parameter indexed by `key` from the parameter server
 def get[T](key: String): T
 
 // get multiple parameters from the parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 
 // add `delta` to the parameter indexed by `key`;
 // if there are multiple `delta`s to apply to the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 
 // update multiple parameters at the same time, using the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit
 
 // advance the clock to indicate that the current iteration is finished.
 def clock(): Unit
 
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
 *`PSContext` provides the following functions to use on the driver:*
 {code}
 // load parameters from an existing RDD.
 def loadPSModel[T](model: RDD[(String, T)]): Unit
 
 // fetch parameters from the parameter server to construct the model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been added to `RDD` to run parameter server tasks:*
 {code}
 // run the provided `func` 

[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-04-15 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497511#comment-14497511
 ] 

Reza Zadeh commented on SPARK-4590:
---

I agree IndexedRDD is not the best way forward as it won't have the desired 
throughput.

It is worth mentioning that, at least for linear models, we have figured out how 
to train without the need for a parameter server; see SPARK-6567.

We are currently leaning towards doing option (1) with some default 
implementation, but first we should evaluate how far we can get without 
parameter servers. Most of our needs could be satisfied with SPARK-6567, at a 
fraction of the infrastructure building cost.

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-04-15 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497520#comment-14497520
 ] 

Reza Zadeh commented on SPARK-6932:
---

Hi Qiping, thank you for this proposal. I agree that using IndexedRDD won't 
have the required throughput. However, we should evaluate SPARK-6567 
concurrently, as it may satisfy many of our large-model needs without requiring 
a full-blown parameter server. 

I will evaluate this particular interface and will update again on Friday.

 A Prototype of Parameter Server
 ---

 Key: SPARK-6932
 URL: https://issues.apache.org/jira/browse/SPARK-6932
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Qiping Li

 h2. Introduction
 As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
 very helpful to integrate a parameter server into Spark for machine learning 
 algorithms, especially for those with ultra-high-dimensional features. 
 After carefully studying the design doc of [Parameter 
 Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] 
 and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
 we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with 
 several key design concerns:
 * *User-friendly interface*
   We carefully investigated most existing parameter server systems (including 
 [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], 
 [paracel|https://github.com/douban/paracel]) and designed a user-friendly 
 interface by absorbing the essence of all these systems. 
 * *Prototype of a distributed array*
   IndexedRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) 
 doesn't seem to be a good option for a distributed array, because in most cases 
 the number of key updates per second is not very high. 
 So we implement a distributed HashMap to store the parameters, which can 
 be easily extended for better performance.
 
 * *Minimal code change*
   Quite a lot of effort goes into avoiding changes to Spark core. Tasks 
 that need the parameter server are still created and scheduled by Spark's 
 scheduler, and they communicate with the parameter server through a client 
 object, over *akka* or *netty*.
 With all these concerns, we propose the following architecture:
 h2. Architecture
 !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
 Data is stored in an RDD and partitioned across workers. During each 
 iteration, each worker gets parameters from the parameter server, then computes 
 new parameters based on the old parameters and the data in its partition. 
 Finally, each worker pushes the updated parameters back to the parameter 
 server. Workers communicate with the parameter server through a parameter 
 server client, which is initialized in the `TaskContext` of each worker.
 The current implementation is based on YARN cluster mode, 
 but it should not be a problem to transplant it to other modes. 
 h3. Interface
 We referred to existing parameter server systems (petuum, parameter server, 
 paracel) when designing the interface of the parameter server. 
 *`PSClient` provides the following interface for workers to use:*
 {code}
 // get the parameter indexed by `key` from the parameter server
 def get[T](key: String): T
 
 // get multiple parameters from the parameter server
 def multiGet[T](keys: Array[String]): Array[T]
 
 // add `delta` to the parameter indexed by `key`;
 // if there are multiple `delta`s to apply to the same parameter,
 // use `reduceFunc` to reduce these `delta`s first.
 def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
 
 // update multiple parameters at the same time, using the same `reduceFunc`.
 def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit
 
 // advance the clock to indicate that the current iteration is finished.
 def clock(): Unit
 
 // block until all workers have reached this line of code.
 def sync(): Unit
 {code}
 *`PSContext` provides the following functions to use on the driver:*
 {code}
 // load parameters from an existing RDD.
 def loadPSModel[T](model: RDD[(String, T)]): Unit
 
 // fetch parameters from the parameter server to construct the model.
 def fetchPSModel[T](keys: Array[String]): Array[T]
 {code} 
 
 *A new function has been added to `RDD` to run parameter server tasks:*
 {code}
 // run the provided `func` on each partition of this RDD.
 // The function can use the data of the partition (the first argument)
 // and a parameter server client (the second argument).
 // See the following logistic regression example.
 def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]

 {code}
 h2. Example
 Here is an example of using our prototype to implement logistic regression:
 

[jira] [Updated] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill

2015-04-05 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-6713:
--
Description: We should use Iterators in columnSimilarities to allow 
mapPartitionsWithIndex to spill to disk. This matters for a large, dense 
column: with an iterator, Spark can spill the generated pairs to disk instead 
of building all the pairs in memory first.  (was: We should use Iterators in 
columnSimilarities to allow flatMap to spill to disk. This could happen in a 
dense and large column - this way Spark can spill the pairs onto disk instead 
of building all the pairs before handing them to Spark.)

 Iterators in columnSimilarities to allow flatMap spill
 --

 Key: SPARK-6713
 URL: https://issues.apache.org/jira/browse/SPARK-6713
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex 
 to spill to disk. This matters for a large, dense column: with an iterator, 
 Spark can spill the generated pairs to disk instead of building all the pairs 
 in memory first.






[jira] [Created] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill

2015-04-05 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-6713:
-

 Summary: Iterators in columnSimilarities to allow flatMap spill
 Key: SPARK-6713
 URL: https://issues.apache.org/jira/browse/SPARK-6713
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh


We should use Iterators in columnSimilarities to allow flatMap to spill to 
disk. This matters for a large, dense column: with an iterator, Spark can 
spill the generated pairs to disk instead of building all the pairs in memory 
first.
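
The gist, as a hedged sketch (emitPairs is a hypothetical stand-in for the DIMSUM sampling logic, not the actual MLlib code): return a lazy iterator from mapPartitionsWithIndex instead of a materialized buffer, so the shuffle can spill pairs as they are produced.

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def simEntries(rows: RDD[Vector],
               emitPairs: Vector => Iterator[((Int, Int), Double)]) =
  rows.mapPartitionsWithIndex { (idx, iter) =>
    // Lazy: pairs are generated on demand, so Spark can spill them to disk
    // instead of holding the full pair list for a dense column in memory.
    iter.flatMap(emitPairs)
  }.reduceByKey(_ + _)
{code}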






[jira] [Created] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2015-03-27 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-6567:
-

 Summary: Large linear model parallelism via a join and reduceByKey
 Key: SPARK-6567
 URL: https://issues.apache.org/jira/browse/SPARK-6567
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh


To train a linear model, each training point in the training set needs its dot 
product computed against the model, per iteration. If the model is large (too 
large to fit in memory on a single machine) then SPARK-4590 proposes using 
a parameter server.

There is an easier way to achieve this without parameter servers. In 
particular, if the data is held as a BlockMatrix and the model as an RDD, then 
each block can be joined with the relevant part of the model, followed by a 
reduceByKey to compute the dot products.

This obviates the need for a parameter server, at least for linear models. 
However, it's unclear how it compares performance-wise to parameter servers.
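
As a rough illustration of the pattern, a hedged sketch (the block layout and the helper name blockDot are illustrative, not an MLlib API):

{code}
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// blocks: block (rowBlock, colBlock) of the training matrix;
// model: the slice of the weight vector for each column block.
// The join ships each model slice only to the blocks that need it, and
// reduceByKey sums the partial dot products per row block.
def blockDot(
    blocks: RDD[((Int, Int), DenseMatrix)],
    model: RDD[(Int, DenseVector)]): RDD[(Int, DenseVector)] = {
  blocks
    .map { case ((rowBlock, colBlock), m) => (colBlock, (rowBlock, m)) }
    .join(model)
    .map { case (_, ((rowBlock, m), v)) => (rowBlock, m.multiply(v)) }
    .reduceByKey { (a, b) =>
      new DenseVector(a.values.zip(b.values).map { case (x, y) => x + y })
    }
}
{code}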






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-03-27 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385032#comment-14385032
 ] 

Reza Zadeh commented on SPARK-4590:
---

The umbrella JIRA for IndexedRDD is at SPARK-2365

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2015-03-17 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365581#comment-14365581
 ] 

Reza Zadeh commented on SPARK-4590:
---

Hi Qiping, We are waiting for IndexedRDD to be optimized and merged into Spark 
master.


 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Updated] (SPARK-6346) Use faster converging optimization method in MLlib

2015-03-16 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-6346:
--
Description: 
According to experiments in SPARK-1503, the LBFGS algorithm converges much 
faster than our current proximal gradient method, which is used throughout 
MLlib. This ticket tracks replacing slower-converging algorithms with faster 
ones, e.g. LBFGS.

This needs unification of the Optimization interface. For example, the LBFGS 
implementation should not know about RDDs.

  was:According to experiments in SPARK-1503, the LBFGS algorithm converges 
much faster than our current proximal gradient, which is used throughout MLlib. 
This ticket is to track replacing slower-converging algorithms, with faster 
components e.g. LBFGS


 Use faster converging optimization method in MLlib
 --

 Key: SPARK-6346
 URL: https://issues.apache.org/jira/browse/SPARK-6346
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh

 According to experiments in SPARK-1503, the LBFGS algorithm converges much 
 faster than our current proximal gradient method, which is used throughout 
 MLlib. This ticket tracks replacing slower-converging algorithms with faster 
 ones, e.g. LBFGS.
 This needs unification of the Optimization interface. For example, the LBFGS 
 implementation should not know about RDDs.
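
As a hedged sketch of the kind of unification meant here (the trait and names are illustrative, not a proposed API): the optimizer sees only a function that evaluates loss and gradient at a point, so how that function is computed (e.g. via an RDD aggregate) stays entirely outside the optimizer.

{code}
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: an RDD-agnostic optimizer interface. Callers close over
// an RDD (or anything else) inside costFun; L-BFGS never sees it.
trait ConvexOptimizer {
  def optimize(costFun: Vector => (Double, Vector), initial: Vector): Vector
}
{code}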






[jira] [Created] (SPARK-6346) Use faster converging optimization method in MLlib

2015-03-15 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-6346:
-

 Summary: Use faster converging optimization method in MLlib
 Key: SPARK-6346
 URL: https://issues.apache.org/jira/browse/SPARK-6346
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Reza Zadeh


According to experiments in SPARK-1503, the LBFGS algorithm converges much 
faster than our current proximal gradient method, which is used throughout 
MLlib. This ticket tracks replacing slower-converging algorithms with faster 
ones, e.g. LBFGS.






[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2015-02-01 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300848#comment-14300848
 ] 

Reza Zadeh commented on SPARK-4981:
---

Another option: see slide 31 to solve the problem using IndexedRDDs, thanks to 
Ankur's nice slides and work on IndexedRDD:

https://issues.apache.org/jira/secure/attachment/12656374/2014-07-07-IndexedRDD-design-review.pdf

 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2015-02-01 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300849#comment-14300849
 ] 

Reza Zadeh commented on SPARK-4981:
---

Another option: see slide 31 to solve the problem using IndexedRDDs, thanks to 
Ankur's nice slides and work on IndexedRDD:

https://issues.apache.org/jira/secure/attachment/12656374/2014-07-07-IndexedRDD-design-review.pdf

 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Issue Comment Deleted] (SPARK-4981) Add a streaming singular value decomposition

2015-02-01 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-4981:
--
Comment: was deleted

(was: Another option: see slide 31 to solve the problem using IndexedRDDs, 
thanks to Ankur's nice slides and work on IndexedRDD:

https://issues.apache.org/jira/secure/attachment/12656374/2014-07-07-IndexedRDD-design-review.pdf)

 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2015-01-30 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299666#comment-14299666
 ] 

Reza Zadeh commented on SPARK-4981:
---

To be model-parallel, we can simply warm-start the current ALS implementation 
in org.apache.spark.mllib.recommendation.

The work involved would be to expose a warm-start option in ALS, and then redo 
training with, say, 2 iterations instead of 10 on each batch of RDDs.

The stream would be over batches of Ratings.

This should be the simplest option.
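
A purely hypothetical sketch of the loop (ALS does not expose a warm-start option today, so trainWarm below is an imagined method, flagged as such):

{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

var model: MatrixFactorizationModel = null
def onBatch(batch: RDD[Rating]): Unit = {
  model =
    if (model == null) ALS.train(batch, 20, 10)  // cold start: full training
    else ALS.trainWarm(batch, 20, 2, model)      // hypothetical warm start, 2 iterations
}
{code}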


 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Created] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix

2015-01-17 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-5301:
-

 Summary: Add missing linear algebra utilities to IndexedRowMatrix 
and CoordinateMatrix
 Key: SPARK-5301
 URL: https://issues.apache.org/jira/browse/SPARK-5301
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh









[jira] [Updated] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix

2015-01-17 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-5301:
--
Description: 
1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it 
should be there)
2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion 
method to be added)

 Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
 -

 Key: SPARK-5301
 URL: https://issues.apache.org/jira/browse/SPARK-5301
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so 
 it should be there)
 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion 
 method to be added)
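
Both pieces are straightforward in terms of the existing MLlib types; a hedged sketch (illustrative, not necessarily the eventual API):

{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// 1) Transpose: swap the row and column index of every entry.
def transpose(mat: CoordinateMatrix): CoordinateMatrix =
  new CoordinateMatrix(mat.entries.map(e => MatrixEntry(e.j, e.i, e.value)))

// 2) IndexedRowMatrix -> CoordinateMatrix: one entry per nonzero element.
def toCoordinate(mat: IndexedRowMatrix): CoordinateMatrix =
  new CoordinateMatrix(mat.rows.flatMap { row =>
    row.vector.toArray.zipWithIndex.collect {
      case (value, j) if value != 0.0 => MatrixEntry(row.index, j, value)
    }
  })
{code}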






[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2014-12-28 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259789#comment-14259789
 ] 

Reza Zadeh commented on SPARK-4981:
---

We could do matrix completion (least squares objective, regularized; note that 
this is not SVD) in a streaming fashion using Stochastic Gradient Descent.

See the update equations in Algorithm 1:
http://stanford.edu/~rezab/papers/factorbird.pdf

The stream is over individual entries (as opposed to a whole row/column).

We should probably do streaming matrix completion before streaming SVD.
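
For concreteness, a minimal local sketch of the per-entry SGD step, in the spirit of Algorithm 1 of the Factorbird paper (the learning rate eta and regularizer lambda are assumptions here, and this is not streaming code):

{code}
// One SGD step for an observed entry (i, j) with value v, where u and m are
// the factor vectors of row i and column j.
def sgdStep(u: Array[Double], m: Array[Double], v: Double,
            eta: Double, lambda: Double): Unit = {
  val e = v - u.zip(m).map { case (a, b) => a * b }.sum  // prediction error
  for (k <- u.indices) {
    val uk = u(k)
    u(k) += eta * (e * m(k) - lambda * uk)               // update row factor
    m(k) += eta * (e * uk - lambda * m(k))               // update column factor
  }
}
{code}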

 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Comment Edited] (SPARK-4981) Add a streaming singular value decomposition

2014-12-28 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259789#comment-14259789
 ] 

Reza Zadeh edited comment on SPARK-4981 at 12/29/14 2:04 AM:
-

We could do matrix completion (least squares objective, regularized; note that 
this is not SVD) in a streaming fashion using Stochastic Gradient Descent.

See the update equations in Algorithm 1:
http://stanford.edu/~rezab/papers/factorbird.pdf

The stream is over individual entries (as opposed to a whole row/column).

We should probably do streaming matrix completion before streaming SVD.


was (Author: rezazadeh):
We could do matrix completion (least squares objective, reqularized, note that 
this is not SVD) in a streaming fashion using Stochastic Gradient Descent.

See the update equations in Algorithm 1:
http://stanford.edu/~rezab/papers/factorbird.pdf

The stream is over individual entries (as opposed a whole row/column).

We should probably do streaming matrix completion before streaming SVD.

 Add a streaming singular value decomposition
 

 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman

 This is for tracking WIP on a streaming singular value decomposition 
 implementation. This will likely be more complex than the existing streaming 
 algorithms (k-means, regression), but should be possible using the family of 
 sequential update rules outlined in this paper:
 Fast low-rank modifications of the thin singular value decomposition
 by Matthew Brand
 http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2014-12-21 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255308#comment-14255308
 ] 

Reza Zadeh commented on SPARK-4590:
---

Here is an initial review of current systems and two ways to go forward: 

https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing

 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243133#comment-14243133
 ] 

Reza Zadeh commented on SPARK-4823:
---

Given that we're talking about RowMatrices, computing rowSimilarities the same 
way as columnSimilarities would require transposing the matrix, which is 
dangerous when the original matrix has many rows: RowMatrix assumes a single 
row fits in memory on a single machine, and that might no longer hold after 
transposing a RowMatrix.

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows 
 (> 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Created] (SPARK-4823) rowSimilarities

2014-12-10 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-4823:
-

 Summary: rowSimilarities
 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh


RowMatrix has a columnSimilarities method to find cosine similarities between 
columns.

A rowSimilarities method would be useful to find similarities between rows.

This JIRA is to investigate which algorithms are suitable for such a method, 
better than brute-forcing it. Note that when there are many rows (> 10^6), it 
is unlikely that brute-force will be feasible, since the output will be of 
order 10^12.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227023#comment-14227023
 ] 

Reza Zadeh commented on SPARK-1503:
---

Thanks Aaron. From an implementation perspective, it's probably easier to 
implement a constant step size first. From there you can see if there is any 
finicky behavior and compare to the unaccelerated proximal gradient already in 
Spark. If it works well enough, we should commit the first PR without 
backtracking and then experiment with backtracking; otherwise, if you see 
strange behavior, you can decide whether backtracking would solve it. 

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2014-11-24 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224117#comment-14224117
 ] 

Reza Zadeh commented on SPARK-4590:
---

Some starting points: 
- http://stanford.edu/~rezab/papers/factorbird.pdf
- http://parameterserver.org/

More detailed comparisons coming.


 Early investigation of parameter server
 ---

 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 In the current implementation of GLM solvers, we save intermediate models 
 on the driver node and update them through broadcast and aggregation. Even with 
 torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
 ~10 million features. This JIRA is for investigating the parameter server 
 approach, including algorithm, infrastructure, and dependencies.






[jira] [Commented] (SPARK-3254) Streaming K-Means

2014-10-21 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178825#comment-14178825
 ] 

Reza Zadeh commented on SPARK-3254:
---

Will this make it to 1.2?

 Streaming K-Means
 -

 Key: SPARK-3254
 URL: https://issues.apache.org/jira/browse/SPARK-3254
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Xiangrui Meng
Assignee: Jeremy Freeman

 Streaming K-Means with proper decay settings.






[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-17 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175557#comment-14175557
 ] 

Reza Zadeh commented on SPARK-3434:
---

Thanks Shivaram! As discussed over the phone, we will use your design and build 
upon it, so that you can focus on the linear algebraic operations such as TSQR.

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Shivaram Venkataraman

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs, each containing only one sub-matrix?






[jira] [Created] (SPARK-3974) Block matrix abstractions and partitioners

2014-10-16 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3974:
-

 Summary: Block matrix abstractions and partitioners
 Key: SPARK-3974
 URL: https://issues.apache.org/jira/browse/SPARK-3974
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


We need abstractions for block matrices with fixed block sizes, with each block 
being dense. Partitioners along both rows and columns are required.
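
A hedged sketch of what a partitioner over (rowBlockIndex, colBlockIndex) keys could look like (illustrative, not the abstraction that will ship):

{code}
import org.apache.spark.Partitioner

// Places block (i, j) of a rowBlocks x colBlocks grid into its own
// partition, laid out row-major.
class GridPartitioner(val rowBlocks: Int, val colBlocks: Int) extends Partitioner {
  require(rowBlocks > 0 && colBlocks > 0)
  override def numPartitions: Int = rowBlocks * colBlocks
  override def getPartition(key: Any): Int = key match {
    case (i: Int, j: Int) => i * colBlocks + j
    case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
  }
}
{code}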






[jira] [Created] (SPARK-3975) Block Matrix addition and multiplication

2014-10-16 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3975:
-

 Summary: Block Matrix addition and multiplication
 Key: SPARK-3975
 URL: https://issues.apache.org/jira/browse/SPARK-3975
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


Block matrix addition and multiplication, for the case when partitioning 
schemes match.
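
For addition with matching partitioning, a minimal sketch (using breeze matrices for the per-block arithmetic; the names are illustrative):

{code}
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// Blocks keyed by (rowBlockIndex, colBlockIndex). With identical block sizes
// and partitioners, addition reduces to a join plus a blockwise sum.
def add(a: RDD[((Int, Int), DenseMatrix[Double])],
        b: RDD[((Int, Int), DenseMatrix[Double])]): RDD[((Int, Int), DenseMatrix[Double])] =
  a.join(b).mapValues { case (x, y) => x + y }
{code}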






[jira] [Created] (SPARK-3976) Detect block matrix partitioning schemes

2014-10-16 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3976:
-

 Summary: Detect block matrix partitioning schemes
 Key: SPARK-3976
 URL: https://issues.apache.org/jira/browse/SPARK-3976
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


Provide repartitioning methods for block matrices, so that non-identically 
partitioned matrices can be repartitioned for add/multiply.






[jira] [Created] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix

2014-10-16 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3977:
-

 Summary: Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
 Key: SPARK-3977
 URL: https://issues.apache.org/jira/browse/SPARK-3977
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix






[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners

2014-10-16 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-3974:
--
Component/s: MLlib

 Block matrix abstractions and partitioners
 --

 Key: SPARK-3974
 URL: https://issues.apache.org/jira/browse/SPARK-3974
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 We need abstractions for block matrices with fixed block sizes, with each 
 block being dense. Partitioners along both rows and columns are required.






[jira] [Updated] (SPARK-3976) Detect block matrix partitioning schemes

2014-10-16 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-3976:
--
Component/s: MLlib

 Detect block matrix partitioning schemes
 

 Key: SPARK-3976
 URL: https://issues.apache.org/jira/browse/SPARK-3976
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 Provide repartitioning methods for block matrices, so that non-identically 
 partitioned matrices can be repartitioned for add/multiply.






[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix

2014-10-16 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-3977:
--
Component/s: MLlib

 Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
 ---

 Key: SPARK-3977
 URL: https://issues.apache.org/jira/browse/SPARK-3977
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix






[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162814#comment-14162814
 ] 

Reza Zadeh commented on SPARK-3820:
---

I will do an experiment to see if the random number generation is adding 
significant overhead, and if it is, add a flag to avoid it when a threshold of 
zero is given.

 Specialize columnSimilarity() without any threshold
 ---

 Key: SPARK-3820
 URL: https://issues.apache.org/jira/browse/SPARK-3820
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
 compute the exact cosine similarities. It still requires sampling, which is 
 unnecessary for this case. We should have a specialized version for it, in 
 order to have a fair comparison with DIMSUM.






[jira] [Issue Comment Deleted] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-3820:
--
Comment: was deleted

(was: See previous comment on resolution.)

 Specialize columnSimilarity() without any threshold
 ---

 Key: SPARK-3820
 URL: https://issues.apache.org/jira/browse/SPARK-3820
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
 compute the exact cosine similarities. It still requires sampling, which is 
 unnecessary for this case. We should have a specialized version for it, in 
 order to have a fair comparison with DIMSUM.






[jira] [Resolved] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh resolved SPARK-3820.
---
Resolution: Fixed

See previous comment on resolution.

 Specialize columnSimilarity() without any threshold
 ---

 Key: SPARK-3820
 URL: https://issues.apache.org/jira/browse/SPARK-3820
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
 compute the exact cosine similarities. It still requires sampling, which is 
 unnecessary for this case. We should have a specialized version for it, in 
 order to have a fair comparison with DIMSUM.






[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162899#comment-14162899
 ] 

Reza Zadeh commented on SPARK-3820:
---

I ran columnSimilarities(0.0) with the random number generation commented out 
and then uncommented, and didn't observe any difference in the completion time 
of the mapPartitionsWithIndex stage. 

 Specialize columnSimilarity() without any threshold
 ---

 Key: SPARK-3820
 URL: https://issues.apache.org/jira/browse/SPARK-3820
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh

 `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
 compute the exact cosine similarities. It still requires sampling, which is 
 unnecessary for this case. We should have a specialized version for it, in 
 order to have a fair comparison with DIMSUM.






[jira] [Created] (SPARK-3790) CosineSimilarity via DIMSUM example

2014-10-04 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3790:
-

 Summary: CosineSimilarity via DIMSUM example
 Key: SPARK-3790
 URL: https://issues.apache.org/jira/browse/SPARK-3790
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


Create an example that reports the approximation error of DIMSUM on an 
arbitrary RowMatrix supplied via the command line.

PR tracking this:
https://github.com/apache/spark/pull/2622
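
A rough sketch of the comparison such an example would run (assuming rows: RDD[Vector] has already been parsed from the command-line path; this is not the PR's actual code):

{code}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val exact  = mat.columnSimilarities()      // brute-force cosine similarities
val approx = mat.columnSimilarities(0.1)   // DIMSUM with threshold 0.1

// Mean absolute error over the pairs present in both results.
val mae = exact.entries.map(e => ((e.i, e.j), e.value))
  .join(approx.entries.map(e => ((e.i, e.j), e.value)))
  .values.map { case (x, y) => math.abs(x - y) }.mean()
{code}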






[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-03 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158949#comment-14158949
 ] 

Reza Zadeh commented on SPARK-3434:
---

Any updates, Shivaram?

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices, 
 or by many RDDs, each containing only one sub-matrix?
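 
 To make question 1 concrete, here is one possible layout, sketched with 
 illustrative names (not a committed design): a single RDD keyed by block 
 coordinates, on which block multiplication becomes a join on the inner index 
 followed by a reduceByKey.
 {code:scala}
 import breeze.linalg.DenseMatrix
 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD

 // Hypothetical layout: one RDD holding every sub-matrix, keyed by
 // (blockRow, blockCol).
 type Block = ((Int, Int), DenseMatrix[Double])

 // C(i, j) = sum_k A(i, k) * B(k, j): join on the inner index k,
 // multiply the local blocks, then sum partial products per (i, j).
 def multiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
   val left  = a.map { case ((i, k), m) => (k, (i, m)) }
   val right = b.map { case ((k, j), m) => (k, (j, m)) }
   left.join(right)
     .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }
     .reduceByKey(_ + _)
 }
 {code}
 One point in favor of the single-RDD layout: it keeps the partitioner in one 
 place, which matters for co-partitioning the join above.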



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-09-29 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152275#comment-14152275
 ] 

Reza Zadeh commented on SPARK-3434:
---

It looks like Shivaram Venkataraman from the AMPLab has started work on this. I 
will be meeting with him to see if we can reuse some of his work.

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices, 
 or by many RDDs, each containing only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118427#comment-14118427
 ] 

Reza Zadeh commented on SPARK-2885:
---

Hi Clive,
Can you please post the code you used to generate the error? I am having 
trouble reproducing this. There is a test for sparse vectors in the PR, which 
does not catch this.
Reza

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
 accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
 tradeoff from low computation to high accuracy. For a very large gamma, all 
 cosine similarities are computed exactly, with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf
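 
 A usage sketch against the API in the PR, on a toy matrix (the data here is 
 made up purely for illustration):
 {code:scala}
 import org.apache.spark.SparkContext
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 val sc = new SparkContext("local", "dimsum-sketch")
 val rows = sc.parallelize(Seq(
   Vectors.dense(1.0, 2.0, 0.0),
   Vectors.dense(0.0, 1.0, 3.0),
   Vectors.dense(4.0, 0.0, 1.0)
 ))
 val mat = new RowMatrix(rows)
 val exact  = mat.columnSimilarities()     // all pairs, computed exactly
 val approx = mat.columnSimilarities(0.5)  // pairs below 0.5 may be missed
 approx.entries.collect().foreach(println) // upper-triangular CoordinateMatrix
 {code}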



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118445#comment-14118445
 ] 

Reza Zadeh commented on SPARK-2885:
---

I don't see your code. Looking at the exception, however: when generating the 
random sparse vectors, did you remember to sort by indices? This is a 
requirement of breeze, and it looks like the example you have violates it: 
http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector
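
For anyone hitting the same exception, a minimal illustration of the fix (the 
input values are hypothetical): sort the (index, value) pairs together before 
constructing the vector, since breeze requires strictly increasing indices.
{code:scala}
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical unsorted input that would trip breeze's requirement.
val unsorted = Seq((4, 1.0), (1, 2.0), (3, 0.5))

// Sort the index/value pairs together, then build the sparse vector.
val (indices, values) = unsorted.sortBy(_._1).unzip
val v = Vectors.sparse(5, indices.toArray, values.toArray)
{code}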

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
 accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
 tradeoff from low computation to high accuracy. For a very large gamma, all 
 cosine similarities are computed exactly, with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118464#comment-14118464
 ] 

Reza Zadeh commented on SPARK-2885:
---

Yes, looking at your code it looks like you need to sort the indices.

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
 accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
 tradeoff from low computation to high accuracy. For a very large gamma, all 
 cosine similarities are computed exactly, with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118464#comment-14118464
 ] 

Reza Zadeh edited comment on SPARK-2885 at 9/2/14 6:20 PM:
---

Yes, looking at your code it looks like you need to sort by indices.


was (Author: rezazadeh):
Yes, looking at your code it looks like you need to sort the indices.

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
 accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
 tradeoff from low computation to high accuracy. For a very large gamma, all 
 cosine similarities are computed exactly, with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-08-30 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-2885:
--

Description: 
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
tradeoff from low computation to high accuracy. For a very large gamma, all 
cosine similarities are computed exactly, with no sampling.

Current PR:
https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several 
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
https://github.com/apache/spark/pull/1778#issuecomment-51300825)

- Algorithm is used in large-scale production at Twitter. e.g. see 
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
 . Twitter also open-sourced their version in scalding: 
https://github.com/twitter/scalding/pull/833

- When used with the gamma parameter set high, this algorithm becomes the 
normalized gramian matrix, which is useful in RowMatrix alongside the 
computeGramianMatrix method already in RowMatrix

More details about usage at Twitter: 
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum

For correctness proof, see Theorem 4.3 in 
http://stanford.edu/~rezab/papers/dimsum.pdf

  was:
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
tradeoff from low computation to high accuracy. For a very large gamma, all 
cosine similarities are computed exactly, with no sampling.

Current PR:
https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several 
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
https://github.com/apache/spark/pull/1778#issuecomment-51300825)

- Algorithm is used in large-scale production at Twitter. e.g. see 
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
 . Twitter also open-sourced their version in scalding: 
https://github.com/twitter/scalding/pull/833

- When used with the gamma parameter set high, this algorithm becomes the 
normalized gramian matrix, which is useful in RowMatrix alongside the 
computeGramianMatrix method already in RowMatrix


 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.

[jira] [Created] (SPARK-2885) All-pairs similarity via DIMSUM

2014-08-06 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-2885:
-

 Summary: All-pairs similarity via DIMSUM
 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh


Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

Current PR for this is WIP:
https://github.com/apache/spark/pull/1778



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-08-06 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-2885:
--

Description: 
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

Current PR:
https://github.com/apache/spark/pull/1778

  was:
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

Current PR for this is WIP:
https://github.com/apache/spark/pull/1778


 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 Current PR:
 https://github.com/apache/spark/pull/1778



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-08-06 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-2885:
--

Description: 
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
tradeoff from low computation to high accuracy. For a very large gamma, all 
cosine similarities are computed exactly, with no sampling.

Current PR:
https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several 
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
https://github.com/apache/spark/pull/1778#issuecomment-51300825)

- Algorithm is used in large-scale production at Twitter. e.g. see 
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
 . Twitter also open-sourced their version in scalding: 
https://github.com/twitter/scalding/pull/833

- When used with the gamma parameter set high, this algorithm becomes the 
normalized gramian matrix, which is useful in RowMatrix alongside the 
computeGramianMatrix method already in RowMatrix

  was:
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

Current PR:
https://github.com/apache/spark/pull/1778


 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs. 
 accuracy. Setting gamma anywhere from 1 up to the largest magnitude moves the 
 tradeoff from low computation to high accuracy. For a very large gamma, all 
 cosine similarities are computed exactly, with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix



--
This message was sent by Atlassian JIRA
(v6.2#6252)