ScaledML 2020 Spark Speakers and Promo

2019-12-01 Thread Reza Zadeh
Spark Users, You are all welcome to join us at ScaledML 2020: http://scaledml.org A very steep discount is available for this list, using this link. We'd love to see you there. Best, Reza

Re: [MLlib] DIMSUM row similarity?

2015-08-31 Thread Reza Zadeh
This is ongoing work tracked by SPARK-4823, with a PR for it here: PR6213 - unfortunately the PR didn't make it into Spark 1.5. On Mon, Aug 31, 2015 at 4:17 AM, Maandy wrote:

Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Reza Zadeh
Great! Reza On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey rbol...@gmail.com wrote: Hi Reza, That was the fix we needed. After sorting, the transposed entries are gone! Thanks a bunch, rick On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote: Hi Richard, One reason

Re: Duplicate entries in output of mllib column similarities

2015-05-07 Thread Reza Zadeh
This shouldn't be happening, do you have an example to reproduce it? On Thu, May 7, 2015 at 4:17 PM, rbolkey rbol...@gmail.com wrote: Hi, I have a question regarding one of the oddities we encountered while running mllib's column similarities operation. When we examine the output, we find

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Reza Zadeh
Hi Andrew, The .principalComponents feature of RowMatrix is currently constrained to tall and skinny matrices. Your matrix is barely above the skinny requirement (10k columns), though the number of rows is fine. What are you looking to do with the principal components? If unnormalized PCA is OK

Re: Benchmarking col vs row similarities

2015-04-10 Thread Reza Zadeh
You should pull in this PR: https://github.com/apache/spark/pull/5364 It should resolve that. It is in master. Best, Reza On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am benchmarking row vs col similarity flow on 60M x 10M matrices... Details are in

Re: Using DIMSUM with ids

2015-04-06 Thread Reza Zadeh
Right now dimsum is meant to be used for tall and skinny matrices, and so columnSimilarities() returns similar columns, not rows. We are working on adding an efficient row similarity as well, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Reza On Mon, Apr 6, 2015 at 6:08

Re: Need a spark mllib tutorial

2015-04-02 Thread Reza Zadeh
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is k * d, where d is the dimension of the data. Since you're using k=1000, if your data has dimension higher than say, 10,000, you will have trouble, because k*d doubles have to fit in the driver. Reza On Sat, Mar 28, 2015
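The driver-memory constraint described above is simple arithmetic; a plain-Python sketch (illustrative numbers, not Spark code):

```python
# Rough driver-side footprint of a k-means model: k cluster centers,
# each a dense vector of d doubles (8 bytes apiece).
def kmeans_model_bytes(k: int, d: int) -> int:
    return k * d * 8

# k = 1000 clusters, d = 10,000 dimensions -> 80,000,000 bytes of raw
# doubles (~80 MB), before any JVM object overhead is added on top.
size = kmeans_model_bytes(1000, 10_000)
print(size)  # 80000000
```

At d = 100,000 the same model would already be ~800 MB of raw doubles, which is why high-dimensional data with large k strains the driver.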

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Reza Zadeh
, 2015 at 6:36 PM, Reza Zadeh r...@databricks.com wrote: If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread Reza Zadeh
If you want to do a nonstandard (or uncentered) PCA, you can call computeSVD on RowMatrix, and look at the resulting 'V' Matrix. That should match the output of the other two systems. Reza On Tue, Mar 24, 2015 at 3:53 AM, Sean Owen so...@cloudera.com wrote: Those implementations are computing

Re: How to do nested foreach with RDD

2015-03-22 Thread Reza Zadeh
You can do this with the 'cartesian' product method on RDD. For example: val rdd1 = ... val rdd2 = ... val combinations = rdd1.cartesian(rdd2).filter { case (a, b) => a < b } Reza On Sat, Mar 21, 2015 at 10:37 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have two big RDD, and I need to do
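The same unordered-pairs pattern (keeping only `a < b` to drop duplicate orderings and self-pairs) can be sketched in plain Python with `itertools.product`; the Spark version applies the identical filter after `rdd1.cartesian(rdd2)`:

```python
from itertools import product

xs = [1, 2, 3]
ys = [1, 2, 3]

# Full cartesian product, then keep one ordering of each pair (a < b),
# which also removes self-pairs -- the same filter as in the Scala snippet.
combinations = [(a, b) for a, b in product(xs, ys) if a < b]
print(combinations)  # [(1, 2), (1, 3), (2, 3)]
```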

Re: Column Similarity using DIMSUM

2015-03-19 Thread Reza Zadeh
test it on larger machines? Regards, Manish *From:* Manish Gupta 8 [mailto:mgupt...@sapient.com] *Sent:* Wednesday, March 18, 2015 11:20 PM *To:* Reza Zadeh *Cc:* user@spark.apache.org *Subject:* RE: Column Similarity using DIMSUM Hi Reza, I have tried threshold to be only

Re: Column Similarity using DIMSUM

2015-03-18 Thread Reza Zadeh
Hi Manish, Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher. Best, Reza On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 mgupt...@sapient.com wrote: Hi, I am running Column Similarity (All Pairs

Re: SVD transform of large matrix with MLlib

2015-03-11 Thread Reza Zadeh
Answers: databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html Reza On Wed, Mar 11, 2015 at 2:33 PM, sergunok ser...@gmail.com wrote: Does somebody used SVD from MLlib for very large (like 10^6 x 10^7) sparse matrix? What time did it take? What

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab, The current method is optimized for having many rows and few columns. In your case it is exactly the opposite. We are working on your case, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Your case is very common, so I will put some time into building it. In the

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab, In this dense case, the output will contain 10,000 x 10,000 entries, i.e. 100 million doubles, which doesn't fit in 1GB with overheads. For a dense matrix, similarColumns() scales quadratically in the number of columns, so you need more memory across the cluster. Reza On Sun, Mar 1, 2015
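The quadratic blow-up is easy to quantify (plain Python, illustrative arithmetic only):

```python
# For a fully dense matrix with n columns, all-pairs column similarity
# can emit up to n * n entries (counting both (i, j) and (j, i) orderings).
def dense_output_doubles(n_cols: int) -> int:
    return n_cols * n_cols

# 10,000 dense columns -> 100 million doubles -> ~800 MB of raw values,
# which indeed cannot fit in a 1 GB driver once overheads are added.
n = 10_000
print(dense_output_doubles(n), dense_output_doubles(n) * 8)
```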

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish, Works fine for me with less than those settings (30x1000 dense matrix, 1GB driver, 1GB executor): bin/spark-shell --driver-memory 1G --executor-memory 1G Then running the following finished without trouble and in a few seconds. Are you sure your driver is actually getting the RAM

Re: Is spark streaming +MlLib for online learning?

2015-02-18 Thread Reza Zadeh
This feature request is already being tracked: https://issues.apache.org/jira/browse/SPARK-4981 Aiming for 1.4 Best, Reza On Wed, Feb 18, 2015 at 2:40 AM, mucaho muc...@yahoo.com wrote: Hi What is the general consensus/roadmap for implementing additional online / streamed trainable models?

Re: what is behind matrix multiplications?

2015-02-11 Thread Reza Zadeh
Yes, the local matrix is broadcast to each worker. Here is the code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407 In 1.3 we will have Block matrix multiplication too, which will allow distributed matrix
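The broadcast-multiply idea (every worker holds a full copy of the small local matrix and multiplies its own rows by it) can be sketched without Spark; `A_rows` stands in for the distributed RowMatrix and `B` for the broadcast local matrix (names are illustrative):

```python
# Distributed matrix A (one list entry per distributed row) times a
# small local matrix B that every worker would hold a broadcast copy of.
A_rows = [[1.0, 2.0],
          [3.0, 4.0]]          # stand-in for the RDD of rows
B = [[5.0, 6.0],
     [7.0, 8.0]]               # the broadcast local matrix

def row_times_matrix(row, B):
    # One output row: dot product of `row` with each column of B.
    return [sum(r * B[k][j] for k, r in enumerate(row))
            for j in range(len(B[0]))]

# In Spark this would be rows.map(...); here a plain comprehension suffices.
product = [row_times_matrix(row, B) for row in A_rows]
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
```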

Re: foreachActive functionality

2015-01-25 Thread Reza Zadeh
The idea is to unify the code path for dense and sparse vector operations, which makes the codebase easier to maintain. By handling (index, value) tuples, you can let the foreachActive method take care of checking if the vector is sparse or dense, and running a foreach over the values. On Sun,
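A minimal sketch of that unified code path, in plain Python (`foreach_active` is a hypothetical stand-in for MLlib's `foreachActive`, not its actual implementation):

```python
# Dense vector: a plain list. Sparse vector: a (size, indices, values) triple.
def foreach_active(vec, f):
    """Call f(index, value) for each active entry, hiding the storage format."""
    if isinstance(vec, list):                 # dense case
        for i, v in enumerate(vec):
            f(i, v)
    else:                                     # sparse case
        _, indices, values = vec
        for i, v in zip(indices, values):
            f(i, v)

# The caller's code is identical for both representations.
total = 0.0
def accumulate(i, v):
    global total
    total += v

foreach_active([1.0, 0.0, 3.0], accumulate)               # dense
foreach_active((3, [0, 2], [1.0, 3.0]), accumulate)       # sparse
print(total)  # 8.0
```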

Re: Row similarities

2015-01-17 Thread Reza Zadeh
*To:* Reza Zadeh r...@databricks.com *Cc:* user user@spark.apache.org *Sent:* Saturday, January 17, 2015 11:29 AM *Subject:* Re: Row similarities Thanks Reza, interesting approach. I think what I actually want is to calculate pair-wise distance, on second thought. Is there a pattern

Re: Row similarities

2015-01-17 Thread Reza Zadeh
it has a lot of uses and can at very least be optimized for output matrix symmetry. On Jan 17, 2015, at 11:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Yeah okay, thanks. On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote: Pat, columnSimilarities is what that blog

Re: Row similarities

2015-01-16 Thread Reza Zadeh
You can use K-means https://spark.apache.org/docs/latest/mllib-clustering.html with a suitably large k. Each cluster should correspond to rows that are similar to one another. On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: What's a good way to calculate

Re: Broadcast joins on RDD

2015-01-12 Thread Reza Zadeh
First, you should collect().toMap() the small RDD, then you should use broadcast followed by a map to do a map-side join http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf (slide 10 has an example). Spark SQL also does it by default for tables
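The map-side join pattern can be sketched in plain Python; `small` plays the role of the collected-and-broadcast RDD and `large` the big RDD (names illustrative, not Spark API):

```python
# Map-side (broadcast) join: collect the small side into a dict, ship it
# to every worker, then join inside a map -- no shuffle of the big side.
small = {"a": 1, "b": 2}                   # collect().toMap() of the small RDD
large = [("a", 10), ("b", 20), ("c", 30)]  # the big RDD of (key, value)

# In Spark: val bc = sc.broadcast(small); large.flatMap { ... bc.value ... }
joined = [(k, (v, small[k])) for k, v in large if k in small]
print(joined)  # [('a', (10, 1)), ('b', (20, 2))]
```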

Re: [mllib] GradientDescent requires huge memory for storing weight vector

2015-01-12 Thread Reza Zadeh
I guess you're not using too many features (e.g. 10m), just that hashing the index makes it look that way, is that correct? If so, the simple dictionary that maps your feature index -> rank can be broadcast and used everywhere, so you can pass mllib just the feature's rank as its index. Reza On
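That index-to-rank remapping can be sketched in plain Python (all names illustrative); in Spark the resulting dictionary would be broadcast to the workers:

```python
# Hashed feature indices look huge (e.g. near 10M) even when only a
# handful of features are actually used. Build a dictionary from raw
# index -> dense rank, and feed MLlib the compact ranks instead.
raw_indices = [9_999_991, 12, 5_000_003, 12, 9_999_991]

rank_of = {idx: rank for rank, idx in enumerate(sorted(set(raw_indices)))}
remapped = [rank_of[i] for i in raw_indices]
print(rank_of)   # {12: 0, 5000003: 1, 9999991: 2}
print(remapped)  # [2, 0, 1, 0, 2]
```

The weight vector now only needs as many entries as there are distinct features, rather than spanning the full hash range.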

Re: RowMatrix.multiply() ?

2015-01-09 Thread Reza Zadeh
-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html Which presents https://issues.apache.org/jira/browse/SPARK-3434 which is still in work at this time. Is this the correct Jira issue for the transpose operation? ETA? Thanks a lot! -A *From:* Reza Zadeh [mailto:r

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Reza Zadeh
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4981 On Fri, Jan 2, 2015 at 8:28 PM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous ebay online recommender uses SGD However, you can try using the previous model

Re: how to do incremental model updates using spark streaming and mllib

2014-12-26 Thread Reza Zadeh
As of Spark 1.2 you can do Streaming k-means, see examples here: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1 Best, Reza On Fri, Dec 26, 2014 at 1:36 AM, vishnu johnfedrickena...@gmail.com wrote: Hi, Say I have created a clustering model using KMeans for 100million

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Reza Zadeh
As Sean mentioned, you would be computing similar features then. If you want to find similar users, I suggest running k-means with some fixed number of clusters. It's not reasonable to try and compute all pairs of similarities between 1bn items, so k-means with fixed k is more suitable here.

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
? On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote: See this thread for examples of sparse matrix x sparse matrix: https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA We thought about providing matrix multiplies on CoordinateMatrix, however, the matrices have

Re: sparse x sparse matrix multiplication

2014-11-06 Thread Reza Zadeh
See this thread for examples of sparse matrix x sparse matrix: https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA We thought about providing matrix multiplies on CoordinateMatrix, however, the matrices have to be very dense for the overhead of having many little (i, j, value) objects
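The per-entry bookkeeping that makes coordinate-form multiplies expensive is visible in a plain-Python sketch of sparse-times-sparse on (i, j, value) triples (illustrative, not Spark code):

```python
from collections import defaultdict

# Sparse matrices in coordinate (COO) form: lists of (i, j, value).
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # 2x2: [[1, 2], [0, 3]]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # 2x2: [[4, 0], [5, 6]]

# C[i][k] = sum over j of A[i][j] * B[j][k]: match A's column index with
# B's row index, multiply, then reduce by output coordinate (i, k).
C = defaultdict(float)
for (i, j, a) in A:
    for (j2, k, b) in B:
        if j == j2:
            C[(i, k)] += a * b

print(dict(C))  # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}
```

Every nonzero is a little (i, j, value) object, which is exactly the overhead the message above warns about unless the matrices are quite dense.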

Re: Measuring execution time

2014-10-24 Thread Reza Zadeh
The Spark UI has timing information. When running locally, it is at http://localhost:4040 Otherwise the url to the UI is printed out onto the console when you startup spark shell or run a job. Reza On Fri, Oct 24, 2014 at 5:51 AM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if

Re: mllib CoordinateMatrix

2014-10-14 Thread Reza Zadeh
Hello, CoordinateMatrix is in its infancy, and right now is only a placeholder. To get/set the value at (i,j), you should map the entries rdd using the usual rdd map operation, and change the relevant entries. To get the values on a specific row, you can call toIndexedRowMatrix(), which returns

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
to generate jaccard but I had to run it twice due to the design of RowMatrix / CoordinateMatrix...I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value... Are you considering this in your design ? Thanks. Deb On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh r

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
noticing some weird behavior as different runs are changing the results... Also can columnMagnitudes produce non-deterministic results ? Thanks. Deb On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb, I am not templating RowMatrix/CoordinateMatrix since

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
stable code and then test dimsum... Thanks. Deb On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh r...@databricks.com wrote: I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com wrote

Re: Huge matrix

2014-09-09 Thread Reza Zadeh
Better to do it in a PR of your own, it's not sufficiently related to dimsum On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com wrote: Cool...can I add loadRowMatrix in your PR ? Thanks. Deb On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote: Hi Deb

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
that goes to matrix factorization) so I don't think joining and group-by on (product,product) will be a big issue for me... Does it make sense to add all pair similarities as well with dimsum based similarity ? Thanks. Deb On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh r...@databricks.com

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset... Thanks. Deb On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote: Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
will run fine... But for tall and wide, what do you suggest ? can dimsum handle it ? I might need jaccard as well...can I plug that in the PR ? On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh r...@databricks.com wrote: You might want to wait until Wednesday since the interface will be changing

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
it over the weekend. On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh r...@databricks.com wrote: For 60M x 10K brute force and dimsum thresholding should be fine. For 60M x 10M probably brute force won't work depending on the cluster's power, and dimsum thresholding should work with appropriate

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
that will be useful) ? I guess it makes sense to add some similarity measures in mllib... On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote: Yes you're right, calling dimsum with gamma as PositiveInfinity turns it into the usual brute force algorithm for cosine similarity
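The brute-force limit (gamma = infinity, i.e. no sampling at all) is just all-pairs cosine similarity over the columns; a plain-Python sketch of that baseline (illustrative, not the MLlib code):

```python
from math import sqrt

# Columns of a small matrix, stored densely for illustration.
cols = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

# All pairs (i < j), as a thresholdless columnSimilarities() would emit.
sims = {(i, j): cosine(cols[i], cols[j])
        for i in range(len(cols)) for j in range(i + 1, len(cols))}
print(sims)
```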

Re: Does MLlib in spark 1.0.2 only work for tall-and-skinny matrix?

2014-08-10 Thread Reza Zadeh
Hi Andy, That is the case in Spark 1.0, yes. However, as of Spark 1.1 which is coming out very soon, you will be able to run SVD on non-TS matrices. If you try to apply the current algorithm to a matrix with more than 10,000 columns, you will overburden the master node, which has to compute a 10k

Re: No Intercept for Python

2014-06-18 Thread Reza Zadeh
Hi Naftali, Yes, you're right. For now please add a column of ones. We are working on adding a weighted regularization term, and on exposing the Scala intercept option in the Python binding. Best, Reza On Mon, Jun 16, 2014 at 12:19 PM, Naftali Harris naft...@affirm.com wrote: Hi everyone, The
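The column-of-ones workaround is a one-liner: append a constant 1.0 feature to every row, so the weight the regression learns for that column plays the role of the missing intercept. A plain-Python sketch (illustrative, not the MLlib API):

```python
# Append a constant-1 feature to each row; the learned weight on this
# column acts as the intercept the Python binding doesn't expose yet.
def add_intercept_column(features):
    return [row + [1.0] for row in features]

X = [[2.0, 3.0],
     [4.0, 5.0]]
print(add_intercept_column(X))  # [[2.0, 3.0, 1.0], [4.0, 5.0, 1.0]]
```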

Re: Huge matrix

2014-04-11 Thread Reza Zadeh
Hi Xiaoli, There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf The PR is at https://github.com/apache/spark/pull/336 though it will need refactoring given the recent changes to matrix interface in MLlib. You
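The core idea of the sampling scheme (keep each co-occurrence with a probability tied to the column norms and the threshold gamma, then rescale so the estimate stays unbiased) can be sketched very loosely in plain Python. This is an illustration of the sampling-and-rescaling idea only, not the DIMSUM algorithm as specified in the paper or implemented in the PR, and every name here is made up:

```python
import random
from math import sqrt

# Rows of a small matrix, dense for clarity.
rows = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
n_cols = 2

# Column norms, which set the per-pair sampling probabilities.
norms = [sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n_cols)]

def sampled_cosine(rows, norms, gamma, rng):
    sims = {}
    for r in rows:
        for i in range(n_cols):
            for j in range(i + 1, n_cols):
                if r[i] == 0.0 or r[j] == 0.0:
                    continue
                # Keep this co-occurrence with probability p, then divide
                # by p so the expected value matches the exact similarity.
                p = min(1.0, sqrt(gamma) / (norms[i] * norms[j]))
                if rng.random() < p:
                    contrib = r[i] * r[j] / (p * norms[i] * norms[j])
                    sims[(i, j)] = sims.get((i, j), 0.0) + contrib
    return sims

# With gamma large enough, p == 1 everywhere and this degenerates to the
# exact brute-force cosine similarity between the columns.
exact = sampled_cosine(rows, norms, gamma=1e9, rng=random.Random(0))
print(exact)  # {(0, 1): 0.5}
```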