Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Debasish Das
SparkSQL was built to improve upon Hive on Spark runtime further... On Tue, May 19, 2015 at 10:37 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hive on Spark and SparkSQL which should be better , and what are the key characteristics and the advantages and the disadvantages

IndexedRowMatrix semantics

2015-05-20 Thread Debasish Das
Hi, For indexedrowmatrix and rowmatrix, both take RDD(vector)... is it possible that it has intermixed dense and sparse vector... basically I am considering a gemv flow when indexedrowmatrix has dense flag true, dot flow otherwise... Thanks. Deb

Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of rowSimilarities JIRA 4823 ...if your query points can fit in memory there is broadcast version which we are experimenting with internally... we are using brute force KNN right now in the PR...based on flann paper lsh did not work well but before you go to

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-05-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551294#comment-14551294 ] Debasish Das commented on SPARK-6323: - Petuum paper that got released today mentioned

[jira] [Commented] (SPARK-4823) rowSimilarities

2015-05-17 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547318#comment-14547318 ] Debasish Das commented on SPARK-4823: - I opened up a PR that worked well for our

Re: How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread Debasish Das
I opened it up today but it should help you: https://github.com/apache/spark/pull/6213 On Sat, May 16, 2015 at 6:18 PM, Chunnan Yao yaochun...@gmail.com wrote: Hi all, Recently I've ran into a scenario to conduct two sample tests between all paired combination of columns of an RDD. But the

[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-05-02 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4231: Affects Version/s: (was: 1.2.0) 1.4.0 Add RankingMetrics

[jira] [Reopened] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-05-02 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das reopened SPARK-4231: - The code was not part of SPARK-3066 and so reopening... Add RankingMetrics to examples.MovieLensALS

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
Cross Join shuffle space might not be needed since most likely through application specific logic (topK etc) you can cut the shuffle space...Also most likely the brute force approach will be a benchmark tool to see how better is your clustering based KNN solution since there are several ways you

[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-25 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512843#comment-14512843 ] Debasish Das commented on SPARK-5992: - Did someone compared algebird LSH with spark

Re: [GitHub] spark pull request: Add dropout regularization to logistic regress...

2015-04-15 Thread Debasish Das
If there is L1 from DB's OWLQN development, why do we need dropout regularization ? On Wed, Apr 15, 2015 at 8:59 PM, rakeshchalasani g...@git.apache.org wrote: GitHub user rakeshchalasani opened a pull request: https://github.com/apache/spark/pull/5539 Add dropout regularization to

Re: Benchmaking col vs row similarities

2015-04-10 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Hi, I am benchmarking row vs col similarity flow on 60M x 10M matrices... Details are in this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 For testing I am using Netflix data since the structure is very similar: 50k x 17K near dense

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~ 80 RDD and then a sc.union is applied to combine all 80 into one for the next step (to run topByKey for example)... While creating 80 RDDs take 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint

Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data but now I am validating on internal datasets... this code will work on matrix factors and sparse matrices that has rows = 100* columns... if columns are much smaller than rows then col based flow works well...basically we need both flows... I did

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2015-04-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646 ] Debasish Das commented on SPARK-3987: - @mengxr for this testcase it was fixed but I

[jira] [Comment Edited] (SPARK-3987) NNLS generates incorrect result

2015-04-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646 ] Debasish Das edited comment on SPARK-3987 at 4/8/15 3:31 AM

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Debasish Das
sorted list by using a priority queue and dequeuing top N values. In the end, I get a record for each segment with N max values for each segment. Regards, Aung On Fri, Mar 27, 2015 at 4:27 PM, Debasish Das debasish.da...@gmail.com wrote: In that case you can directly use count-min

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388128#comment-14388128 ] Debasish Das commented on SPARK-5564: - [~sparks] we are trying to access the EC2

[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das edited comment on SPARK-3066 at 4/1/15 4:28 AM

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das commented on SPARK-3066: - Also unless the raw flow runs there is no way

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das commented on SPARK-5564: - Cool...I will run my experiments on the same

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 6:52 PM

Re: mllib.recommendation Design

2015-03-30 Thread Debasish Das
as I see the result. I am not sure if it is supported by public packages like graphlab or scikit but the plsa papers show interesting results. On Mar 30, 2015 2:31 PM, Xiangrui Meng men...@gmail.com wrote: On Wed, Mar 25, 2015 at 7:59 AM, Debasish Das debasish.da...@gmail.com wrote: Hi

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das commented on SPARK-5564: - [~josephkb] could you please point me

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM

[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-28 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.3.0) 1.4.0 Quadratic Minimization for MLlib

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
You can do it in-memory as well... get 10% topK elements from each partition and use merge from any sort algorithm like timsort... basically aggregateBy... Your version uses shuffle but this version is 0 shuffle... assuming your data set is cached you will be using in-memory allReduce through
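
The per-partition top-K idea in this message can be sketched in plain Python. This is only an illustration under stated assumptions: the partitions are modeled as ordinary lists rather than a Spark RDD, and the names `top_fraction`, `partitions`, and `fraction` are hypothetical, not taken from the thread. Each partition keeps its own k largest values, and the sorted partial results are then merged, so no full sort of the data is needed.

```python
import heapq


def top_fraction(partitions, fraction):
    """Return the largest `fraction` of all values, in descending order.

    `partitions` is a list of lists standing in for RDD partitions
    (an assumption made for this sketch).
    """
    total = sum(len(p) for p in partitions)
    k = max(1, int(total * fraction))
    # Per-partition step: each partition keeps only its own k largest values.
    # heapq.nlargest returns them sorted from largest to smallest.
    local_tops = [heapq.nlargest(k, p) for p in partitions]
    # Merge step: combine the descending partial results lazily and stop
    # after the first k values.
    merged = heapq.merge(*local_tops, reverse=True)
    return [value for _, value in zip(range(k), merged)]
```

Keeping k values per partition (rather than 10% of each partition) is what makes the merged result exact, since all of the global top k could in principle sit in a single partition.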

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Debasish Das
for your suggestions. In-memory version is quite useful. I do not quite understand how you can use aggregateBy to get 10% top K elements. Can you please give an example? Thanks, Aung On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com wrote: You can do it in-memory as well

Re: mllib.recommendation Design

2015-03-25 Thread Debasish Das
is that ALM will support MAP (and may be KL divergence loss) with sparsity constraints (probability simplex and bounds are fine for what I am focused at right now)... Thanks. Deb On Tue, Feb 17, 2015 at 4:40 PM, Debasish Das debasish.da...@gmail.com wrote: There is a usability difference...I am not sure

LogisticGradient Design

2015-03-25 Thread Debasish Das
Hi, Right now LogisticGradient implements both binary and multi-class in the same class using an if-else statement which is a bit convoluted. For Generalized matrix factorization, if the data has distinct ratings I want to use LeastSquareGradient (regression has given best results to date) but

Re: LogisticGradient Design

2015-03-25 Thread Debasish Das
multiclass logistic loss/gradient. If it's not a big hit, then it might be simpler from an outside API perspective to keep them in 1 class (even if it's more complicated within). Joseph On Wed, Mar 25, 2015 at 8:15 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Right now

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378062#comment-14378062 ] Debasish Das commented on SPARK-6323: - I did some more reading and realized that even

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das commented on SPARK-2426: - [~acopich] From your comment before Anyway, l2

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:13 AM

[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS

2015-03-23 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046 ] Debasish Das commented on SPARK-3735: - We might want to consider doing some

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375325#comment-14375325 ] Debasish Das commented on SPARK-2426: - [~acopich] There's a completely different loss

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-21 Thread Debasish Das
to track this here: SPARK-6442 https://issues.apache.org/jira/browse/SPARK-6442 The design doc is here: http://goo.gl/sf5LCE We would very much appreciate your feedback and input. Best, Burak On Thu, Mar 19, 2015 at 3:06 PM, Debasish Das debasish.da...@gmail.com wrote: Yeah

Re: Apache Spark ALS recommendations approach

2015-03-18 Thread Debasish Das
There is also a batch prediction API in PR https://github.com/apache/spark/pull/3098 Idea here is what Sean said...don't try to reconstruct the whole matrix which will be dense but pick a set of users and calculate topk recommendations for them using dense level 3 blas... we are going to merge
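
A rough sketch of the approach described here: score only a chosen set of users against all items and keep the top k per user, instead of materializing the full dense prediction matrix. It is not the API from the PR. The function `recommend_top_k` and the dictionary-based factor layout are assumptions for illustration, and plain Python dot products stand in for the level 3 BLAS call.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend_top_k(user_factors, item_factors, k):
    """Return {user_id: [(item_id, score), ...]} with the k best items per user.

    `user_factors` and `item_factors` map ids to factor vectors
    (a hypothetical layout chosen for this sketch).
    """
    recommendations = {}
    for user_id, u in user_factors.items():
        # Score this user against every item; only k results are retained.
        scored = ((item_id, dot(u, v)) for item_id, v in item_factors.items())
        recommendations[user_id] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return recommendations
```

Memory stays proportional to (number of selected users) x k rather than users x items.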

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-18 Thread Debasish Das
Hi David, We are stress testing breeze.optimize.proximal and nnls...if you are cutting a release now, we will need another release soon once we get the runtime optimizations in place and merged to breeze. Thanks. Deb On Mar 15, 2015 9:39 PM, David Hall david.lw.h...@gmail.com wrote: snapshot

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-16 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/16/15 6:30 PM

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-15 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:29 PM

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-15 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:26 PM

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-14 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361981#comment-14361981 ] Debasish Das commented on SPARK-6323: - By the way I can close the JIRA

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das commented on SPARK-6323: - g(z) is not regularization...we support

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361005#comment-14361005 ] Debasish Das edited comment on SPARK-6323 at 3/13/15 7:48 PM

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361005#comment-14361005 ] Debasish Das commented on SPARK-6323: - There are some other interesting cases

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Created] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
Debasish Das created SPARK-6323: --- Summary: Large rank matrix factorization with Nonlinear loss and constraints Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Resolved] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das resolved SPARK-4231. - Resolution: Duplicate Add RankingMetrics to examples.MovieLensALS

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-12 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359892#comment-14359892 ] Debasish Das commented on SPARK-3066: - We use the non-level 3 BLAS code in our

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351839#comment-14351839 ] Debasish Das commented on SPARK-2426: - [~mengxr] NNLS and QuadraticMinimizer are both

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351945#comment-14351945 ] Debasish Das commented on SPARK-3066: - [~josephkb] do you mean knn

[jira] [Commented] (SPARK-4823) rowSimilarities

2015-03-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das commented on SPARK-4823: - [~mengxr] I need level 3 BLAS for this JIRA

[jira] [Comment Edited] (SPARK-4823) rowSimilarities

2015-03-07 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das edited comment on SPARK-4823 at 3/8/15 6:42 AM

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column based similarities work well if the columns are mild (10K, 100K, we actually scaled it to 1.5M columns but it really stress tests the shuffle and it needs to tune the shuffle parameters)...You can either use dimsum sampling or come up with your own threshold based on your application that

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:41 PM

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das commented on SPARK-5564: - I am right now using the following PR to do

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:20 PM

[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:19 PM

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342312#comment-14342312 ] Debasish Das commented on SPARK-5564: - By the way the following step

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Debasish Das
Any reason why the regularization path cannot be implemented using current owlqn pr ? We can change owlqn in breeze to fit your needs... On Feb 24, 2015 3:27 PM, Joseph Bradley jos...@databricks.com wrote: Hi Mike, I'm not aware of a standard big dataset, but there are a number available:

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am running brute force similarity from RowMatrix on a job with 5M x 1.5M sparse matrix with 800M entries. With 200M

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
with 1.5m columns, because the output can potentially have 2.25 x 10^12 entries, which is a lot. (squares 1.5m) Best, Reza On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das debasish.da...@gmail.com wrote: Is the threshold valid only for tall skinny matrices ? Mine is 6 m x 1.5 m and I made
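
The size estimate quoted in this reply is simple arithmetic and can be checked directly. The helper names below are hypothetical. The second function assumes only the strict upper triangle of the symmetric similarity matrix is stored, which roughly halves the count but leaves it on the order of 10^12.

```python
def max_similarity_entries(num_columns):
    """Worst-case number of entries in a full column-by-column similarity matrix."""
    return num_columns * num_columns


def upper_triangle_entries(num_columns):
    """Worst-case entries if only distinct column pairs (i < j) are stored."""
    return num_columns * (num_columns - 1) // 2
```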

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
that the key would be filtered. And then after, run a flatMap or something to make Option[B] into B. On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like

Filtering keys after map+combine

2015-02-19 Thread Debasish Das
Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the key, value after map+combine stages so that I can run a filter on the keys ? Thanks. Deb

Re: Filtering keys after map+combine

2015-02-19 Thread Debasish Das
partitions and apply your filtering. Then you can finish with a reduceByKey. On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Before I send out the keys for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based
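
The suggestion in this reply (combine within each partition, filter, then finish with a reduce by key) can be sketched without Spark. Everything here is an assumption for illustration: partitions are plain lists of (key, value) pairs, values are summed, and `combine_partition` and `filtered_reduce` are hypothetical names.

```python
from collections import defaultdict


def combine_partition(records):
    """Map-side combine: sum the values per key within one partition."""
    acc = defaultdict(int)
    for key, value in records:
        acc[key] += value
    return dict(acc)


def filtered_reduce(partitions, threshold):
    """Drop per-partition partial sums below `threshold`, then reduce by key."""
    totals = defaultdict(int)
    for records in partitions:
        partial = combine_partition(records)
        for key, value in partial.items():
            # Filter before the (simulated) shuffle: small partial sums are
            # never sent on to the final reduce.
            if value >= threshold:
                totals[key] += value
    return dict(totals)
```

Note that filtering on partial sums is not equivalent to filtering on the final totals: a key whose per-partition sums are all small is dropped even if its global sum would pass the threshold.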

If job fails shuffle space is not cleaned

2015-02-18 Thread Debasish Das
Hi, Some of my jobs failed due to no space left on device and on those jobs I was monitoring the shuffle space...when the job failed shuffle space did not clean and I had to manually clean it... Is there a JIRA already tracking this issue ? If no one has been assigned to it, I can take a look.

Re: WARN from Similarity Calculation

2015-02-18 Thread Debasish Das
by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager

Re: Batch prediciton for ALS

2015-02-17 Thread Debasish Das
another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction API in ALS will be useful for us who want

Re: mllib.recommendation Design

2015-02-17 Thread Debasish Das
. For a general matrix factorization package, let's make a JIRA and move our discussion there. -Xiangrui On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am bit confused on the mllib design in the master. I thought that core algorithms will stay

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute force similarity from RowMatrix on a job with 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job run fine but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or use dimsum sampling ?

WARN from Similarity Calculation

2015-02-15 Thread Debasish Das
Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, abc.com, 48419, 0) with no recent heart beats: 66435ms exceeds 45000ms Do I need to increase the default 45 s to larger values for cases

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
... Neither play nor spray is being used in Spark right now... so it brings dependencies and we already know about the akka conflicts...thriftserver on the other hand is already integrated for JDBC access On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com wrote: Also I wanted

Batch prediciton for ALS

2015-02-10 Thread Debasish Das
Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction API in ALS will be useful for us who want to cross validate on prec@k and MAP... Thanks. Deb

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like distributed cache using Spark SQL ? If not what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM,

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
-indexedrdd On Tue, Feb 10, 2015 at 2:27 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Michael, I want to cache a RDD and define get() and set() operators on it. Basically like memcached. Is it possible to build a memcached like distributed cache using Spark SQL ? If not what do you

Re: can we insert and update with spark sql

2015-02-10 Thread Debasish Das
PM, Debasish Das debasish.da...@gmail.com wrote: Thanks...this is what I was looking for... It will be great if Ankur can give brief details about it...Basically how does it contrast with memcached for example... On Tue, Feb 10, 2015 at 2:32 PM, Michael Armbrust mich...@databricks.com wrote

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-02-03 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302932#comment-14302932 ] Debasish Das commented on SPARK-2426: - [~mengxr] [~coderxiang] David is out in Feb

Re: Welcoming three new committers

2015-02-03 Thread Debasish Das
Congratulations ! Keep helping the community :-) On Tue, Feb 3, 2015 at 5:34 PM, Denny Lee denny.g@gmail.com wrote: Awesome stuff - congratulations! :) On Tue Feb 03 2015 at 5:34:06 PM Chao Chen crazy...@gmail.com wrote: Congratulations guys, well done! 在 15-2-4 上午9:26, Nan Zhu

Re: Low Level Kafka Consumer for Spark

2015-01-16 Thread Debasish Das
Hi Dib, For our usecase I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from spark job supported in your code ? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
For CDH this works well for me...tested till 5.1... ./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -Phive -DskipTests To build with hive thriftserver support for spark-sql On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – we’re

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
protobuf comes from missing -Phadoop-2.3 On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen so...@cloudera.com wrote: What errors do you see? protobuf errors usually mean you didn't build for the right version of Hadoop, but if you are using -Phadoop-2.3 or better -Phadoop-2.4 that should be fine.

[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026 ] Debasish Das commented on SPARK-4675: - Is there a metric like MAP / AUC kind

[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048 ] Debasish Das commented on SPARK-4823: - [~srowen] did you implement map-reduce row

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149 ] Debasish Das commented on SPARK-2426: - [~mengxr] as per our discussion

[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207 ] Debasish Das commented on SPARK-4823: - Even for matrix factorization userFactors
