[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243456#comment-14243456 ] Debasish Das commented on SPARK-2426: - [~akopich] I got good MAP results

[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241535#comment-14241535 ] Debasish Das commented on SPARK-4675: - There are few issues: 1. Batch API for topK

[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-10 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031 ] Debasish Das commented on SPARK-4823: - I am considering coming up with a baseline

[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242034#comment-14242034 ] Debasish Das commented on SPARK-4675: - [~josephkb] how do we validate that low

Row Similarity

2014-12-10 Thread Debasish Das
Hi, It seems there are multiple places where we would like to compute row similarity (accurate or approximate similarities) Basically through RowMatrix columnSimilarities we can compute column similarities of a tall skinny matrix Similarly we should have an API in RowMatrix called

Re: Row Similarity

2014-12-10 Thread Debasish Das
of a matrix A (i.e. computing AA^T, which is expensive). There is a JIRA to track handling (1) and (2) more efficiently than computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066 On Wed, Dec 10, 2014 at 2:44 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, It seems

Re: DIMSUM and ColumnSimilarity use case ?

2014-12-10 Thread Debasish Das
If you have tall x skinny matrix of m users and n products, column similarity will give you a n x n matrix (product x product matrix)...this is also called product correlation matrix...it can be cosine, pearson or other kind of correlations...Note that if the entry is unobserved (user Joanary did
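The shape argument in this message (an m x n user-by-product matrix yields an n x n product-by-product similarity matrix) can be illustrated on a small in-memory matrix. This is only a plain-Scala sketch of cosine similarity between columns, not MLlib's implementation; the names `ColumnSimilaritySketch` and `cosineColumns` are hypothetical, and storing unobserved entries as 0.0 is an assumption made for the example.

```scala
object ColumnSimilaritySketch {
  // rows = users, columns = products; unobserved entries are stored as 0.0 (an assumption).
  // Expects a non-empty, rectangular matrix.
  def cosineColumns(rows: Array[Array[Double]]): Array[Array[Double]] = {
    val n = rows.head.length
    // Euclidean norm of each product column.
    val norms = Array.tabulate(n)(j => math.sqrt(rows.map(r => r(j) * r(j)).sum))
    // Entry (i, j) is the cosine of the angle between columns i and j.
    Array.tabulate(n, n) { (i, j) =>
      val dot = rows.map(r => r(i) * r(j)).sum
      if (norms(i) == 0.0 || norms(j) == 0.0) 0.0 else dot / (norms(i) * norms(j))
    }
  }
}
```

For three users and two products, `cosineColumns` returns a 2 x 2 matrix regardless of how many user rows are supplied, which is why the approach suits tall, skinny inputs.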

Re: Learning rate or stepsize automation

2014-12-08 Thread Debasish Das
Hi Bui, Please use BFGS based solvers...For BFGS you don't have to specify step size since the line search will find sufficient decrease each time... Regularization you still have to do grid search...it's not possible to automate that but on master you will find nice ways to automate grid

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as a post-processing step on the product similarity graph...I call it product similarity but for each product you build a node which keeps distinct users visiting the product and two product nodes are connected by an edge if the intersection > 0...you are assuming if no one user

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects it on master... If you want topk for a key run map / mappartition and use a bounded priority queue and reducebykey the queues. I experimented with topk from algebird and bounded priority queue wrapped over jpriority queue ( spark default)...bpq is faster Code example is here:
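The code linked from the original message is not reproduced in this listing. As a rough illustration only, the idea of keeping a bounded priority queue per key and merging the queues could look like the plain-Scala sketch below. It uses local collections as stand-ins for the per-partition pass and for `reduceByKey`; all names (`TopKPerKeySketch`, `add`, `merge`, `topK`) are hypothetical, and `k >= 1` is assumed.

```scala
import scala.collection.mutable

object TopKPerKeySketch {
  type Heap[T] = mutable.PriorityQueue[T]

  // Reversed ordering turns the max-heap into a min-heap, so `head` is the smallest kept value.
  def emptyHeap[T](implicit ord: Ordering[T]): Heap[T] =
    mutable.PriorityQueue.empty[T](ord.reverse)

  // Insert x while keeping at most k values: evict the current minimum if x beats it.
  def add[T](heap: Heap[T], x: T, k: Int)(implicit ord: Ordering[T]): Heap[T] = {
    if (heap.size < k) heap.enqueue(x)
    else if (ord.gt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
    heap
  }

  // Combine two bounded heaps; this plays the role of the function passed to reduceByKey.
  def merge[T](a: Heap[T], b: Heap[T], k: Int)(implicit ord: Ordering[T]): Heap[T] = {
    b.foreach(x => add(a, x, k))
    a
  }

  // Stand-in for the per-partition pass: one bounded heap per key.
  def perPartition[K, T](part: Seq[(K, T)], k: Int)(implicit ord: Ordering[T]): Map[K, Heap[T]] =
    part.groupBy(_._1).map { case (key, kvs) =>
      key -> kvs.foldLeft(emptyHeap[T])((h, kv) => add(h, kv._2, k))
    }

  // Stand-in for reduceByKey across partitions; values come back in descending order.
  def topK[K, T](parts: Seq[Seq[(K, T)]], k: Int)(implicit ord: Ordering[T]): Map[K, List[T]] =
    parts.flatMap(p => perPartition(p, k).toSeq)
      .groupBy(_._1)
      .map { case (key, heaps) =>
        key -> heaps.map(_._2).reduce((a, b) => merge(a, b, k)).toList.sorted(ord.reverse)
      }
}
```

Because each heap never holds more than k values, the amount of data merged per key stays bounded no matter how many values the key has.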

Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell: scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread Debasish Das
with Jellyfish code http://i.stanford.edu/hazy/victor/Hogwild/), will reproduce the failure... https://issues.apache.org/jira/browse/SPARK-4231 The failed job I will debug more and figure out the real cause. If needed I will open up new JIRAs. On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Debasish Das
-1 from me...same FetchFailed issue as what Hector saw... I am running Netflix dataset and dumping out recommendation for all users. It shuffles around 100 GB data on disk to run a reduceByKey per user on utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset... I gave Spark 10

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024 ] Debasish Das commented on SPARK-1405: - We need a larger dataset as well where topics

[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222027#comment-14222027 ] Debasish Das commented on SPARK-1405: - [~pedrorodriguez] did you write the metric

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222089#comment-14222089 ] Debasish Das commented on SPARK-1405: - NIPS dataset is common for PLSA and additive

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108 ] Debasish Das commented on SPARK-1405: - @sparks that will be awesome...I should be fine

[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-21 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221379#comment-14221379 ] Debasish Das commented on SPARK-3066: - I did experiments on MovieLens dataset

[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-21 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221505#comment-14221505 ] Debasish Das edited comment on SPARK-1405 at 11/21/14 10:28 PM

[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2014-11-20 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219799#comment-14219799 ] Debasish Das commented on SPARK-4231: - [~srowen] I added batch predict APIs for user

[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218667#comment-14218667 ] Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218667#comment-14218667 ] Debasish Das commented on SPARK-3066: - @mengxr as per our discussions, I added APIs

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218845#comment-14218845 ] Debasish Das commented on SPARK-1405: - I would like to compare the LSA formulations

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218891#comment-14218891 ] Debasish Das commented on SPARK-2426: - With the MAP measures being added

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218891#comment-14218891 ] Debasish Das edited comment on SPARK-2426 at 11/20/14 2:13 AM

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218941#comment-14218941 ] Debasish Das commented on SPARK-1405: - For LSA you can find references on the PR

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
and appears in test, we can simply ignore it. -Xiangrui On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das debasish.da...@gmail.com wrote: Sean, I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key... Xiangrui, What's

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner so be careful about the performance...groupByKey does shuffle even for local groups... reduceByKey and aggregateByKey do run a combiner but if you want a separate function for each key, you can have a key to closure map that you can broadcast and use it in
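The "key to closure map" suggestion can be sketched in plain Scala. The message is cut off before it says where the map is used, so the wiring below is an assumption: a local `groupBy` plus `reduce` stands in for the Spark shuffle, and the `combiners` table stands in for the map one would broadcast. The names `PerKeyReduceSketch`, `combiners` and `reducePerKey` are hypothetical.

```scala
object PerKeyReduceSketch {
  // Hypothetical key -> combine-function table; in Spark this is the map one would broadcast.
  val combiners: Map[String, (Int, Int) => Int] = Map(
    "sum" -> ((a: Int, b: Int) => a + b),
    "max" -> ((a: Int, b: Int) => math.max(a, b))
  )

  // Local stand-in for a keyed reduce: the function applied depends on the key.
  def reducePerKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, kvs) =>
      key -> kvs.map(_._2).reduce(combiners(key))
    }
}
```

Each function in the table should be associative and commutative, since a combiner may apply it to partial results in any order.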

Re: Is there a way to create key based on counts in Spark

2014-11-18 Thread Debasish Das
Use zipWithIndex but cache the data before you run zipWithIndex...that way your ordering will be consistent (unless the bug has been fixed where you don't have to cache the data)... Normally these operations are used for dictionary building and so I am hoping you can cache the dictionary of

Re: Spark on YARN

2014-11-18 Thread Debasish Das
I run my Spark on YARN jobs as: HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit --master yarn --jars test-job.jar --executor-cores 4 --num-executors 10 --executor-memory 16g --driver-memory 4g --class TestClass test.jar It uses HADOOP_CONF_DIR to schedule executors and

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew, I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap followed by groupBy...My cluster memory is less than the memory I need and therefore flatMap does around 400 GB of shuffle...memory is around 120 GB... 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in

Using sampleByKey

2014-11-17 Thread Debasish Das
Hi, I have a rdd whose key is a userId and value is (movieId, rating)... I want to sample 80% of the (movieId,rating) that each userId has seen for train, rest is for test... val indexedRating = sc.textFile(...).map{x= Rating(x(0), x(1), x(2)) val keyedRatings = indexedRating.map{x =
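The goal stated in this message (keep 80% of each user's ratings for training and the rest for testing) can be illustrated on a local collection. This is a plain-Scala sketch with hypothetical names (`PerUserSplitSketch`, `split`); it is not the poster's code and does not call Spark's `sampleByKey`. It takes an exact per-user count by shuffling, which is one possible reading of the requirement.

```scala
import scala.util.Random

object PerUserSplitSketch {
  type Keyed = (Int, (Int, Double)) // (userId, (movieId, rating))

  // Returns (train, test); each user's ratings are shuffled and cut at trainFraction.
  def split(ratings: Seq[Keyed], trainFraction: Double, seed: Long): (Seq[Keyed], Seq[Keyed]) = {
    val rng = new Random(seed)
    // Sort users so the shuffle order (and hence the split) is reproducible for a given seed.
    val perUser = ratings.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, rs) =>
      val nTrain = math.round(rs.size * trainFraction).toInt
      rng.shuffle(rs).splitAt(nTrain)
    }
    (perUser.flatMap(_._1), perUser.flatMap(_._2))
  }
}
```

With a fraction of 0.8, a user with 5 ratings contributes 4 to train and 1 to test, and a user with 10 ratings contributes 8 and 2.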

Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
only if output RDD is expected to be partitioned by some key. RDD[X].flatmap(X=RDD[Y]) If it has to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das

Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi, If I look inside algebird Monoid implementation it uses java.io.Serializable... But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't see a KryoRegistrator for CMS and HLL monoid... In these examples we will run with Kryo serialization on CMS and HLL or they will be java

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209936#comment-14209936 ] Debasish Das commented on SPARK-3066: - On our internal datasets, flatMap is slow...I

TimSort in 1.2

2014-11-13 Thread Debasish Das
Hi, I am noticing the first step for Spark jobs does a TimSort in 1.2 branch...and there is some time spent doing the TimSort...Is this assigning the RDD blocks to different nodes based on a sort order ? Could someone please point to a JIRA about this change so that I can read more about it ?

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ? Thanks. Deb

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-11 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207298#comment-14207298 ] Debasish Das commented on SPARK-3066: - [~mengxr] I am testing recommendAllUsers

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-10 Thread Debasish Das
/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote

[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2014-11-06 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200373#comment-14200373 ] Debasish Das commented on SPARK-4231: - [~coderxiang] [~mengxr] [~srowen] I looked

[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2014-11-06 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200426#comment-14200426 ] Debasish Das commented on SPARK-4231: - [~srowen] I need a standard metric to report

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Debasish Das
+1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse

[jira] [Created] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2014-11-04 Thread Debasish Das (JIRA)
Debasish Das created SPARK-4231: --- Summary: Add RankingMetrics to examples.MovieLensALS Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type

Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi, I just built the master today and I was testing the IR metrics (MAP and prec@k) on Movielens data to establish a baseline... I am getting a weird error which I have not seen before: MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example mllib.MovieLensALS --kryo --lambda 0.065

[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-03 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.0.0) 1.2.0 Quadratic Minimization for MLlib

[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-03 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.2.0) 1.3.0 Quadratic Minimization for MLlib

Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
:24 PM, Sean Owen so...@cloudera.com wrote: MAP is effectively an average over all k from 1 to min(# recommendations, # items rated) Getting first recommendations right is more important than the last. On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das debasish.da...@gmail.com wrote
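The remark that early recommendations matter more than late ones can be made concrete with a small average-precision sketch. This is a plain-Scala illustration under one common convention (sum the precision at each position where a relevant item appears, then divide by the number of relevant items); normalizations differ between libraries, so it should not be read as the exact formula used in the thread. The names `MapSketch`, `averagePrecision` and `meanAveragePrecision` are hypothetical.

```scala
object MapSketch {
  // Average precision for one user: `ranked` is the recommendation list, best first.
  def averagePrecision[A](ranked: Seq[A], relevant: Set[A]): Double =
    if (relevant.isEmpty) 0.0
    else {
      var hits = 0
      var sum = 0.0
      ranked.zipWithIndex.foreach { case (item, i) =>
        if (relevant.contains(item)) {
          hits += 1
          sum += hits.toDouble / (i + 1) // precision at position i + 1
        }
      }
      sum / relevant.size
    }

  // MAP: the mean of the per-user average precisions.
  def meanAveragePrecision[A](users: Seq[(Seq[A], Set[A])]): Double =
    if (users.isEmpty) 0.0
    else users.map { case (ranked, rel) => averagePrecision(ranked, rel) }.sum / users.size
}
```

With a single relevant item, placing it first scores 1.0 while placing it third scores 1/3, which is the position sensitivity described above.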

MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191502#comment-14191502 ] Debasish Das commented on SPARK-3987: - Nope...standard ALS...same as netflix params

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191551#comment-14191551 ] Debasish Das commented on SPARK-2426: - [~mengxr] The matlab comparison scripts

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191551#comment-14191551 ] Debasish Das edited comment on SPARK-2426 at 10/31/14 8:04 AM

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095232#comment-14095232 ] Debasish Das edited comment on SPARK-2426 at 10/31/14 4:20 PM

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191997#comment-14191997 ] Debasish Das commented on SPARK-2426: - Matlab comparisons of MOSEK, ECOS, PDCO

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-31 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192935#comment-14192935 ] Debasish Das commented on SPARK-2426: - Refactored QuadraticMinimizer and NNLS from

[jira] [Reopened] (SPARK-3987) NNLS generates incorrect result

2014-10-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das reopened SPARK-3987: - I can send you a further list of failures...this is one more example...I strongly suggest moving

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191245#comment-14191245 ] Debasish Das commented on SPARK-3987: - NNLS iters 36 result

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191277#comment-14191277 ] Debasish Das commented on SPARK-3987: - Were there more changes than step size in your

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-30 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191387#comment-14191387 ] Debasish Das commented on SPARK-3987: - [~mengxr] this came out of an internal dataset

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
wonder if it is possible to extend the DIMSUM idea to computing top K matrix multiply between the user and item factor matrices, as opposed to all-pairs similarity of one matrix? On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das debasish.da...@gmail.com wrote: Is there an example of how to use

Re: matrix factorization cross validation

2014-10-30 Thread Debasish Das
any of the topic modeling algorithms as well... Is there a better place for it other than mllib examples ? On Thu, Oct 30, 2014 at 8:13 AM, Debasish Das debasish.da...@gmail.com wrote: I thought topK will save us...for each user we have 1xrank...now our movie factor is a RDD...we pick topK movie

matrix factorization cross validation

2014-10-29 Thread Debasish Das
Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth looking into. If we consider the problem as a regression problem and the ratings 1-5 are considered as 5 classes, it is possible to generate a

Re: matrix factorization cross validation

2014-10-29 Thread Debasish Das
, Debasish Das debasish.da...@gmail.com wrote: Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth looking into. If we consider the problem as a regression problem and the ratings 1-5

Re: matrix factorization cross validation

2014-10-29 Thread Debasish Das
to examples.MovielensALS. ROC should be good to add as well. -Xiangrui On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, In the current factorization flow, we cross validate on the test dataset using the RMSE number but there are some other measures which are worth

Re: Spark LIBLINEAR

2014-10-27 Thread Debasish Das
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote: Debasish Das writes: If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically following LBFGS Logistic Regression come up with LBFGS based linear SVM... About integrating TRON in mllib, David

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
If the SVM is not already migrated to BFGS, that's the first thing you should try...Basically following LBFGS Logistic Regression come up with LBFGS based linear SVM... About integrating TRON in mllib, David already has a version of TRON in breeze but someone needs to validate it for linear SVM

Re: Spark LIBLINEAR

2014-10-24 Thread Debasish Das
@dbtsai for condition number what did you use ? Diagonal preconditioning of the inverse of B matrix ? But then B matrix keeps on changing...did u condition it after every few iterations ? Will it be possible to put that code in Breeze since it will be very useful to condition other solvers as

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180001#comment-14180001 ] Debasish Das commented on SPARK-3987: - I will test it but this is how I called NNLS

[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2014-10-22 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180314#comment-14180314 ] Debasish Das commented on SPARK-3987: - [~coderxiang] changing to 1e-6 to 1e-7 fixes

Re: Solving linear equations

2014-10-22 Thread Debasish Das
Hi Martin, This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;] and x is what you want to find..B is 0 in this case...For mllib normally this is label...basically create a labeledPoint where label is 0 always... Use mllib's linear regression and solve the following

Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
the architecture. It has all the things you are thinking about:) Thanks, Jayant On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-19 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176358#comment-14176358 ] Debasish Das commented on SPARK-2426: - [~mengxr] I thought more on it and one

Re: Oryx + Spark mllib

2014-10-19 Thread Debasish Das
wrote: Oryx 2 seems to be geared for Spark https://github.com/OryxProject/oryx 2014-10-18 11:46 GMT-04:00 Debasish Das debasish.da...@gmail.com: Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either

Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming data / Batch data in HDFS and cross validated with mllib APIs but the model serving layer will give API endpoints like Oryx and read the models may be from

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-17 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175167#comment-14175167 ] Debasish Das commented on SPARK-2426: - 1. [~mengxr] Our legal was clear that Stanford

NNLS bug

2014-10-17 Thread Debasish Das
Hi, I am validating the proximal algorithm for positive and bound constrained ALS and I came across the bug detailed in the JIRA while running ALS with NNLS: https://issues.apache.org/jira/browse/SPARK-3987 ADMM based proximal algorithm came up with correct result... Thanks. Deb

[jira] [Created] (SPARK-3987) NNLS generates incorrect result

2014-10-16 Thread Debasish Das (JIRA)
Debasish Das created SPARK-3987: --- Summary: NNLS generates incorrect result Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
in a different implementation and it has worked fine. Now I have to go hunt for how the QR decomposition is exposed in BLAS... Looks like its GEQRF which JBLAS helpfully exposes. Debasish you could try it for fun at least. On Oct 15, 2014 8:06 PM, Debasish Das debasish.da...@gmail.com wrote: But do

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
Just checked, QR is exposed by netlib: import org.netlib.lapack.Dgeqrf For the equality and bound version, I will use QR...it will be faster than the LU that I am using through jblas.solveSymmetric... On Thu, Oct 16, 2014 at 8:34 AM, Debasish Das debasish.da...@gmail.com wrote: @xiangrui

Issues with ALS positive definite

2014-10-15 Thread Debasish Das
Hi, If I take the Movielens data and run the default ALS with regularization as 0.0, I am hitting exception from LAPACK that the gram matrix is not positive definite. This is on the master branch. This is how I run it : ./bin/spark-submit --total-executor-cores 1 --master spark://

Re: Issues with ALS positive definite

2014-10-15 Thread Debasish Das
, 2014 at 5:01 PM, Liquan Pei liquan...@gmail.com wrote: Hi Debaish, I think ||r - wi'hj||^{2} is semi-positive definite. Thanks, Liquan On Wed, Oct 15, 2014 at 4:57 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, If I take the Movielens data and run the default ALS with regularization

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei ! Congratulations to the databricks team and all the community members... On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which

Re: protobuf error running spark on hadoop 2.4

2014-10-08 Thread Debasish Das
I have faced this in the past and I had to add the profile -Phadoop-2.3... mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote: Hi: I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple

Local tests logging to log4j

2014-10-07 Thread Debasish Das
Hi, I have added some changes to ALS tests and I am re-running tests as: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DwildcardSuites=org.apache.spark.mllib.recommendation.ALSSuite test I have some INFO logs in the code which I want to see on my console. They work fine if I add

Re: Local tests logging to log4j

2014-10-07 Thread Debasish Das
=ERROR log4j.logger.org.apache.zookeeper=WARN log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.I0Itec.zkclient=WARN On Tue, Oct 7, 2014 at 7:42 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I have added some changes to ALS tests and I am re-running tests as: mvn

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest of them, only cache if you have a lot of free memory ! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote: I think you mean that data2 is a function of data1 in the

Impala comparisons

2014-10-04 Thread Debasish Das
Hi, We write the output of models and other information as parquet files and later we let data APIs run SQL queries on the columnar data... SparkSQL is used to dump the data in parquet format and now we are considering whether using SparkSQL or Impala to read it back... I came across this

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of object which can be generic ? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com wrote: Just realized that, of course, objects can't be generic, but how do I create a
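The empty class body in this reply could be filled in along the following lines. To keep the sketch compilable without Spark on the classpath, it declares a local stand-in trait with the two methods that Spark 1.x's `AccumulatorParam` requires an implementer to define (`zero` and `addInPlace`); the stand-in trait, the wrapper object name and the choice to concatenate sequences are assumptions made for illustration.

```scala
object GenericAccumulatorSketch {
  // Local stand-in mirroring the two abstract methods of Spark 1.x's AccumulatorParam,
  // so this example compiles on its own.
  trait AccumulatorParam[T] {
    def zero(initialValue: T): T
    def addInPlace(t1: T, t2: T): T
  }

  // A class (unlike an object) can take a type parameter.
  class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
    // The identity element for concatenation is the empty sequence.
    def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
    // Merging two partial results appends one sequence to the other.
    def addInPlace(t1: Seq[B], t2: Seq[B]): Seq[B] = t1 ++ t2
  }
}
```

Against the real Spark 1.x trait, an instance would be supplied where the implicit parameter is expected, for example `sc.accumulator(Seq.empty[String])(new GenericAccumulator[String])`.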

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote: We don't handle missing value imputation in the current version of MLlib. In future releases, we can store feature information in the

Cluster tests failing

2014-09-30 Thread Debasish Das
Hi, Inside mllib I am running tests using: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn install The local tests run fine but cluster tests are failing.. LBFGSClusterSuite: - task size should be small *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage

Re: Cluster tests failing

2014-09-30 Thread Debasish Das
I have done mvn clean several times... Consistently all the mllib tests that are using LocalClusterSparkContext.scala, they fail !

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big build it on graphx...but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote: Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the

Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only fit the data in memory where you want to run the iterative algorithm... For map-reduce operations, it's better not to cache if you have a memory crunch... Also schedule the persist and unpersist such that you utilize the RAM well... On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei

Re: Hyper Parameter Optimization Algorithms

2014-09-29 Thread Debasish Das
You should look into Evan Sparks's talk from Spark Summit 2014 http://spark-summit.org/2014/talk/model-search-at-scale I am not sure if some of it is already open sourced through MLBase... On Mon, Sep 29, 2014 at 7:45 PM, Lochana Menikarachchi locha...@gmail.com wrote: Hi, Is there anyone

Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to
