[GitHub] incubator-spark pull request: SPARK-1129: use a predefined seed wh...

2014-02-24 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/645 SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom If the seed is zero, XORShift generates all zeros, which would produce unexpected results. JIRA: https://spark
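A minimal sketch, in Python rather than Spark's Scala, of why a zero seed is a problem for an XORShift generator: each update only shifts and XORs the state, so a zero state can never leave zero. The shift triple (21, 35, 4), the 64-bit mask, and the constant `FALLBACK_SEED` are assumptions made for illustration; the value the pull request actually substitutes is not shown in this snippet.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound

# Hypothetical non-zero replacement seed; not the constant used in the PR.
FALLBACK_SEED = 0x9E3779B97F4A7C15


def xorshift_step(state: int) -> int:
    """One XORShift update on a 64-bit state (shift triple 21/35/4 assumed)."""
    state ^= (state << 21) & MASK64
    state ^= state >> 35
    state ^= (state << 4) & MASK64
    return state


def make_generator(seed: int):
    """Yield successive states, swapping in FALLBACK_SEED when the seed is zero."""
    state = (seed & MASK64) or FALLBACK_SEED
    while True:
        state = xorshift_step(state)
        yield state
```

Calling `xorshift_step(0)` returns 0 again, which is the failure described above; `make_generator(0)` avoids it by never starting from a zero state.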

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-24 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35918773 @fommil Either AL2 or MPL should work. We only need appropriate labeling for MPL, which is trivial. And thanks for the suggestion of making native libraries

[GitHub] incubator-spark pull request: Initialized the regVal for first ite...

2014-02-24 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/633#issuecomment-35915806 @dbtsai Since regVal remains 0.0 for any existing updater in MLlib, it would make more sense if this change comes with the L-BFGS PR you are working on. --- If

[GitHub] incubator-spark pull request: Principal Component Analysis

2014-02-24 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/564#discussion_r9983939 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/PCA.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] incubator-spark pull request: SPARK-1122: allCollect functions for...

2014-02-23 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35849312 @markhamstra @pwendell For the use cases, this allCollect operation may be useful in the grid search for a good set of training parameters for machine learning

[GitHub] incubator-spark pull request: SPARK-1117: update accumulator docs

2014-02-21 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/631 SPARK-1117: update accumulator docs The current doc suggests that Spark doesn't support accumulators of type `Long`, which is wrong. JIRA: https://spark-project.atlassian.net/browse/

[GitHub] incubator-spark pull request: MLLIB-25: Implicit ALS runs out of m...

2014-02-21 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/629#issuecomment-35750636 LGTM if Travis passes (no reason it shouldn't). Thanks for the fix!

[GitHub] incubator-spark pull request: Principal Component Analysis

2014-02-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/incubator-spark/pull/564#discussion_r9899395 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/PCA.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35557645 @fommil Thanks a lot! The license JIRA is also interesting to follow ~ :)

[GitHub] incubator-spark pull request: MLLIB-24: url of "Collaborative Filt...

2014-02-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/619#issuecomment-35535759 DOI links are "permanent" so we don't need to worry about the link becoming invalid again. People will do a search and find the pdf easily if t

[GitHub] incubator-spark pull request: MLLIB-24: url of "Collaborative Filt...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/619#issuecomment-35470291 Do you mind using the DOI link of the paper: http://dx.doi.org/10.1109/ICDM.2008.22 ?

[GitHub] incubator-spark pull request: SPARK-1106: check key name and ident...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/617#issuecomment-35459427 retest this please

[GitHub] incubator-spark pull request: check key name and identity file bef...

2014-02-18 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/617 check key name and identity file before launch a cluster I launched an EC2 cluster without providing a key name and an identity file. The error showed up after two minutes. It would be good

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35449886 @fommil @MLnick I included MTJ in the benchmarks (see the updated comment above). Basically it performs very similarly to breeze. @martinjaggi Gradient

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35442089 @fommil I have the native vecLib BLAS/LAPACK shipped with Mac OS X and OpenBLAS installed for testing. OpenBLAS is not on the search path. I deleted both and re

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35433131 Thanks all for the suggestions! @srowen @giyengar I updated the small benchmark suite to include commons-math3. It seems to me commons-math3 has a couple

[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

2014-02-17 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/572#discussion_r9800999 I'm not sure which style to use. @rxin ? I prefer the following: ~~~ map { fold => ( // "((&

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-14 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35132098 @fommil Yes, I mentioned the benchmark suite from Peter to @srowen in my previous comment, but it is designed for dense linear algebra. I put some of the code I

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35038739 @fommil I don't quite understand what "roll their own" means exactly here. I didn't propose to re-implement one or half linear algebra library

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35030774 @fommil MTJ uses the LGPL. See http://www.apache.org/legal/resolved.html

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35018212 @MLnick MTJ is not an option because of its license.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35017848 @shivaram @srowen @giyengar Thanks for keeping the discussion running! @shivaram The requirement is to add sparse data support in all existing MLlib

[GitHub] incubator-spark pull request: SPARK-1076: [Fix #578] add @transien...

2014-02-12 Thread mengxr
Github user mengxr closed the pull request at: https://github.com/apache/incubator-spark/pull/591

[GitHub] incubator-spark pull request: SPARK-1076: [Fix #578] add @transien...

2014-02-12 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/591 SPARK-1076: [Fix #578] add @transient to some vals I'll try to be more careful next time. You can merge this pull request into a Git repository by running: $ git pull

[GitHub] incubator-spark pull request: SPARK-1076: Convert Int to Long to a...

2014-02-12 Thread mengxr
Github user mengxr closed the pull request at: https://github.com/apache/incubator-spark/pull/589

[GitHub] incubator-spark pull request: SPARK-1076: Convert Int to Long to a...

2014-02-12 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/589#issuecomment-34906057 I will make another PR for the second commit. Next time we should leave the PR open for half a day or a day before merging.

[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

2014-02-12 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-34901227 @holdenk How about splitting this PR into two? One contains the k-fold splitting method in mllib and the fix to BernoulliSampler, and the other contains the

[GitHub] incubator-spark pull request: SPARK-1076: Convert Int to Long to a...

2014-02-12 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/589 SPARK-1076: Convert Int to Long to avoid overflow Patch for PR #578.
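A rough illustration of the overflow this patch title refers to, written in Python with 32-bit arithmetic emulated by hand. That the affected values are running per-partition start offsets is an assumption drawn from the reference to PR #578; the function names and the sample counts are hypothetical.

```python
INT_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap an arbitrary integer the way a 32-bit signed Int would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value


def start_offsets_int32(partition_counts):
    """Start offset per partition, accumulated in emulated 32-bit arithmetic."""
    offsets, total = [], 0
    for count in partition_counts:
        offsets.append(total)
        total = to_int32(total + count)
    return offsets


def start_offsets_int64(partition_counts):
    """Same computation with a wide accumulator (Python ints do not overflow)."""
    offsets, total = [], 0
    for count in partition_counts:
        offsets.append(total)
        total += count
    return offsets
```

With two partitions of 1.5 billion records each, the 32-bit version hands the third partition a negative start offset, while the wide accumulator gives 3,000,000,000.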

[GitHub] incubator-spark pull request: SPARK-1076: zipWithIndex and zipWith...

2014-02-12 Thread mengxr
Github user mengxr closed the pull request at: https://github.com/apache/incubator-spark/pull/578

[GitHub] incubator-spark pull request: Adding assignRanks and assignUniqueI...

2014-02-11 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/578#issuecomment-34845415 The link is at the bottom of the PR description.

[GitHub] incubator-spark pull request: Adding assignRanks and assignUniqueI...

2014-02-11 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/578#issuecomment-34842407 @rxin Thanks! Please see the updated code.

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34714242 @srowen Thanks for the information! I believe native BLAS/LAPACK libraries perform much better than Java implementations for level 2 and level 3 operations, but

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34712992 @debasish83 Are you speaking of the benchmark I posted to the JIRA? BLAS/LAPACK cannot be used for dense vector + sparse vector. Those are designed for dense
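For context, a small Python sketch of the dense + sparse update mentioned above. It assumes the sparse vector is stored as parallel `indices` and `values` lists; that layout and the function name are illustrative and not taken from the pull request.

```python
def add_sparse_into_dense(dense, indices, values, alpha=1.0):
    """In place: dense[i] += alpha * v for each stored (i, v) of the sparse vector."""
    for i, v in zip(indices, values):
        dense[i] += alpha * v
    return dense
```

The loop touches only the stored entries, so its cost scales with the number of non-zeros rather than with the full vector length.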

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-34707127 @sscdotopen @debasish83 , I'm okay with copying VectorWritable and removing mahout-core from dependencies. @srowen Just as you mentioned, the sparse v

[GitHub] incubator-spark pull request: Adding assignRanks and assignUniqueI...

2014-02-10 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/578 Adding assignRanks and assignUniqueIds to RDD Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then
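The description is cut off after the counting step. One common way to finish it, assumed here and not confirmed by the snippet, is to turn the per-partition counts into start offsets and add each record's local position to its partition's offset. The Python sketch below models partitions as plain lists; `zip_with_index` is an illustrative name, not the PR's API.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair each record with a global index, partition by partition."""
    counts = [len(part) for part in partitions]      # pass 1: count per partition
    starts = [0] + list(accumulate(counts))[:-1]     # prefix sums give start offsets
    return [
        [(record, start + i) for i, record in enumerate(part)]  # pass 2: offset + local index
        for part, start in zip(partitions, starts)
    ]
```

For example, `zip_with_index([["a", "b"], [], ["c"]])` gives `"c"` the index 2, since the two earlier partitions hold two records in total.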

[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

2014-02-10 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/572#issuecomment-34668194 @holdenk , the PartitionwiseSampledRDD was designed with this use case in mind. Both the folded RDD and its complement can be represented by
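A sketch of how a fold and its complement can come from one sampler, assuming a range-based Bernoulli scheme: each record gets a uniform draw in [0, 1), fold `i` keeps draws in [i/k, (i+1)/k), and the training set keeps everything else. The function names, the `complement` flag, and the reuse of one seed per fold are assumptions for illustration; the truncated comment does not spell out the actual design.

```python
import random


def sample_range(records, lower, upper, seed, complement=False):
    """Keep records whose draw falls in [lower, upper), or outside it if complement."""
    rng = random.Random(seed)  # same seed gives the same draw per record position
    kept = []
    for record in records:
        inside = lower <= rng.random() < upper
        if inside != complement:
            kept.append(record)
    return kept


def k_fold(records, k, seed):
    """Return k (training, validation) pairs, one per fold."""
    folds = []
    for i in range(k):
        lower, upper = i / k, (i + 1) / k
        validation = sample_range(records, lower, upper, seed)
        training = sample_range(records, lower, upper, seed, complement=True)
        folds.append((training, validation))
    return folds
```

Because every pass replays the same seeded draws, each record lands in exactly one validation fold, and each training set is the exact complement of its validation set.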

[GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...

2014-02-10 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/incubator-spark/pull/575 [Proposal] Adding sparse data support and update KMeans This is a proposal for sparse data support in mllib (https://spark-project.atlassian.net/browse/MLLIB-18). The idea of the

[GitHub] incubator-spark pull request: Support negative implicit input in A...

2014-02-09 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/500#issuecomment-34591039 LGTM and thanks for fixing some existing errors!