[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53217627
  
LGTM. Merged into master and branch-1.1! Thanks for helping on the 
documentation!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2068





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53138329
  
@atalwalkar and @mengxr I just addressed the merge conflict. I think it's 
ready to merge. Thanks.





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53138489
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19088/consoleFull) for PR 2068 at commit [`109f324`](https://github.com/apache/spark/commit/109f32403a7395002a4eab9da46841d88f62d7cc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-53141048
  
**Tests timed out** after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16561045
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
--- End diff --

How about I say:

"For example, RBF kernel of Support Vector Machines or the L1 and L2
regularized linear models typically work better when all features have
unit variance and/or zero mean."

I actually took this statement from the scikit-learn documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
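
To make the wording under discussion concrete, here is a minimal sketch of column-wise standardization in plain Scala. It is not MLlib's `StandardScaler`; the object `StandardizeSketch` and the method `standardize` are hypothetical names used only for illustration, and the choice of the corrected sample standard deviation (dividing by n - 1) is an assumption of this sketch.

```scala
// Hypothetical illustration only; this is not the MLlib implementation.
object StandardizeSketch {

  /** Shifts every column to zero mean and scales it to unit variance. */
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.length > 1, "need at least two rows to estimate a variance")
    val n = rows.length
    val dims = rows.head.length

    // Column means computed over all rows.
    val mean = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)

    // Assumption: corrected sample standard deviation (divide by n - 1).
    val std = Array.tabulate(dims) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / (n - 1))
    }

    // Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
    rows.map { r =>
      Array.tabulate(dims)(j => if (std(j) == 0.0) 0.0 else (r(j) - mean(j)) / std(j))
    }
  }

  def main(args: Array[String]): Unit = {
    // The second column has a variance orders of magnitude larger than the first.
    val data = Seq(Array(1.0, 1000.0), Array(2.0, 3000.0), Array(3.0, 5000.0))
    standardize(data).foreach(r => println(r.mkString(", ")))
  }
}
```

After the transformation both columns are on the same scale, which is the property the proposed sentence says these models tend to benefit from.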








[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-52970122
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19065/consoleFull) for PR 2068 at commit [`0a8fd34`](https://github.com/apache/spark/commit/0a8fd34dcfb45be4e0cbae0078ff7bd5b97814bc).
 * This patch **does not** merge cleanly!





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-52981186
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19065/consoleFull) for PR 2068 at commit [`0a8fd34`](https://github.com/apache/spark/commit/0a8fd34dcfb45be4e0cbae0078ff7bd5b97814bc).
 * This patch **fails** unit tests.
 * This patch **does not** merge cleanly!






[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-21 Thread atalwalkar
Github user atalwalkar commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16581683
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
--- End diff --

Your suggestion sounds good to me!  Thanks.





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/2068

[SPARK-2841][MLlib] Documentation for feature transformations

Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AlpineNow/spark transformer-documentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2068


commit e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31
Author: DB Tsai dbt...@alpinenow.com
Date:   2014-08-20T22:21:26Z

documentation







[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-52853909
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19004/consoleFull) for PR 2068 at commit [`e339f64`](https://github.com/apache/spark/commit/e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-52858796
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19004/consoleFull) for PR 2068 at commit [`e339f64`](https://github.com/apache/spark/commit/e339f64fbc35ad97a1ba021a6bf03bb6d0e06f31).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  shift # Ignore main class (org.apache.spark.deploy.SparkSubmit) and use our own`






[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2068#issuecomment-52858975
  
copy @atalwalkar





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread atalwalkar
Github user atalwalkar commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16514111
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
--- End diff --

This is too strong of a statement. Why not just say: "Normalizing features
to have unit variance and/or zero mean is a very common preprocessing step."





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread atalwalkar
Github user atalwalkar commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16514371
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
+
+Standardization can not only improve the convergence rate during the optimization process, but also
+avoid the problem that when training linear models with regularization against a feature having
+a variance that is orders of magnitude larger than others, it might dominate the objective function
+and make the estimator unable to learn from other features correctly as expected.
--- End diff --

Suggested edit: "Standardization can improve the convergence rate during
the optimization process, and also prevents features with very large
variances from exerting an overly large influence during model training."





[GitHub] spark pull request: [SPARK-2841][MLlib] Documentation for feature ...

2014-08-20 Thread atalwalkar
Github user atalwalkar commented on a diff in the pull request:

https://github.com/apache/spark/pull/2068#discussion_r16514387
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -70,4 +70,110 @@ for((synonym, cosineSimilarity) <- synonyms) {
 </div>
 </div>
 
-## TFIDF
\ No newline at end of file
+## TFIDF
+
+## StandardScaler
+
+Standardizes features by scaling to unit variance and/or removing the mean using column summary
+statistics on the samples in the training set. For example, RBF kernel of Support Vector Machines
+or the L1 and L2 regularized linear models typically assume that all features have unit variance
+and/or zero mean.
+
+Standardization can not only improve the convergence rate during the optimization process, but also
+avoid the problem that when training linear models with regularization against a feature having
+a variance that is orders of magnitude larger than others, it might dominate the objective function
+and make the estimator unable to learn from other features correctly as expected.
+
+### Model Fitting
+
+[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the
+following parameters in the constructor,
--- End diff --

`,` -> `:`

