spark git commit: [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer

2016-06-24 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 201d5e8db -> 76741b570


[SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF, QuantileDiscretizer and CountVectorizer

Author: GayathriMurali 

Closes #13745 from GayathriMurali/SPARK-15997.

(cherry picked from commit be88383e15a86d094963de5f7e8792510bc990de)
Signed-off-by: Nick Pentreath 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76741b57
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76741b57
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76741b57

Branch: refs/heads/branch-2.0
Commit: 76741b570e20eb7957ada28ad3c5babc0abb738f
Parents: 201d5e8
Author: GayathriMurali 
Authored: Fri Jun 24 13:25:40 2016 +0200
Committer: Nick Pentreath 
Committed: Fri Jun 24 13:26:28 2016 +0200

--
 docs/ml-features.md | 29 
 .../ml/JavaQuantileDiscretizerExample.java  |  7 -
 .../python/ml/quantile_discretizer_example.py   | 11 ++--
 .../ml/QuantileDiscretizerExample.scala |  9 --
 4 files changed, 38 insertions(+), 18 deletions(-)
--


spark git commit: [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer

2016-06-24 Thread mlnick
Repository: spark
Updated Branches:
  refs/heads/master 158af162e -> be88383e1


[SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF, QuantileDiscretizer and CountVectorizer

Author: GayathriMurali 

Closes #13745 from GayathriMurali/SPARK-15997.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/be88383e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/be88383e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/be88383e

Branch: refs/heads/master
Commit: be88383e15a86d094963de5f7e8792510bc990de
Parents: 158af16
Author: GayathriMurali 
Authored: Fri Jun 24 13:25:40 2016 +0200
Committer: Nick Pentreath 
Committed: Fri Jun 24 13:25:40 2016 +0200

--
 docs/ml-features.md | 29 
 .../ml/JavaQuantileDiscretizerExample.java  |  7 -
 .../python/ml/quantile_discretizer_example.py   | 11 ++--
 .../ml/QuantileDiscretizerExample.scala |  9 --
 4 files changed, 38 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/be88383e/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 3cb2644..88fd291 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -46,14 +46,18 @@ In MLlib, we separate TF and IDF to make them flexible.
 `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
 fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
 `HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
-A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+A raw feature is mapped into an index (term) by applying a hash function. The hash function
+used here is [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash). Then term frequencies
 are calculated based on the mapped indices. This approach avoids the need to compute a global
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
 chance of collision, we can increase the target feature dimension, i.e. the number of buckets
 of the hash table. Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the feature dimension, otherwise the features will
-not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+An optional binary toggle parameter controls term frequency counts. When set to true all nonzero
+frequency counts are set to 1. This is especially useful for discrete probabilistic models that
+model binary, rather than integer, counts.
 
 `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
 ](ml-features.html#countvectorizer) for more details.
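
To make the new toggle concrete, here is a minimal Scala sketch. It is not part of
this diff, and it assumes a SparkSession named `spark` is already in scope; `setBinary`
is the setter for the binary toggle the added text describes.

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val sentenceData = spark.createDataFrame(Seq(
      (0.0, "Hi I heard about Spark"),
      (0.0, "I wish Java could use case classes")
    )).toDF("label", "sentence")

    // Split each sentence into words, then hash the words into a fixed-length vector.
    val wordsData = new Tokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .transform(sentenceData)

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(1024) // a power of two, as the guide advises
      .setBinary(true)      // the optional toggle: all nonzero term counts become 1

    hashingTF.transform(wordsData).select("rawFeatures").show(truncate = false)

With `setBinary(true)` a term that appears three times in a document contributes 1, not 3,
to its hashed bucket, which is what discrete probabilistic models over binary counts expect.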
@@ -145,9 +149,11 @@ for more details on the API.
  passed to other algorithms like LDA.
 
  During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
- term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
+ term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
  by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
- included in the vocabulary.
+ included in the vocabulary. Another optional binary toggle parameter controls the output vector.
+ If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic
+ models that model binary, rather than integer, counts.
 
 **Examples**
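
Likewise for `CountVectorizer`, a hypothetical snippet under the same assumptions,
combining `minDF` with the binary output toggle:

    import org.apache.spark.ml.feature.CountVectorizer

    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)     // a term must appear in at least 2 documents to enter the vocabulary
      .setBinary(true) // all nonzero counts in the output vector become 1
      .fit(df)

    cvModel.transform(df).show(truncate = false)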
 
@@ -1096,14 +1102,13 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
-categorical features.
-The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
-The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
-This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
-find fewer depending on the data sample values.
-
-Note that the result may be different every time you run it, since the sample strategy behind it is
-non-deterministic.
+categorical features. The number of bins is set by the `numBuckets` parameter.
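
The bundled QuantileDiscretizerExample.scala that this commit touches demonstrates the
transformer; roughly, and sketched here under the same `spark` assumption as above:

    import org.apache.spark.ml.feature.QuantileDiscretizer

    val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
    val df = spark.createDataFrame(data).toDF("id", "hour")

    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("result")
      .setNumBuckets(3) // requested number of bins; fewer may result for skewed samples

    // fit() computes the bucket splits from the data; transform() assigns each row a bin.
    val result = discretizer.fit(df).transform(df)
    result.show()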