Repository: spark
Updated Branches:
  refs/heads/master e1139dd60 -> 43dfc84f8


[SPARK-2830][MLLIB] doc update for 1.1

1. renamed mllib-basics to mllib-data-types
1. renamed mllib-stats to mllib-statistics
1. moved random data generation to the bottom of mllib-stats
1. updated toc accordingly

atalwalkar

Author: Xiangrui Meng <m...@databricks.com>

Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:

0bd79f3 [Xiangrui Meng] add mllib-data-types
b64a5d7 [Xiangrui Meng] update the content list of basis statistics in 
mllib-guide
f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43dfc84f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43dfc84f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43dfc84f

Branch: refs/heads/master
Commit: 43dfc84f883822ea27b6e312d4353bf301c2e7ef
Parents: e1139dd
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 27 01:19:48 2014 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 27 01:19:48 2014 -0700

----------------------------------------------------------------------
 docs/mllib-basics.md                   | 468 ----------------------------
 docs/mllib-data-types.md               | 468 ++++++++++++++++++++++++++++
 docs/mllib-dimensionality-reduction.md |   4 +-
 docs/mllib-guide.md                    |   9 +-
 docs/mllib-statistics.md               | 457 +++++++++++++++++++++++++++
 docs/mllib-stats.md                    | 457 ---------------------------
 6 files changed, 932 insertions(+), 931 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-basics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
deleted file mode 100644
index 8752df4..0000000
--- a/docs/mllib-basics.md
+++ /dev/null
@@ -1,468 +0,0 @@
----
-layout: global
-title: Basics - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
----
-
-* Table of contents
-{:toc}
-
-MLlib supports local vectors and matrices stored on a single machine, 
-as well as distributed matrices backed by one or more RDDs.
-Local vectors and local matrices are simple data models 
-that serve as public interfaces. The underlying linear algebra operations are 
provided by
-[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
-A training example used in supervised learning is called a "labeled point" in 
MLlib.
-
-## Local vector
-
-A local vector has integer-typed and 0-based indices and double-typed values, 
stored on a single
-machine.  MLlib supports two types of local vectors: dense and sparse.  A 
dense vector is backed by
-a double array representing its entry values, while a sparse vector is backed 
by two parallel
-arrays: indices and values.  For example, a vector `(1.0, 0.0, 3.0)` can be 
represented in dense
-format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, 
where `3` is the size
-of the vector.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-The base class of local vectors is
-[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we 
provide two
-implementations: 
[`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) 
and
-[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector).
  We recommend
-using the factory methods implemented in
-[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to 
create local vectors.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-
-// Create a dense vector (1.0, 0.0, 3.0).
-val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
-// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values 
corresponding to nonzero entries.
-val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
-// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
-val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
-{% endhighlight %}
-
-***Note:***
-Scala imports `scala.collection.immutable.Vector` by default, so you have to 
import
-`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
-
-</div>
-
-<div data-lang="java" markdown="1">
-
-The base class of local vectors is
-[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide 
two
-implementations: 
[`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
-[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html).  
We recommend
-using the factory methods implemented in
-[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create 
local vectors.
-
-{% highlight java %}
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.linalg.Vectors;
-
-// Create a dense vector (1.0, 0.0, 3.0).
-Vector dv = Vectors.dense(1.0, 0.0, 3.0);
-// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values 
corresponding to nonzero entries.
-Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-MLlib recognizes the following types as dense vectors:
-
-* NumPy's 
[`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
-* Python's list, e.g., `[1, 2, 3]`
-
-and the following as sparse vectors:
-
-* MLlib's 
[`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
-* SciPy's
-  
[`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-  with a single column
-
-We recommend using NumPy arrays over lists for efficiency, and using the 
factory methods implemented
-in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create 
sparse vectors.
-
-{% highlight python %}
-import numpy as np
-import scipy.sparse as sps
-from pyspark.mllib.linalg import Vectors
-
-# Use a NumPy array as a dense vector.
-dv1 = np.array([1.0, 0.0, 3.0])
-# Use a Python list as a dense vector.
-dv2 = [1.0, 0.0, 3.0]
-# Create a SparseVector.
-sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
-# Use a single-column SciPy csc_matrix as a sparse vector.
-sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 
2])), shape = (3, 1))
-{% endhighlight %}
-
-</div>
-</div>
-
-## Labeled point
-
-A labeled point is a local vector, either dense or sparse, associated with a 
label/response.
-In MLlib, labeled points are used in supervised learning algorithms.
-We use a double to store a label, so we can use labeled points in both 
regression and classification.
-For binary classification, a label should be either `0` (negative) or `1` 
(positive).
-For multiclass classification, labels should be class indices starting from 
zero: `0, 1, 2, ...`.
-
-<div class="codetabs">
-
-<div data-lang="scala" markdown="1">
-
-A labeled point is represented by the case class
-[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vectors
-import org.apache.spark.mllib.regression.LabeledPoint
-
-// Create a labeled point with a positive label and a dense feature vector.
-val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
-
-// Create a labeled point with a negative label and a sparse feature vector.
-val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-A labeled point is represented by
-[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
-
-{% highlight java %}
-import org.apache.spark.mllib.linalg.Vectors;
-import org.apache.spark.mllib.regression.LabeledPoint;
-
-// Create a labeled point with a positive label and a dense feature vector.
-LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
-
-// Create a labeled point with a negative label and a sparse feature vector.
-LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-A labeled point is represented by
-[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
-
-{% highlight python %}
-from pyspark.mllib.linalg import SparseVector
-from pyspark.mllib.regression import LabeledPoint
-
-# Create a labeled point with a positive label and a dense feature vector.
-pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
-
-# Create a labeled point with a negative label and a sparse feature vector.
-neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
-{% endhighlight %}
-</div>
-</div>
-
-***Sparse data***
-
-It is very common in practice to have sparse training data.  MLlib supports 
reading training
-examples stored in `LIBSVM` format, which is the default format used by
-[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
-[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text 
format in which each line
-represents a labeled sparse feature vector using the following format:
-
-~~~
-label index1:value1 index2:value2 ...
-~~~
-
-where the indices are one-based and in ascending order. 
-After loading, the feature indices are converted to zero-based.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$)
 reads training
-examples stored in LIBSVM format.
-
-{% highlight scala %}
-import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.util.MLUtils
-import org.apache.spark.rdd.RDD
-
-val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) 
reads training
-examples stored in LIBSVM format.
-
-{% highlight java %}
-import org.apache.spark.mllib.regression.LabeledPoint;
-import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.api.java.JavaRDD;
-
-JavaRDD<LabeledPoint> examples = 
-  MLUtils.loadLibSVMFile(jsc.sc(), 
"data/mllib/sample_libsvm_data.txt").toJavaRDD();
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) 
reads training
-examples stored in LIBSVM format.
-
-{% highlight python %}
-from pyspark.mllib.util import MLUtils
-
-examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
-{% endhighlight %}
-</div>
-</div>
-
-## Local matrix
-
-A local matrix has integer-typed row and column indices and double-typed 
values, stored on a single
-machine.  MLlib supports dense matrices, whose entry values are stored in a 
single double array in
-column-major order.  For example, the following matrix `\[ \begin{pmatrix}
-1.0 & 2.0 \\
-3.0 & 4.0 \\
-5.0 & 6.0
-\end{pmatrix}
-\]`
-is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the 
matrix size `(3, 2)`.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-The base class of local matrices is
-[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we 
provide one
-implementation: 
[`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
-We recommend using the factory methods implemented
-in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) 
to create local
-matrices.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.{Matrix, Matrices}
-
-// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
-val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-The base class of local matrices is
-[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide 
one
-implementation: 
[`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
-We recommend using the factory methods implemented
-in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to 
create local
-matrices.
-
-{% highlight java %}
-import org.apache.spark.mllib.linalg.Matrix;
-import org.apache.spark.mllib.linalg.Matrices;
-
-// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
-Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
-{% endhighlight %}
-</div>
-
-</div>
-
-## Distributed matrix
-
-A distributed matrix has long-typed row and column indices and double-typed 
values, stored
-distributively in one or more RDDs.  It is very important to choose the right 
format to store large
-and distributed matrices.  Converting a distributed matrix to a different 
format may require a
-global shuffle, which is quite expensive.  Three types of distributed matrices 
have been implemented
-so far.
-
-The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented 
distributed
-matrix without meaningful row indices, e.g., a collection of feature vectors.
-It is backed by an RDD of its rows, where each row is a local vector.
-We assume that the number of columns is not huge for a `RowMatrix` so that a 
single
-local vector can be reasonably communicated to the driver and can also be 
stored /
-operated on using a single node. 
-An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
-which can be used for identifying rows and executing joins.
-A `CoordinateMatrix` is a distributed matrix stored in [coordinate list 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) 
format,
-backed by an RDD of its entries.
-
-***Note***
-
-The underlying RDDs of a distributed matrix must be deterministic, because we 
cache the matrix size.
-In general the use of non-deterministic RDDs can lead to errors.
-
-### RowMatrix
-
-A `RowMatrix` is a row-oriented distributed matrix without meaningful row 
indices, backed by an RDD
-of its rows, where each row is a local vector.
-Since each row is represented by a local vector, the number of columns is
-limited by the integer range but it should be much smaller in practice.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-A 
[`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix)
 can be
-created from an `RDD[Vector]` instance.  Then we can compute its column 
summary statistics.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vector
-import org.apache.spark.mllib.linalg.distributed.RowMatrix
-
-val rows: RDD[Vector] = ... // an RDD of local vectors
-// Create a RowMatrix from an RDD[Vector].
-val mat: RowMatrix = new RowMatrix(rows)
-
-// Get its size.
-val m = mat.numRows()
-val n = mat.numCols()
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-A 
[`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html)
 can be
-created from a `JavaRDD<Vector>` instance.  Then we can compute its column 
summary statistics.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.linalg.distributed.RowMatrix;
-
-JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
-// Create a RowMatrix from a JavaRDD<Vector>.
-RowMatrix mat = new RowMatrix(rows.rdd());
-
-// Get its size.
-long m = mat.numRows();
-long n = mat.numCols();
-{% endhighlight %}
-</div>
-</div>
-
-### IndexedRowMatrix
-
-An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row 
indices.  It is backed by
-an RDD of indexed rows, so that each row is represented by its index 
(long-typed) and a local vector.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-An
-[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
-can be created from an `RDD[IndexedRow]` instance, where
-[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow)
 is a
-wrapper over `(Long, Vector)`.  An `IndexedRowMatrix` can be converted to a 
`RowMatrix` by dropping
-its row indices.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.distributed.{IndexedRow, 
IndexedRowMatrix, RowMatrix}
-
-val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
-// Create an IndexedRowMatrix from an RDD[IndexedRow].
-val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
-
-// Get its size.
-val m = mat.numRows()
-val n = mat.numCols()
-
-// Drop its row indices.
-val rowMat: RowMatrix = mat.toRowMatrix()
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-An
-[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
-can be created from a `JavaRDD<IndexedRow>` instance, where
-[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html)
 is a
-wrapper over `(long, Vector)`.  An `IndexedRowMatrix` can be converted to a 
`RowMatrix` by dropping
-its row indices.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.mllib.linalg.distributed.IndexedRow;
-import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
-import org.apache.spark.mllib.linalg.distributed.RowMatrix;
-
-JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
-// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
-IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
-
-// Get its size.
-long m = mat.numRows();
-long n = mat.numCols();
-
-// Drop its row indices.
-RowMatrix rowMat = mat.toRowMatrix();
-{% endhighlight %}
-</div></div>
-
-### CoordinateMatrix
-
-A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries.  
Each entry is a tuple
-of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the 
column index, and
-`value` is the entry value.  A `CoordinateMatrix` should be used only when both
-dimensions of the matrix are huge and the matrix is very sparse.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-A
-[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
-can be created from an `RDD[MatrixEntry]` instance, where
-[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry)
 is a
-wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`.  Other computations for 
-`CoordinateMatrix` are not currently supported.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, 
MatrixEntry}
-
-val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
-// Create a CoordinateMatrix from an RDD[MatrixEntry].
-val mat: CoordinateMatrix = new CoordinateMatrix(entries)
-
-// Get its size.
-val m = mat.numRows()
-val n = mat.numCols()
-
-// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
-val indexedRowMatrix = mat.toIndexedRowMatrix()
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-A
-[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
-can be created from a `JavaRDD<MatrixEntry>` instance, where
-[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html)
 is a
-wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`. Other computations for 
-`CoordinateMatrix` are not currently supported.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
-import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
-import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
-
-JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
-// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
-CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
-
-// Get its size.
-long m = mat.numRows();
-long n = mat.numCols();
-
-// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
-IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
-{% endhighlight %}
-</div>
-</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
new file mode 100644
index 0000000..101dc2f
--- /dev/null
+++ b/docs/mllib-data-types.md
@@ -0,0 +1,468 @@
+---
+layout: global
+title: Data Types - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Data Types
+---
+
+* Table of contents
+{:toc}
+
+MLlib supports local vectors and matrices stored on a single machine, 
+as well as distributed matrices backed by one or more RDDs.
+Local vectors and local matrices are simple data models 
+that serve as public interfaces. The underlying linear algebra operations are 
provided by
+[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
+A training example used in supervised learning is called a "labeled point" in 
MLlib.
+
+## Local vector
+
+A local vector has integer-typed and 0-based indices and double-typed values, 
stored on a single
+machine.  MLlib supports two types of local vectors: dense and sparse.  A 
dense vector is backed by
+a double array representing its entry values, while a sparse vector is backed 
by two parallel
+arrays: indices and values.  For example, a vector `(1.0, 0.0, 3.0)` can be 
represented in dense
+format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, 
where `3` is the size
+of the vector.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The base class of local vectors is
+[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we 
provide two
+implementations: 
[`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) 
and
+[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector).
  We recommend
+using the factory methods implemented in
+[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to 
create local vectors.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+// Create a dense vector (1.0, 0.0, 3.0).
+val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values 
corresponding to nonzero entries.
+val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
+val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
+{% endhighlight %}
+
+***Note:***
+Scala imports `scala.collection.immutable.Vector` by default, so you have to 
import
+`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+The base class of local vectors is
+[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide 
two
+implementations: 
[`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
+[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html).  
We recommend
+using the factory methods implemented in
+[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create 
local vectors.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+// Create a dense vector (1.0, 0.0, 3.0).
+Vector dv = Vectors.dense(1.0, 0.0, 3.0);
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values 
corresponding to nonzero entries.
+Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+MLlib recognizes the following types as dense vectors:
+
+* NumPy's 
[`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
+* Python's list, e.g., `[1, 2, 3]`
+
+and the following as sparse vectors:
+
+* MLlib's 
[`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
+* SciPy's
+  
[`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
+  with a single column
+
+We recommend using NumPy arrays over lists for efficiency, and using the 
factory methods implemented
+in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create 
sparse vectors.
+
+{% highlight python %}
+import numpy as np
+import scipy.sparse as sps
+from pyspark.mllib.linalg import Vectors
+
+# Use a NumPy array as a dense vector.
+dv1 = np.array([1.0, 0.0, 3.0])
+# Use a Python list as a dense vector.
+dv2 = [1.0, 0.0, 3.0]
+# Create a SparseVector.
+sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
+# Use a single-column SciPy csc_matrix as a sparse vector.
+sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 
2])), shape = (3, 1))
+{% endhighlight %}
+
+</div>
+</div>
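
As a small supplement, here is a minimal Scala sketch (assuming only `spark-mllib` on the classpath; no SparkContext is needed for local vectors) that builds the same logical vector in both formats and reads its values back:

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Both represent the same logical vector of size 3.
println(dense.size)                              // 3
println(sparse.size)                             // 3
println(sparse.toArray.mkString("[", ", ", "]")) // [1.0, 0.0, 3.0]
{% endhighlight %}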
+
+## Labeled point
+
+A labeled point is a local vector, either dense or sparse, associated with a 
label/response.
+In MLlib, labeled points are used in supervised learning algorithms.
+We use a double to store a label, so we can use labeled points in both 
regression and classification.
+For binary classification, a label should be either `0` (negative) or `1` 
(positive).
+For multiclass classification, labels should be class indices starting from 
zero: `0, 1, 2, ...`.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+A labeled point is represented by the case class
+[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Create a labeled point with a positive label and a dense feature vector.
+val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
+
+// Create a labeled point with a negative label and a sparse feature vector.
+val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+A labeled point is represented by
+[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.regression.LabeledPoint;
+
+// Create a labeled point with a positive label and a dense feature vector.
+LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
+
+// Create a labeled point with a negative label and a sparse feature vector.
+LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+A labeled point is represented by
+[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
+
+{% highlight python %}
+from pyspark.mllib.linalg import SparseVector
+from pyspark.mllib.regression import LabeledPoint
+
+# Create a labeled point with a positive label and a dense feature vector.
+pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
+
+# Create a labeled point with a negative label and a sparse feature vector.
+neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
+{% endhighlight %}
+</div>
+</div>
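
Because `LabeledPoint` is a plain case class, its label and feature vector can be read back directly. A minimal Scala sketch (no SparkContext needed):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val point = LabeledPoint(1.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

// A double label and a (dense or sparse) feature vector.
println(point.label)    // 1.0
println(point.features) // (3,[0,2],[1.0,3.0])
{% endhighlight %}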
+
+***Sparse data***
+
+It is very common in practice to have sparse training data.  MLlib supports 
reading training
+examples stored in `LIBSVM` format, which is the default format used by
+[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
+[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/).  It is a text 
format in which each line
+represents a labeled sparse feature vector using the following format:
+
+~~~
+label index1:value1 index2:value2 ...
+~~~
+
+where the indices are one-based and in ascending order. 
+After loading, the feature indices are converted to zero-based.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$)
 reads training
+examples stored in LIBSVM format.
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, 
"data/mllib/sample_libsvm_data.txt")
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) 
reads training
+examples stored in LIBSVM format.
+
+{% highlight java %}
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.api.java.JavaRDD;
+
+JavaRDD<LabeledPoint> examples = 
+  MLUtils.loadLibSVMFile(jsc.sc(), 
"data/mllib/sample_libsvm_data.txt").toJavaRDD();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) 
reads training
+examples stored in LIBSVM format.
+
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+{% endhighlight %}
+</div>
+</div>
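
For a round trip, `MLUtils.saveAsLibSVMFile` writes labeled points back out in the same format. A minimal Scala sketch, assuming a local master and a writable output directory (the output path below is only illustrative):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("libsvm-roundtrip"))

// Load LIBSVM-formatted examples; feature indices become zero-based in memory.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Write them back out in LIBSVM format; indices are one-based on disk again.
MLUtils.saveAsLibSVMFile(examples, "target/sample_libsvm_roundtrip")

sc.stop()
{% endhighlight %}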
+
+## Local matrix
+
+A local matrix has integer-typed row and column indices and double-typed 
values, stored on a single
+machine.  MLlib supports dense matrices, whose entry values are stored in a 
single double array in
+column-major order.  For example, the following matrix `\[ \begin{pmatrix}
+1.0 & 2.0 \\
+3.0 & 4.0 \\
+5.0 & 6.0
+\end{pmatrix}
+\]`
+is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the 
matrix size `(3, 2)`.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The base class of local matrices is
+[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we 
provide one
+implementation: 
[`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+We recommend using the factory methods implemented
+in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) 
to create local
+matrices.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Matrix, Matrices}
+
+// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
+val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+The base class of local matrices is
+[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide 
one
+implementation: 
[`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
+We recommend using the factory methods implemented
+in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to 
create local
+matrices.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.Matrices;
+
+// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
+Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
+{% endhighlight %}
+</div>
+
+</div>
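
To make the column-major layout concrete, a small Scala sketch (no SparkContext needed) that recovers entry `(2, 1)` from the underlying array at position `i + j * numRows`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrices

// The 3x2 matrix from the text, stored column-major as [1.0, 3.0, 5.0, 2.0, 4.0, 6.0].
val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

println(dm.numRows) // 3
println(dm.numCols) // 2

// Entry (i, j) lives at position i + j * numRows in the column-major array.
val values = dm.toArray
println(values(2 + 1 * dm.numRows)) // 6.0, i.e., row 2, column 1
{% endhighlight %}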
+
+## Distributed matrix
+
+A distributed matrix has long-typed row and column indices and double-typed 
values, stored
+distributively in one or more RDDs.  It is very important to choose the right 
format to store large
+and distributed matrices.  Converting a distributed matrix to a different 
format may require a
+global shuffle, which is quite expensive.  Three types of distributed matrices 
have been implemented
+so far.
+
+The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented 
distributed
+matrix without meaningful row indices, e.g., a collection of feature vectors.
+It is backed by an RDD of its rows, where each row is a local vector.
+We assume that the number of columns is not huge for a `RowMatrix` so that a 
single
+local vector can be reasonably communicated to the driver and can also be 
stored /
+operated on using a single node. 
+An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
+which can be used for identifying rows and executing joins.
+A `CoordinateMatrix` is a distributed matrix stored in [coordinate list 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) 
format,
+backed by an RDD of its entries.
+
+***Note***
+
+The underlying RDDs of a distributed matrix must be deterministic, because we 
cache the matrix size.
+In general the use of non-deterministic RDDs can lead to errors.
+
+### RowMatrix
+
+A `RowMatrix` is a row-oriented distributed matrix without meaningful row 
indices, backed by an RDD
+of its rows, where each row is a local vector.
+Since each row is represented by a local vector, the number of columns is
+limited by the integer range but it should be much smaller in practice.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+A 
[`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix)
 can be
+created from an `RDD[Vector]` instance.  Then we can compute its column 
summary statistics.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+val rows: RDD[Vector] = ... // an RDD of local vectors
+// Create a RowMatrix from an RDD[Vector].
+val mat: RowMatrix = new RowMatrix(rows)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+A 
[`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html)
 can be
+created from a `JavaRDD<Vector>` instance.  Then we can compute its column 
summary statistics.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
+// Create a RowMatrix from a JavaRDD<Vector>.
+RowMatrix mat = new RowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+{% endhighlight %}
+</div>
+</div>
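
A minimal end-to-end Scala sketch of the column summary statistics mentioned above, assuming a local master; the tiny data set is only illustrative:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rowmatrix-example"))

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

val mat = new RowMatrix(rows)

// Column summary statistics, as mentioned in the text above.
val summary = mat.computeColumnSummaryStatistics()
println(summary.mean)     // [2.0,20.0,200.0]
println(summary.variance) // column-wise variance

sc.stop()
{% endhighlight %}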
+
+### IndexedRowMatrix
+
+An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row 
indices.  It is backed by
+an RDD of indexed rows, so that each row is represented by its index 
(long-typed) and a local vector.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+An
+[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+can be created from an `RDD[IndexedRow]` instance, where
+[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow)
 is a
+wrapper over `(Long, Vector)`.  An `IndexedRowMatrix` can be converted to a 
`RowMatrix` by dropping
+its row indices.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{IndexedRow, 
IndexedRowMatrix, RowMatrix}
+
+val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
+// Create an IndexedRowMatrix from an RDD[IndexedRow].
+val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Drop its row indices.
+val rowMat: RowMatrix = mat.toRowMatrix()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+An
+[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
+can be created from a `JavaRDD<IndexedRow>` instance, where
+[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html)
 is a
+wrapper over `(long, Vector)`.  An `IndexedRowMatrix` can be converted to a 
`RowMatrix` by dropping
+its row indices.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.IndexedRow;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
+// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
+IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Drop its row indices.
+RowMatrix rowMat = mat.toRowMatrix();
+{% endhighlight %}
+</div></div>
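
Since `IndexedRow` is a simple `(index, vector)` wrapper, rows can also be built directly. A minimal Scala sketch, assuming a local master; note that `numRows()` is determined by the largest index plus one:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("indexedrowmatrix-example"))

// IndexedRow is a simple (index, vector) wrapper; row 1 is implicitly all zeros here.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(2L, Vectors.dense(3.0, 4.0))))

val mat = new IndexedRowMatrix(rows)
println(mat.numRows()) // 3, i.e., the largest index plus one
println(mat.numCols()) // 2

sc.stop()
{% endhighlight %}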
+
+### CoordinateMatrix
+
+A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries.  
Each entry is a tuple
+of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the 
column index, and
+`value` is the entry value.  A `CoordinateMatrix` should be used only when both
+dimensions of the matrix are huge and the matrix is very sparse.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+A
+[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+can be created from an `RDD[MatrixEntry]` instance, where
+[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry)
 is a
+wrapper over `(Long, Long, Double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`.  Other computations for 
+`CoordinateMatrix` are not currently supported.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, 
MatrixEntry}
+
+val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
+// Create a CoordinateMatrix from an RDD[MatrixEntry].
+val mat: CoordinateMatrix = new CoordinateMatrix(entries)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+val indexedRowMatrix = mat.toIndexedRowMatrix()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+A
+[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
+can be created from a `JavaRDD<MatrixEntry>` instance, where
+[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html)
 is a
+wrapper over `(long, long, double)`.  A `CoordinateMatrix` can be converted to 
an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. Other computations for 
+`CoordinateMatrix` are not currently supported.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
+
+JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
+// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
+CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
+{% endhighlight %}
+</div>
+</div>
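
A minimal Scala sketch that builds a small `CoordinateMatrix` from explicit `MatrixEntry` values, assuming a local master; entries that are not listed are treated as zeros:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("coordinatematrix-example"))

// MatrixEntry is a simple (i, j, value) wrapper.
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.2),
  MatrixEntry(1L, 2L, 2.5),
  MatrixEntry(3L, 1L, 7.0)))

val mat = new CoordinateMatrix(entries)
println(mat.numRows()) // 4
println(mat.numCols()) // 3

// Convert to an IndexedRowMatrix with sparse rows, as described above.
val indexedRowMatrix = mat.toIndexedRowMatrix()

sc.stop()
{% endhighlight %}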

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md 
b/docs/mllib-dimensionality-reduction.md
index 9f2cf6d..21cb35b 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - 
Dimensionality Reduction
 of reducing the number of variables under consideration.
 It can be used to extract latent features from raw and noisy features
 or compress data while maintaining the structure.
-MLlib provides support for dimensionality reduction on the <a 
href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
+MLlib provides support for dimensionality reduction on the <a 
href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
 
 ## Singular value decomposition (SVD)
 
@@ -58,7 +58,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage 
on the driver.
 ### SVD Example
  
 MLlib provides SVD functionality to row-oriented matrices, provided in the
-<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class. 
+<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class. 
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 4d4198b..d3a510b 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -7,12 +7,13 @@ MLlib is Spark's scalable machine learning library consisting 
of common learning
 including classification, regression, clustering, collaborative
 filtering, dimensionality reduction, as well as underlying optimization 
primitives, as outlined below:
 
-* [Data types](mllib-basics.html)
-* [Basic statistics](mllib-stats.html)
-  * random data generation  
-  * stratified sampling
+* [Data types](mllib-data-types.html)
+* [Basic statistics](mllib-statistics.html)
   * summary statistics
+  * correlations
+  * stratified sampling
   * hypothesis testing
+  * random data generation  
 * [Classification and regression](mllib-classification-regression.html)
   * [linear models (SVMs, logistic regression, linear 
regression)](mllib-linear-methods.html)
   * [decision trees](mllib-decision-tree.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
new file mode 100644
index 0000000..c463241
--- /dev/null
+++ b/docs/mllib-statistics.md
@@ -0,0 +1,457 @@
+---
+layout: global
+title: Basic Statistics - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics 
+---
+
+* Table of contents
+{:toc}
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}} 
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}} 
+\newcommand{\ind}{\mathbf{1}} 
+\newcommand{\0}{\mathbf{0}} 
+\newcommand{\unit}{\mathbf{e}} 
+\newcommand{\one}{\mathbf{1}} 
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+## Summary statistics 
+
+We provide column summary statistics for `RDD[Vector]` through the function 
`colStats` 
+available in `Statistics`.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) 
returns an instance of
+[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
+total count.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
+
+val observations: RDD[Vector] = ... // an RDD of Vectors
+
+// Compute column summary statistics.
+val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
+println(summary.mean) // a dense vector containing the mean value for each 
column
+println(summary.variance) // column-wise variance
+println(summary.numNonzeros) // number of nonzeros in each column
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns 
an instance of
+[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
+which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
+total count.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaRDD<Vector> mat = ... // an RDD of Vectors
+
+// Compute column summary statistics.
+MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
+System.out.println(summary.mean()); // a dense vector containing the mean 
value for each column
+System.out.println(summary.variance()); // column-wise variance
+System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`colStats()`](api/python/pyspark.mllib.stat.Statistics-class.html#colStats) 
returns an instance of
+[`MultivariateStatisticalSummary`](api/python/pyspark.mllib.stat.MultivariateStatisticalSummary-class.html),
+which contains the column-wise max, min, mean, variance, and number of 
nonzeros, as well as the
+total count.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+mat = ... # an RDD of Vectors
+
+# Compute column summary statistics.
+summary = Statistics.colStats(mat)
+print summary.mean()
+print summary.variance()
+print summary.numNonzeros()
+
+{% endhighlight %}
+</div>
+
+</div>
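
A minimal runnable Scala sketch of `colStats`, assuming a local master; the three observations are only illustrative:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("colstats-example"))

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

val summary = Statistics.colStats(observations)
println(summary.mean)        // [2.0,20.0,200.0]
println(summary.max)         // [3.0,30.0,300.0]
println(summary.numNonzeros) // [3.0,3.0,3.0]
println(summary.count)       // 3

sc.stop()
{% endhighlight %}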
+
+## Correlations
+
+Calculating the correlation between two series of data is a common operation in statistics. In MLlib
+we provide the flexibility to calculate pairwise correlations among many series. The supported
+correlation methods are currently Pearson's and Spearman's correlation.
+ 
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) 
provides methods to 
+calculate correlations between series. Depending on the type of input, two 
`RDD[Double]`s or 
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` 
respectively.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.stat.Statistics
+
+val sc: SparkContext = ...
+
+val seriesX: RDD[Double] = ... // a series
+val seriesY: RDD[Double] = ... // must have the same number of partitions and 
cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for 
Spearman's method. If a 
+// method is not specified, Pearson's method will be used by default. 
+val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
+
+val data: RDD[Vector] = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for 
Spearman's method.
+// If a method is not specified, Pearson's method will be used by default. 
+val correlMatrix: Matrix = Statistics.corr(data, "pearson")
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides 
methods to 
+calculate correlations between series. Depending on the type of input, two 
`JavaDoubleRDD`s or 
+a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` 
respectively.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaDoubleRDD seriesX = ... // a series
+JavaDoubleRDD seriesY = ... // must have the same number of partitions and 
cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for 
Spearman's method. If a 
+// method is not specified, Pearson's method will be used by default. 
+Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), 
"pearson");
+
+JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for 
Spearman's method.
+// If a method is not specified, Pearson's method will be used by default. 
+Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/pyspark.mllib.stat.Statistics-class.html) provides 
methods to 
+calculate correlations between series. Depending on the type of input, two 
`RDD[Double]`s or 
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` 
respectively.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+seriesX = ... # a series
+seriesY = ... # must have the same number of partitions and cardinality as 
seriesX
+
+# Compute the correlation using Pearson's method. Enter "spearman" for 
Spearman's method. If a 
+# method is not specified, Pearson's method will be used by default. 
+print Statistics.corr(seriesX, seriesY, method="pearson")
+
+data = ... # an RDD of Vectors
+# calculate the correlation matrix using Pearson's method. Use "spearman" for 
Spearman's method.
+# If a method is not specified, Pearson's method will be used by default. 
+print Statistics.corr(data, method="pearson")
+
+{% endhighlight %}
+</div>
+
+</div>
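
To illustrate the difference between the two methods, here is a minimal Scala sketch (assuming a local master) that compares Pearson's and Spearman's correlation on a perfectly monotonic but nonlinear relationship:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("correlation-example"))

// Two series with a perfectly monotonic but nonlinear relationship.
val seriesX = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
val seriesY = seriesX.map(x => x * x) // same partitioning and cardinality as seriesX

println(Statistics.corr(seriesX, seriesY, "pearson"))  // close to, but below, 1.0
println(Statistics.corr(seriesX, seriesY, "spearman")) // exactly 1.0 (rank-based)

sc.stop()
{% endhighlight %}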
+
+## Stratified sampling
+
+Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
+`sampleByKey` and `sampleByKeyExact` are performed on RDDs of key-value pairs. For stratified
+sampling, the keys can be thought of as labels and the values as a specific attribute. For example,
+the keys can be men and women, or document IDs, and the respective values can be the lists of ages
+of the people in the population or the lists of words in the documents. The `sampleByKey` method
+flips a coin to decide whether an observation is sampled, so it requires only one pass over the
+data and provides an *expected* sample size. `sampleByKeyExact` requires significantly more
+resources than the per-stratum simple random sampling used in `sampleByKey`, but provides the
+exact sample size with 99.99% confidence. `sampleByKeyExact` is currently not supported in Python.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions)
 allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where 
$f_k$ is the desired 
+fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and 
$K$ is the set of
+keys. Sampling without replacement requires one additional pass over the RDD 
to guarantee sample 
+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+import org.apache.spark.rdd.PairRDDFunctions
+
+val sc: SparkContext = ...
+
+val data = ... // an RDD[(K, V)] of any key value pairs
+val fractions: Map[K, Double] = ... // specify the exact fraction desired from 
each key
+
+// Get an exact sample from each stratum
+val approxSample = data.sampleByKey(withReplacement = false, fractions)
+val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) 
allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where 
$f_k$ is the desired 
+fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and 
$K$ is the set of
+keys. Sampling without replacement requires one additional pass over the RDD 
to guarantee sample 
+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight java %}
+import java.util.Map;
+
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+JavaSparkContext jsc = ...
+
+JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
+Map<K, Object> fractions = ... // specify the exact fraction desired from each 
key
+
+// Get an exact sample from each stratum
+JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
+JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
+
+{% endhighlight %}
+</div>
+<div data-lang="python" markdown="1">
+[`sampleByKey()`](api/python/pyspark.rdd.RDD-class.html#sampleByKey) allows 
users to
+sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, 
where $f_k$ is the 
+desired fraction for key $k$, $n_k$ is the number of key-value pairs for key 
$k$, and $K$ is the 
+set of keys.
+
+*Note:* `sampleByKeyExact()` is currently not supported in Python.
+
+{% highlight python %}
+
+sc = ... # SparkContext
+
+data = ... # an RDD of any key value pairs
+fractions = ... # specify the exact fraction desired from each key as a 
dictionary
+
+approxSample = data.sampleByKey(False, fractions)
+
+{% endhighlight %}
+</div>
+
+</div>
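
A minimal Scala sketch with concrete strata and fractions, assuming a local master; the keys `"a"` and `"b"` and the chosen fractions are only illustrative:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // for PairRDDFunctions on RDD[(K, V)]

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("stratified-sampling-example"))

// The key is the stratum ("a" or "b"); the value is the observation.
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("b", 6)))

// Sample 50% of stratum "a" and about one third of stratum "b".
val fractions = Map("a" -> 0.5, "b" -> 1.0 / 3.0)

val approxSample = data.sampleByKey(withReplacement = false, fractions)
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
println(exactSample.collect().toSeq)

sc.stop()
{% endhighlight %}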
+
+## Hypothesis testing
+
+Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
+significant, i.e., whether it is likely to have occurred by chance. MLlib currently supports Pearson's
+chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine
+whether the goodness-of-fit test or the independence test is conducted. The goodness-of-fit test
+requires an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
+
+MLlib also supports the input type `RDD[LabeledPoint]` to enable feature 
selection via chi-squared 
+independence tests.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) 
provides methods to 
+run Pearson's chi-squared tests. The following example demonstrates how to run 
and interpret 
+hypothesis tests.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.mllib.stat.test.ChiSqTestResult
+
+val sc: SparkContext = ...
+
+val vec: Vector = ... // a vector composed of the frequencies of events
+
+// compute the goodness of fit. If a second vector to test against is not 
supplied as a parameter, 
+// the test runs against a uniform distribution.  
+val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
+println(goodnessOfFitTestResult) // summary of the test including the p-value, 
degrees of freedom, 
+                                 // test statistic, the method used, and the 
null hypothesis.
+
+val mat: Matrix = ... // a contingency matrix
+
+// conduct Pearson's independence test on the input contingency matrix
+val independenceTestResult = Statistics.chiSqTest(mat) 
+println(independenceTestResult) // summary of the test including the p-value, 
degrees of freedom...
+
+val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
+
+// The contingency table is constructed from the raw (feature, label) pairs 
and used to conduct
+// the independence test. Returns an array containing the ChiSquaredTestResult 
for every feature 
+// against the label.
+val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
+var i = 1
+featureTestResults.foreach { result =>
+    println(s"Column $i:\n$result")
+    i += 1
+} // summary of the test 
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides 
methods to 
+run Pearson's chi-squared tests. The following example demonstrates how to run 
and interpret 
+hypothesis tests.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.stat.Statistics;
+import org.apache.spark.mllib.stat.test.ChiSqTestResult;
+
+JavaSparkContext jsc = ...
+
+Vector vec = ... // a vector composed of the frequencies of events
+
+// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
+// the test runs against a uniform distribution.
+ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
+// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
+// and the null hypothesis.
+System.out.println(goodnessOfFitTestResult);
+
+Matrix mat = ... // a contingency matrix
+
+// conduct Pearson's independence test on the input contingency matrix
+ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
+// summary of the test including the p-value, degrees of freedom...
+System.out.println(independenceTestResult);
+
+JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points
+
+// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+// the independence test. Returns an array containing the ChiSqTestResult for every feature
+// against the label.
+ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
+int i = 1;
+for (ChiSqTestResult result : featureTestResults) {
+    System.out.println("Column " + i + ":");
+    System.out.println(result); // summary of the test
+    i++;
+}
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Random data generation
+
+Random data generation is useful for randomized algorithms, prototyping, and performance testing.
+MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
+uniform, standard normal, or Poisson.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.random.RandomRDDs._
+
+val sc: SparkContext = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+val u = normalRDD(sc, 1000000L, 10)
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+val v = u.map(x => 1.0 + 2.0 * x)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`RandomRDDs`](api/java/org/apache/spark/mllib/random/RandomRDDs.html) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import static org.apache.spark.mllib.random.RandomRDDs.*;
+
+JavaSparkContext jsc = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+JavaRDD<Double> v = u.map(
+  new Function<Double, Double>() {
+    public Double call(Double x) {
+      return 1.0 + 2.0 * x;
+    }
+  });
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight python %}
+from pyspark.mllib.random import RandomRDDs
+
+sc = ... # SparkContext
+
+# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+u = RandomRDDs.normalRDD(sc, 1000000L, 10)
+# Apply a transform to get a random double RDD following `N(1, 4)`.
+v = u.map(lambda x: 1.0 + 2.0 * x)
+{% endhighlight %}
+</div>
+
+</div>
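+
+Factory methods for the other supported distributions and for random vector RDDs follow the same
+pattern. A brief Scala sketch, with illustrative parameter values:
+
+{% highlight scala %}
+import org.apache.spark.mllib.random.RandomRDDs._
+
+// assumes an existing SparkContext `sc`
+// 10,000 i.i.d. values drawn from a Poisson distribution with mean 2.0, in 5 partitions
+val p = poissonRDD(sc, 2.0, 10000L, 5)
+// 10,000 random vectors of length 3 with i.i.d. standard normal entries
+val m = normalVectorRDD(sc, 10000L, 3)
+{% endhighlight %}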

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-stats.md
----------------------------------------------------------------------
diff --git a/docs/mllib-stats.md b/docs/mllib-stats.md
deleted file mode 100644
index 511a9fb..0000000
--- a/docs/mllib-stats.md
+++ /dev/null
@@ -1,457 +0,0 @@
----
-layout: global
-title: Statistics Functionality - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality 
----
-
-* Table of contents
-{:toc}
-
-
-`\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}} 
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}} 
-\newcommand{\ind}{\mathbf{1}} 
-\newcommand{\0}{\mathbf{0}} 
-\newcommand{\unit}{\mathbf{e}} 
-\newcommand{\one}{\mathbf{1}} 
-\newcommand{\zero}{\mathbf{0}}
-\]`
-
-## Summary Statistics 
-
-We provide column summary statistics for `RDD[Vector]` through the function `colStats`
-available in `Statistics`.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
-[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vector
-import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
-
-val observations: RDD[Vector] = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
-println(summary.mean) // a dense vector containing the mean value for each column
-println(summary.variance) // column-wise variance
-println(summary.numNonzeros) // number of nonzeros in each column
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
-[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaRDD<Vector> mat = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
-System.out.println(summary.mean()); // a dense vector containing the mean value for each column
-System.out.println(summary.variance()); // column-wise variance
-System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`colStats()`](api/python/pyspark.mllib.stat.Statistics-class.html#colStats) returns an instance of
-[`MultivariateStatisticalSummary`](api/python/pyspark.mllib.stat.MultivariateStatisticalSummary-class.html),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-mat = ... # an RDD of Vectors
-
-# Compute column summary statistics.
-summary = Statistics.colStats(mat)
-print summary.mean()
-print summary.variance()
-print summary.numNonzeros()
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Random data generation
-
-Random data generation is useful for randomized algorithms, prototyping, and performance testing.
-MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
-uniform, standard normal, or Poisson.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.random.RandomRDDs._
-
-val sc: SparkContext = ...
-
-// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-val u = normalRDD(sc, 1000000L, 10)
-// Apply a transform to get a random double RDD following `N(1, 4)`.
-val v = u.map(x => 1.0 + 2.0 * x)
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`RandomRDDs`](api/java/org/apache/spark/mllib/random/RandomRDDs.html) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight java %}
-import org.apache.spark.SparkContext;
-import org.apache.spark.api.JavaDoubleRDD;
-import static org.apache.spark.mllib.random.RandomRDDs.*;
-
-JavaSparkContext jsc = ...
-
-// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
-// Apply a transform to get a random double RDD following `N(1, 4)`.
-JavaDoubleRDD v = u.map(
-  new Function<Double, Double>() {
-    public Double call(Double x) {
-      return 1.0 + 2.0 * x;
-    }
-  });
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight python %}
-from pyspark.mllib.random import RandomRDDs
-
-sc = ... # SparkContext
-
-# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-u = RandomRDDs.normalRDD(sc, 1000000L, 10)
-# Apply a transform to get a random double RDD following `N(1, 4)`.
-v = u.map(lambda x: 1.0 + 2.0 * x)
-{% endhighlight %}
-</div>
-
-</div>
-
-## Correlations calculation
-
-Calculating the correlation between two series of data is a common operation in statistics. In MLlib
-we provide the flexibility to calculate pairwise correlations among many series. The supported
-correlation methods are currently Pearson's and Spearman's correlation.
- 
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
-an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.stat.Statistics
-
-val sc: SparkContext = ...
-
-val seriesX: RDD[Double] = ... // a series
-val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
-
-val data: RDD[Vector] = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-val correlMatrix: Matrix = Statistics.corr(data, "pearson")
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
-a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaDoubleRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaDoubleRDD seriesX = ... // a series
-JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
-
-JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
-
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`Statistics`](api/python/pyspark.mllib.stat.Statistics-class.html) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
-an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-seriesX = ... # a series
-seriesY = ... # must have the same number of partitions and cardinality as seriesX
-
-# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-# method is not specified, Pearson's method will be used by default.
-print Statistics.corr(seriesX, seriesY, method="pearson")
-
-data = ... # an RDD of Vectors
-# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-# If a method is not specified, Pearson's method will be used by default.
-print Statistics.corr(data, method="pearson")
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Stratified sampling
-
-Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
-`sampleByKey` and `sampleByKeyExact` can be performed on RDDs of key-value pairs. For stratified
-sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
-the key can be man or woman, or document ids, and the respective values can be the list of ages
-of the people in the population or the list of words in the documents. The `sampleByKey` method
-flips a coin to decide whether an observation will be sampled, so it requires only one pass over
-the data and provides an *expected* sample size. `sampleByKeyExact` requires significantly more
-resources than the per-stratum simple random sampling used in `sampleByKey`, but provides the
-exact sample size with 99.99% confidence. `sampleByKeyExact` is currently not supported in Python.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
-fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
-size, whereas sampling with replacement requires two additional passes.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.SparkContext._
-import org.apache.spark.rdd.PairRDDFunctions
-
-val sc: SparkContext = ...
-
-val data = ... // an RDD[(K, V)] of any key value pairs
-val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
-
-// Get an exact sample from each stratum
-val approxSample = data.sampleByKey(withReplacement = false, fractions)
-val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
-fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
-size, whereas sampling with replacement requires two additional passes.
-
-{% highlight java %}
-import java.util.Map;
-
-import org.apache.spark.api.java.JavaPairRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-
-JavaSparkContext jsc = ...
-
-JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
-Map<K, Object> fractions = ... // specify the exact fraction desired from each key
-
-// Get an exact sample from each stratum
-JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
-JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
-
-{% endhighlight %}
-</div>
-<div data-lang="python" markdown="1">
-[`sampleByKey()`](api/python/pyspark.rdd.RDD-class.html#sampleByKey) allows users to
-sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
-desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
-set of keys.
-
-*Note:* `sampleByKeyExact()` is currently not supported in Python.
-
-{% highlight python %}
-
-sc = ... # SparkContext
-
-data = ... # an RDD of any key value pairs
-fractions = ... # specify the exact fraction desired from each key as a dictionary
-
-approxSample = data.sampleByKey(False, fractions)
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Hypothesis testing
-
-Hypothesis testing is a powerful tool in statistics for determining whether a result is
-statistically significant, i.e., whether or not it is likely to have occurred by chance. MLlib
-currently supports Pearson's chi-squared ( $\chi^2$) tests for goodness of fit and independence.
-The input data type determines whether the goodness of fit test or the independence test is
-conducted. The goodness of fit test requires an input of type `Vector`, whereas the independence
-test requires a `Matrix` as input.
-
-MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
-independence tests.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
-hypothesis tests.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.stat.Statistics._
-
-val sc: SparkContext = ...
-
-val vec: Vector = ... // a vector composed of the frequencies of events
-
-// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
-println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
-                                 // test statistic, the method used, and the null hypothesis.
-
-val mat: Matrix = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-val independenceTestResult = Statistics.chiSqTest(mat) 
-println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
-
-val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSqTestResult for every feature
-// against the label.
-val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
-var i = 1
-featureTestResults.foreach { result =>
-    println(s"Column $i:\n$result")
-    i += 1
-} // summary of the test 
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
-hypothesis tests.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.regression.LabeledPoint;
-import org.apache.spark.mllib.stat.Statistics;
-import org.apache.spark.mllib.stat.test.ChiSqTestResult;
-
-JavaSparkContext jsc = ...
-
-Vector vec = ... // a vector composed of the frequencies of events
-
-// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
-// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
-// and the null hypothesis.
-System.out.println(goodnessOfFitTestResult);
-
-Matrix mat = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
-// summary of the test including the p-value, degrees of freedom...
-System.out.println(independenceTestResult);
-
-JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSqTestResult for every feature
-// against the label.
-ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
-int i = 1;
-for (ChiSqTestResult result : featureTestResults) {
-    System.out.println("Column " + i + ":");
-    System.out.println(result); // summary of the test
-    i++;
-}
-
-{% endhighlight %}
-</div>
-
-</div>

