Repository: spark
Updated Branches:
  refs/heads/master e1139dd60 -> 43dfc84f8
[SPARK-2830][MLLIB] doc update for 1.1

1. renamed mllib-basics to mllib-data-types
2. renamed mllib-stats to mllib-statistics
3. moved random data generation to the bottom of mllib-stats
4. updated toc accordingly

atalwalkar

Author: Xiangrui Meng <m...@databricks.com>

Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:

0bd79f3 [Xiangrui Meng] add mllib-data-types
b64a5d7 [Xiangrui Meng] update the content list of basic statistics in mllib-guide
f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43dfc84f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43dfc84f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43dfc84f

Branch: refs/heads/master
Commit: 43dfc84f883822ea27b6e312d4353bf301c2e7ef
Parents: e1139dd
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 27 01:19:48 2014 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 27 01:19:48 2014 -0700

----------------------------------------------------------------------
 docs/mllib-basics.md                   | 468 ----------------------------
 docs/mllib-data-types.md               | 468 ++++++++++++++++++++++++++++
 docs/mllib-dimensionality-reduction.md |   4 +-
 docs/mllib-guide.md                    |   9 +-
 docs/mllib-statistics.md               | 457 +++++++++++++++++++++++++++
 docs/mllib-stats.md                    | 457 ---------------------------
 6 files changed, 932 insertions(+), 931 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-basics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
deleted file mode 100644
index 8752df4..0000000
--- a/docs/mllib-basics.md
+++ /dev/null
@@ -1,468 +0,0 @@
[468 deleted lines omitted: identical to the new docs/mllib-data-types.md below, apart from the title.]


http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
new file mode 100644
index 0000000..101dc2f
--- /dev/null
+++ b/docs/mllib-data-types.md
@@ -0,0 +1,468 @@
+---
+layout: global
+title: Data Types - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Data Types
+---
+
+* Table of contents
+{:toc}
+
+MLlib supports local vectors and matrices stored on a single machine,
+as well as distributed matrices backed by one or more RDDs.
+Local vectors and local matrices are simple data models
+that serve as public interfaces. The underlying linear algebra operations are provided by
+[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
+A training example used in supervised learning is called a "labeled point" in MLlib.
+
+## Local vector
+
+A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
+machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by
+a double array representing its entry values, while a sparse vector is backed by two parallel
+arrays: indices and values. For example, a vector `(1.0, 0.0, 3.0)` can be represented in dense
+format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
+of the vector.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The base class of local vectors is
+[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
+implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
+[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+using the factory methods implemented in
+[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+// Create a dense vector (1.0, 0.0, 3.0).
+val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
+val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
+val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
+{% endhighlight %}
+
+***Note:***
+Scala imports `scala.collection.immutable.Vector` by default, so you have to import
+`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+The base class of local vectors is
+[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
+implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
+[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
+using the factory methods implemented in
+[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vectors.html) to create local vectors.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+// Create a dense vector (1.0, 0.0, 3.0).
+Vector dv = Vectors.dense(1.0, 0.0, 3.0);
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
+Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+MLlib recognizes the following types as dense vectors:
+
+* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
+* Python's list, e.g., `[1, 2, 3]`
+
+and the following as sparse vectors:
+
+* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html)
+* SciPy's
+  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
+  with a single column
+
+We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
+in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
+
+{% highlight python %}
+import numpy as np
+import scipy.sparse as sps
+from pyspark.mllib.linalg import Vectors
+
+# Use a NumPy array as a dense vector.
+dv1 = np.array([1.0, 0.0, 3.0])
+# Use a Python list as a dense vector.
+dv2 = [1.0, 0.0, 3.0]
+# Create a SparseVector.
+sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
+# Use a single-column SciPy csc_matrix as a sparse vector.
+sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))
+{% endhighlight %}
+
+</div>
+</div>
+
+## Labeled point
+
+A labeled point is a local vector, either dense or sparse, associated with a label/response.
+In MLlib, labeled points are used in supervised learning algorithms.
+We use a double to store a label, so we can use labeled points in both regression and classification.
+For binary classification, a label should be either `0` (negative) or `1` (positive).
+For multiclass classification, labels should be class indices starting from zero: `0, 1, 2, ...`.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+A labeled point is represented by the case class
+[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Create a labeled point with a positive label and a dense feature vector.
+val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
+
+// Create a labeled point with a negative label and a sparse feature vector.
+val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+A labeled point is represented by
+[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.regression.LabeledPoint;
+
+// Create a labeled point with a positive label and a dense feature vector.
+LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
+
+// Create a labeled point with a negative label and a sparse feature vector.
+LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+A labeled point is represented by
+[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).
+
+{% highlight python %}
+from pyspark.mllib.linalg import SparseVector
+from pyspark.mllib.regression import LabeledPoint
+
+# Create a labeled point with a positive label and a dense feature vector.
+pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
+
+# Create a labeled point with a negative label and a sparse feature vector.
+neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
+{% endhighlight %}
+</div>
+</div>
+
+***Sparse data***
+
+It is very common in practice to have sparse training data. MLlib supports reading training
+examples stored in `LIBSVM` format, which is the default format used by
+[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
+[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format in which each line
+represents a labeled sparse feature vector using the following format:
+
+~~~
+label index1:value1 index2:value2 ...
+~~~
+
+where the indices are one-based and in ascending order.
+After loading, the feature indices are converted to zero-based.
+For example, the line `1 1:2.5 3:1.0` encodes a labeled point with label `1.0` and feature
+vector `(2.5, 0.0, 1.0)`.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+examples stored in LIBSVM format.
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
+examples stored in LIBSVM format.
+
+{% highlight java %}
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.api.java.JavaRDD;
+
+JavaRDD<LabeledPoint> examples = 
+  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
+examples stored in LIBSVM format.
+
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+{% endhighlight %}
+</div>
+</div>
+
+## Local matrix
+
+A local matrix has integer-typed row and column indices and double-typed values, stored on a single
+machine. MLlib supports dense matrices, whose entry values are stored in a single double array in
+column-major order. For example, the following matrix `\[ \begin{pmatrix}
+1.0 & 2.0 \\
+3.0 & 4.0 \\
+5.0 & 6.0
+\end{pmatrix}
+\]`
+is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The base class of local matrices is
+[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
+implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+We recommend using the factory methods implemented
+in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
+matrices.
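+
+In column-major order, entry `(i, j)` of an `m`-by-`n` dense matrix sits at position `i + j * m`
+of the values array. A small sanity check against the matrix above (an illustrative snippet, not
+part of the MLlib API):
+
+{% highlight scala %}
+val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0) // the 3 x 2 matrix above, column-major
+val (m, i, j) = (3, 2, 1)                        // numRows, row index, column index
+val entry = values(i + j * m)                    // 6.0, the (2, 1) entry
+{% endhighlight %}
+
+The factory methods are used as follows.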
+ +{% highlight scala %} +import org.apache.spark.mllib.linalg.{Matrix, Matrices} + +// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) +val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)) +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> + +The base class of local matrices is +[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one +implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html). +We recommend using the factory methods implemented +in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local +matrices. + +{% highlight java %} +import org.apache.spark.mllib.linalg.Matrix; +import org.apache.spark.mllib.linalg.Matrices; + +// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) +Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0}); +{% endhighlight %} +</div> + +</div> + +## Distributed matrix + +A distributed matrix has long-typed row and column indices and double-typed values, stored +distributively in one or more RDDs. It is very important to choose the right format to store large +and distributed matrices. Converting a distributed matrix to a different format may require a +global shuffle, which is quite expensive. Three types of distributed matrices have been implemented +so far. + +The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed +matrix without meaningful row indices, e.g., a collection of feature vectors. +It is backed by an RDD of its rows, where each row is a local vector. +We assume that the number of columns is not huge for a `RowMatrix` so that a single +local vector can be reasonably communicated to the driver and can also be stored / +operated on using a single node. +An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices, +which can be used for identifying rows and executing joins. +A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format, +backed by an RDD of its entries. + +***Note*** + +The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size. +In general the use of non-deterministic RDDs can lead to errors. + +### RowMatrix + +A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD +of its rows, where each row is a local vector. +Since each row is represented by a local vector, the number of columns is +limited by the integer range but it should be much smaller in practice. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be +created from an `RDD[Vector]` instance. Then we can compute its column summary statistics. + +{% highlight scala %} +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.linalg.distributed.RowMatrix + +val rows: RDD[Vector] = ... // an RDD of local vectors +// Create a RowMatrix from an RDD[Vector]. +val mat: RowMatrix = new RowMatrix(rows) + +// Get its size. +val m = mat.numRows() +val n = mat.numCols() +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> + +A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be +created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics. 
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
+// Create a RowMatrix from a JavaRDD<Vector>.
+RowMatrix mat = new RowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+{% endhighlight %}
+</div>
+</div>
+
+### IndexedRowMatrix
+
+An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices. It is backed by
+an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+An
+[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+can be created from an `RDD[IndexedRow]` instance, where
+[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
+its row indices.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
+
+val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
+// Create an IndexedRowMatrix from an RDD[IndexedRow].
+val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Drop its row indices.
+val rowMat: RowMatrix = mat.toRowMatrix()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+An
+[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
+can be created from a `JavaRDD<IndexedRow>` instance, where
+[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
+wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
+its row indices.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.IndexedRow;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
+// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
+IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Drop its row indices.
+RowMatrix rowMat = mat.toRowMatrix();
+{% endhighlight %}
+</div></div>
+
+### CoordinateMatrix
+
+A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries. Each entry is a tuple
+of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
+`value` is the entry value. A `CoordinateMatrix` should be used only when both
+dimensions of the matrix are huge and the matrix is very sparse.
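+
+As a hypothetical illustration (the file name and its `i j value` line format are assumptions,
+not part of MLlib), the entries might be parsed from a text file before the basic usage shown
+below:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.MatrixEntry
+
+// Parse lines of the form "i j value" into matrix entries.
+val parsedEntries = sc.textFile("entries.txt").map { line =>
+  val Array(i, j, value) = line.split(' ')
+  MatrixEntry(i.toLong, j.toLong, value.toDouble)
+}
+{% endhighlight %}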
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
+
+val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
+// Create a CoordinateMatrix from an RDD[MatrixEntry].
+val mat: CoordinateMatrix = new CoordinateMatrix(entries)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+val indexedRowMatrix = mat.toIndexedRowMatrix()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+A
+[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
+can be created from a `JavaRDD<MatrixEntry>` instance, where
+[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
+wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. Other computations for
+`CoordinateMatrix` are not currently supported.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
+
+JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
+// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
+CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
+{% endhighlight %}
+</div>
+</div>


http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index 9f2cf6d..21cb35b 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
 of reducing the number of variables under consideration. It can be used to extract latent features
 from raw and noisy features or compress data while maintaining the structure.
 
-MLlib provides support for dimensionality reduction on the <a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
+MLlib provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
 
 ## Singular value decomposition (SVD)
@@ -58,7 +58,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
 
 ### SVD Example
 
 MLlib provides SVD functionality to row-oriented matrices, provided in the
-<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
+<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
<div class="codetabs"> <div data-lang="scala" markdown="1"> http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-guide.md ---------------------------------------------------------------------- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 4d4198b..d3a510b 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -7,12 +7,13 @@ MLlib is Spark's scalable machine learning library consisting of common learning including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below: -* [Data types](mllib-basics.html) -* [Basic statistics](mllib-stats.html) - * random data generation - * stratified sampling +* [Data types](mllib-data-types.html) +* [Basic statistics](mllib-statistics.html) * summary statistics + * correlations + * stratified sampling * hypothesis testing + * random data generation * [Classification and regression](mllib-classification-regression.html) * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html) * [decision trees](mllib-decision-tree.html) http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-statistics.md ---------------------------------------------------------------------- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md new file mode 100644 index 0000000..c463241 --- /dev/null +++ b/docs/mllib-statistics.md @@ -0,0 +1,457 @@ +--- +layout: global +title: Basic Statistics - MLlib +displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics +--- + +* Table of contents +{:toc} + + +`\[ +\newcommand{\R}{\mathbb{R}} +\newcommand{\E}{\mathbb{E}} +\newcommand{\x}{\mathbf{x}} +\newcommand{\y}{\mathbf{y}} +\newcommand{\wv}{\mathbf{w}} +\newcommand{\av}{\mathbf{\alpha}} +\newcommand{\bv}{\mathbf{b}} +\newcommand{\N}{\mathbb{N}} +\newcommand{\id}{\mathbf{I}} +\newcommand{\ind}{\mathbf{1}} +\newcommand{\0}{\mathbf{0}} +\newcommand{\unit}{\mathbf{e}} +\newcommand{\one}{\mathbf{1}} +\newcommand{\zero}{\mathbf{0}} +\]` + +## Summary statistics + +We provide column summary statistics for `RDD[Vector]` through the function `colStats` +available in `Statistics`. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of +[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary), +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the +total count. + +{% highlight scala %} +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics} + +val observations: RDD[Vector] = ... // an RDD of Vectors + +// Compute column summary statistics. +val summary: MultivariateStatisticalSummary = Statistics.colStats(observations) +println(summary.mean) // a dense vector containing the mean value for each column +println(summary.variance) // column-wise variance +println(summary.numNonzeros) // number of nonzeros in each column + +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> + +[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of +[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html), +which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the +total count. 
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaRDD<Vector> mat = ... // an RDD of Vectors
+
+// Compute column summary statistics.
+MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
+System.out.println(summary.mean()); // a dense vector containing the mean value for each column
+System.out.println(summary.variance()); // column-wise variance
+System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`colStats()`](api/python/pyspark.mllib.stat.Statistics-class.html#colStats) returns an instance of
+[`MultivariateStatisticalSummary`](api/python/pyspark.mllib.stat.MultivariateStatisticalSummary-class.html),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+mat = ... # an RDD of Vectors
+
+# Compute column summary statistics.
+summary = Statistics.colStats(mat)
+print summary.mean()
+print summary.variance()
+print summary.numNonzeros()
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Correlations
+
+Calculating the correlation between two series of data is a common operation in statistics. In MLlib
+we provide the flexibility to calculate pairwise correlations among many series. The supported
+correlation methods are currently Pearson's and Spearman's correlation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.stat.Statistics
+
+val sc: SparkContext = ...
+
+val seriesX: RDD[Double] = ... // a series
+val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
+
+val data: RDD[Vector] = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+val correlMatrix: Matrix = Statistics.corr(data, "pearson")
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
+a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.stat.Statistics;
+
+JavaSparkContext jsc = ...
+
+JavaDoubleRDD seriesX = ... // a series
+JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
+
+// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+// method is not specified, Pearson's method will be used by default.
+Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
+
+JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
+
+// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+// If a method is not specified, Pearson's method will be used by default.
+Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
+
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`Statistics`](api/python/pyspark.mllib.stat.Statistics-class.html) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+seriesX = ... # a series
+seriesY = ... # must have the same number of partitions and cardinality as seriesX
+
+# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
+# method is not specified, Pearson's method will be used by default.
+print Statistics.corr(seriesX, seriesY, method="pearson")
+
+data = ... # an RDD of Vectors
+# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
+# If a method is not specified, Pearson's method will be used by default.
+print Statistics.corr(data, method="pearson")
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Stratified sampling
+
+Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
+`sampleByKey` and `sampleByKeyExact` can be performed on RDDs of key-value pairs. For stratified
+sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
+the key can be man or woman, or document IDs, and the respective values can be the list of ages
+of the people in the population or the list of words in the documents. The `sampleByKey` method
+flips a coin to decide whether an observation will be sampled, therefore requires only one pass
+over the data, and provides an *expected* sample size. `sampleByKeyExact` requires significantly
+more resources than the per-stratum simple random sampling used in `sampleByKey`, but provides
+the exact sample size with 99.99% confidence. `sampleByKeyExact` is currently not supported in
+Python.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
+keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
+size, whereas sampling with replacement requires two additional passes.
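+
+For instance, with a hypothetical toy dataset (the keys, values, and fractions below are
+illustrative only):
+
+{% highlight scala %}
+// Two strata, keyed "a" and "b".
+val toyData = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)))
+// Keep roughly 10% of stratum "a" and 60% of stratum "b".
+val toyFractions = Map("a" -> 0.1, "b" -> 0.6)
+val toySample = toyData.sampleByKey(withReplacement = false, toyFractions)
+{% endhighlight %}
+
+The general pattern is shown below.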
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+import org.apache.spark.rdd.PairRDDFunctions
+
+val sc: SparkContext = ...
+
+val data = ... // an RDD[(K, V)] of any key value pairs
+val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
+
+// Get an approximate sample from each stratum
+val approxSample = data.sampleByKey(withReplacement = false, fractions)
+// Get an exact sample from each stratum
+val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
+keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight java %}
+import java.util.Map;
+
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+JavaSparkContext jsc = ...
+
+JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
+Map<K, Object> fractions = ... // specify the exact fraction desired from each key
+
+// Get an approximate sample from each stratum
+JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
+// Get an exact sample from each stratum
+JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
+
+{% endhighlight %}
+</div>
+<div data-lang="python" markdown="1">
+[`sampleByKey()`](api/python/pyspark.rdd.RDD-class.html#sampleByKey) allows users to
+sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
+desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
+set of keys.
+
+*Note:* `sampleByKeyExact()` is currently not supported in Python.
+
+{% highlight python %}
+
+sc = ... # SparkContext
+
+data = ... # an RDD of any key value pairs
+fractions = ... # specify the exact fraction desired from each key as a dictionary
+
+approxSample = data.sampleByKey(False, fractions)
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Hypothesis testing
+
+Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
+significant, i.e., whether or not it occurred by chance. MLlib currently supports Pearson's
+chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine
+whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
+an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
+
+MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
+independence tests.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
+hypothesis tests.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.mllib.stat.test.ChiSqTestResult
+
+val sc: SparkContext = ...
+
+val vec: Vector = ... // a vector composed of the frequencies of events
+
+// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
+// the test runs against a uniform distribution.
+val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
+println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
+                                 // test statistic, the method used, and the null hypothesis.
+
+val mat: Matrix = ... // a contingency matrix
+
+// conduct Pearson's independence test on the input contingency matrix
+val independenceTestResult = Statistics.chiSqTest(mat)
+println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
+
+val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
+
+// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+// the independence test. Returns an array containing the ChiSqTestResult for every feature
+// against the label.
+val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
+var i = 1
+featureTestResults.foreach { result =>
+  println(s"Column $i:\n$result")
+  i += 1
+} // summary of the test
+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
+hypothesis tests.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.*;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.stat.Statistics;
+import org.apache.spark.mllib.stat.test.ChiSqTestResult;
+
+JavaSparkContext jsc = ...
+
+Vector vec = ... // a vector composed of the frequencies of events
+
+// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
+// the test runs against a uniform distribution.
+ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
+// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
+// and the null hypothesis.
+System.out.println(goodnessOfFitTestResult);
+
+Matrix mat = ... // a contingency matrix
+
+// conduct Pearson's independence test on the input contingency matrix
+ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
+// summary of the test including the p-value, degrees of freedom...
+System.out.println(independenceTestResult);
+
+JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points
+
+// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
+// the independence test. Returns an array containing the ChiSqTestResult for every feature
+// against the label.
+ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
+int i = 1;
+for (ChiSqTestResult result : featureTestResults) {
+  System.out.println("Column " + i + ":");
+  System.out.println(result); // summary of the test
+  i++;
+}
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Random data generation
+
+Random data generation is useful for randomized algorithms, prototyping, and performance testing.
+MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
+uniform, standard normal, or Poisson.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
+methods to generate random double RDDs or vector RDDs.
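+As a brief aside, uniform and Poisson values can be generated the same way (an illustrative
+sketch; the variable names are arbitrary):
+
+{% highlight scala %}
+import org.apache.spark.mllib.random.RandomRDDs._
+
+// 100 i.i.d. values drawn from the uniform distribution U(0, 1), in 2 partitions.
+val w = uniformRDD(sc, 100L, 2)
+// 100 i.i.d. values drawn from the Poisson distribution with mean 5.0, in 2 partitions.
+val p = poissonRDD(sc, 5.0, 100L, 2)
+{% endhighlight %}
+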
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD, whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.random.RandomRDDs._
+
+val sc: SparkContext = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+val u = normalRDD(sc, 1000000L, 10)
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+val v = u.map(x => 1.0 + 2.0 * x)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`RandomRDDs`](api/java/org/apache/spark/mllib/random/RandomRDDs.html) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD, whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.DoubleFunction;
+import static org.apache.spark.mllib.random.RandomRDDs.*;
+
+JavaSparkContext jsc = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+JavaDoubleRDD v = u.mapToDouble(
+  new DoubleFunction<Double>() {
+    public double call(Double x) {
+      return 1.0 + 2.0 * x;
+    }
+  });
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD, whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight python %}
+from pyspark.mllib.random import RandomRDDs
+
+sc = ... # SparkContext
+
+# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+u = RandomRDDs.normalRDD(sc, 1000000L, 10)
+# Apply a transform to get a random double RDD following `N(1, 4)`.
+v = u.map(lambda x: 1.0 + 2.0 * x)
+{% endhighlight %}
+</div>
+
+</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/43dfc84f/docs/mllib-stats.md
----------------------------------------------------------------------
diff --git a/docs/mllib-stats.md b/docs/mllib-stats.md
deleted file mode 100644
index 511a9fb..0000000
--- a/docs/mllib-stats.md
+++ /dev/null
@@ -1,457 +0,0 @@
----
-layout: global
-title: Statistics Functionality - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
----
-
-* Table of contents
-{:toc}
-
-
-`\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
-\newcommand{\zero}{\mathbf{0}}
-\]`
-
-## Summary statistics
-
-We provide column summary statistics for `RDD[Vector]` through the function `colStats`
-available in `Statistics`.
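-
-As a minimal sketch of building this input (assuming an existing `SparkContext` named `sc`;
-the values are made up):
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vectors
-
-val observations = sc.parallelize(Seq(
-  Vectors.dense(1.0, 10.0, 100.0),
-  Vectors.dense(2.0, 20.0, 200.0),
-  Vectors.dense(3.0, 30.0, 300.0)))
-{% endhighlight %}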
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
-[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vector
-import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
-import org.apache.spark.rdd.RDD
-
-val observations: RDD[Vector] = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
-println(summary.mean) // a dense vector containing the mean value for each column
-println(summary.variance) // column-wise variance
-println(summary.numNonzeros) // number of nonzeros in each column
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-[`colStats()`](api/java/org/apache/spark/mllib/stat/Statistics.html) returns an instance of
-[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaRDD<Vector> mat = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
-System.out.println(summary.mean()); // a dense vector containing the mean value for each column
-System.out.println(summary.variance()); // column-wise variance
-System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`colStats()`](api/python/pyspark.mllib.stat.Statistics-class.html#colStats) returns an instance of
-[`MultivariateStatisticalSummary`](api/python/pyspark.mllib.stat.MultivariateStatisticalSummary-class.html),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-mat = ... # an RDD of Vectors
-
-# Compute column summary statistics.
-summary = Statistics.colStats(mat)
-print summary.mean()
-print summary.variance()
-print summary.numNonzeros()
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Random data generation
-
-Random data generation is useful for randomized algorithms, prototyping, and performance testing.
-MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
-uniform, standard normal, or Poisson.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD, whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.random.RandomRDDs._
-
-val sc: SparkContext = ...
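-
-// (The RandomRDDs factory methods also accept an optional seed argument when
-// reproducible output is desired.)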
-
-// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-val u = normalRDD(sc, 1000000L, 10)
-// Apply a transform to get a random double RDD following `N(1, 4)`.
-val v = u.map(x => 1.0 + 2.0 * x)
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`RandomRDDs`](api/java/org/apache/spark/mllib/random/RandomRDDs.html) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD, whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaDoubleRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.api.java.function.DoubleFunction;
-import static org.apache.spark.mllib.random.RandomRDDs.*;
-
-JavaSparkContext jsc = ...
-
-// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
-// Apply a transform to get a random double RDD following `N(1, 4)`.
-JavaDoubleRDD v = u.mapToDouble(
-  new DoubleFunction<Double>() {
-    public double call(Double x) {
-      return 1.0 + 2.0 * x;
-    }
-  });
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
-methods to generate random double RDDs or vector RDDs.
-The following example generates a random double RDD, whose values follow the standard normal
-distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-
-{% highlight python %}
-from pyspark.mllib.random import RandomRDDs
-
-sc = ... # SparkContext
-
-# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
-# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
-u = RandomRDDs.normalRDD(sc, 1000000L, 10)
-# Apply a transform to get a random double RDD following `N(1, 4)`.
-v = u.map(lambda x: 1.0 + 2.0 * x)
-{% endhighlight %}
-</div>
-
-</div>
-
-## Correlations
-
-Calculating the correlation between two series of data is a common operation in statistics. MLlib
-provides the flexibility to calculate pairwise correlations among many series. The supported
-correlation methods are currently Pearson's and Spearman's correlation.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
-an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.stat.Statistics
-import org.apache.spark.rdd.RDD
-
-val sc: SparkContext = ...
-
-val seriesX: RDD[Double] = ... // a series
-val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Use "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
-
-val data: RDD[Vector] = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-val correlMatrix: Matrix = Statistics.corr(data, "pearson")
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
-a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaDoubleRDD;
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaDoubleRDD seriesX = ... // a series
-JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Use "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
-
-JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
-
-{% endhighlight %}
-</div>
-
-<div data-lang="python" markdown="1">
-[`Statistics`](api/python/pyspark.mllib.stat.Statistics-class.html) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
-an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-seriesX = ... # a series
-seriesY = ... # must have the same number of partitions and cardinality as seriesX
-
-# Compute the correlation using Pearson's method. Use "spearman" for Spearman's method. If a
-# method is not specified, Pearson's method will be used by default.
-print Statistics.corr(seriesX, seriesY, method="pearson")
-
-data = ... # an RDD of Vectors
-
-# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-# If a method is not specified, Pearson's method will be used by default.
-print Statistics.corr(data, method="pearson")
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Stratified sampling
-
-Unlike the other statistics functions, which reside in MLlib, the stratified sampling methods
-`sampleByKey` and `sampleByKeyExact` are performed on RDDs of key-value pairs. For stratified
-sampling, the keys can be thought of as labels and the values as specific attributes. For example,
-the keys can be gender labels or document ids, and the corresponding values can be the ages of the
-people in a population or the words in the documents. The `sampleByKey` method flips a coin to
-decide whether or not an observation will be sampled, so it requires only one pass over the data
-and provides an *expected* sample size. `sampleByKeyExact` requires significantly more resources
-than the per-stratum simple random sampling used in `sampleByKey`, but provides the exact sample
-size with 99.99% confidence. `sampleByKeyExact` is currently not supported in Python.
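-
-As a quick illustration (assuming an existing `SparkContext` named `sc` and made-up data),
-sampling half of one stratum and a tenth of another looks like:
-
-{% highlight scala %}
-import org.apache.spark.SparkContext._
-
-val data = sc.parallelize(Seq(("f", 28), ("f", 35), ("m", 41), ("m", 19), ("m", 52)))
-val fractions = Map("f" -> 0.5, "m" -> 0.1)
-val sample = data.sampleByKey(withReplacement = false, fractions)
-{% endhighlight %}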
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
-fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
-size, whereas sampling with replacement requires two additional passes.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.SparkContext._
-import org.apache.spark.rdd.PairRDDFunctions
-
-val sc: SparkContext = ...
-
-val data = ... // an RDD[(K, V)] of any key-value pairs
-val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
-
-// Get an approximate sample from each stratum
-val approxSample = data.sampleByKey(withReplacement = false, fractions)
-// Get an exact sample from each stratum
-val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
-fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
-size, whereas sampling with replacement requires two additional passes.
-
-{% highlight java %}
-import java.util.Map;
-
-import org.apache.spark.api.java.JavaPairRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-
-JavaSparkContext jsc = ...
-
-JavaPairRDD<K, V> data = ... // an RDD of any key-value pairs
-Map<K, Object> fractions = ... // specify the exact fraction desired from each key
-
-// Get an approximate sample from each stratum
-JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
-// Get an exact sample from each stratum
-JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
-
-{% endhighlight %}
-</div>
-<div data-lang="python" markdown="1">
-[`sampleByKey()`](api/python/pyspark.rdd.RDD-class.html#sampleByKey) allows users to
-sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
-desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
-set of keys.
-
-*Note:* `sampleByKeyExact()` is currently not supported in Python.
-
-{% highlight python %}
-
-sc = ... # SparkContext
-
-data = ... # an RDD of any key-value pairs
-fractions = ... # specify the exact fraction desired from each key as a dictionary
-
-# Get an approximate sample from each stratum
-approxSample = data.sampleByKey(False, fractions)
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Hypothesis testing
-
-Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
-significant, i.e., whether it is likely to have occurred by chance. MLlib currently supports
-Pearson's chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data type
-determines whether the goodness-of-fit test or the independence test is conducted: the
-goodness-of-fit test requires an input of type `Vector`, whereas the independence test requires a
-`Matrix` as input.
-
-MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
-independence tests.
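-
-As a concrete sketch of the two input types (the numbers are made up), a frequency vector and a
-contingency matrix can be constructed as follows:
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.{Matrices, Vectors}
-
-// observed frequencies over five categories
-val vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
-// a 3x2 contingency matrix, with values stored in column-major order
-val mat = Matrices.dense(3, 2, Array(13.0, 47.0, 40.0, 80.0, 11.0, 9.0))
-{% endhighlight %}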
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
-hypothesis tests.
-
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.stat.Statistics
-import org.apache.spark.mllib.stat.test.ChiSqTestResult
-import org.apache.spark.rdd.RDD
-
-val sc: SparkContext = ...
-
-val vec: Vector = ... // a vector composed of the frequencies of events
-
-// Compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
-println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
-                                 // test statistic, the method used, and the null hypothesis.
-
-val mat: Matrix = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-val independenceTestResult = Statistics.chiSqTest(mat)
-println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
-
-val obs: RDD[LabeledPoint] = ... // (feature, label) pairs
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSqTestResult for every feature
-// against the label.
-val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
-var i = 1
-featureTestResults.foreach { result =>
-  println(s"Column $i:\n$result") // summary of the test
-  i += 1
-}
-
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
-hypothesis tests.
-
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.regression.LabeledPoint;
-import org.apache.spark.mllib.stat.Statistics;
-import org.apache.spark.mllib.stat.test.ChiSqTestResult;
-
-JavaSparkContext jsc = ...
-
-Vector vec = ... // a vector composed of the frequencies of events
-
-// Compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
-// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
-// and the null hypothesis.
-System.out.println(goodnessOfFitTestResult);
-
-Matrix mat = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
-// summary of the test including the p-value, degrees of freedom...
-System.out.println(independenceTestResult);
-
-JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSqTestResult for every feature
-// against the label.
-ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
-int i = 1;
-for (ChiSqTestResult result : featureTestResults) {
-  System.out.println("Column " + i + ":");
-  System.out.println(result); // summary of the test
-  i++;
-}
-
-{% endhighlight %}
-</div>
-
-</div>