[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-48561923
  
@dbtsai About the package name: `stat`, not `stats`, is the standard 
abbreviation for statistics. Check out the URLs returned by Google:

https://www.google.com/#q=statistics+department


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749219
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
@@ -370,10 +239,9 @@ class RowMatrix(
* Computes column-wise summary statistics.
*/
   def computeColumnSummaryStatistics(): MultivariateStatisticalSummary = {
-val zeroValue = new ColumnStatisticsAggregator(numCols().toInt)
-val summary = 
rows.map(_.toBreeze).aggregate[ColumnStatisticsAggregator](zeroValue)(
+val summary = rows.aggregate[OnlineSummarizer](new OnlineSummarizer)(
   (aggregator, data) => aggregator.add(data),
-  (aggregator1, aggregator2) => aggregator1.merge(aggregator2)
+  (aggregator1, aggregator2) => aggregator1.add(aggregator2)
--- End diff --

`merge` or `aggregate` may be better than overloading `add` here.
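
A minimal sketch of the naming suggested here, assuming a hypothetical `MeanSummarizer` that tracks only a count and a mean (it is not the class in this pull request). `add` takes a single sample and `merge` combines two partial summaries, so the two arguments of an `aggregate`-style call read differently. The local `summarize` helper is also an assumption; it only mimics the seqOp/combOp shape on in-memory partitions.

```scala
// Hypothetical stand-in for the summarizer under review; tracks only count and mean.
final class MeanSummarizer extends Serializable {
  private var count: Long = 0L
  private var mean: Double = 0.0

  /** Fold one sample into the running mean. */
  def add(sample: Double): this.type = {
    count += 1
    mean += (sample - mean) / count
    this
  }

  /** Combine with another partial summary; named `merge` instead of a second `add`. */
  def merge(other: MeanSummarizer): this.type = {
    if (other.count > 0) {
      val total = count + other.count
      mean += (other.mean - mean) * other.count / total
      count = total
    }
    this
  }

  def currentCount: Long = count
  def currentMean: Double = mean
}

object MergeNamingSketch {
  /** Mimics the seqOp/combOp shape of an aggregate call on local partitions. */
  def summarize(partitions: Seq[Seq[Double]]): MeanSummarizer =
    partitions
      .map(_.foldLeft(new MeanSummarizer)((agg, x) => agg.add(x)))
      .foldLeft(new MeanSummarizer)((a, b) => a.merge(b))
}
```

With distinct names, a reader can tell the per-sample step from the combine step without checking the argument types.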




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
--- End diff --

sort imports




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749223
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
+ * a streaming fashion.
--- End diff --

`streaming` has a special meaning in Spark. Change it to `online`?




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749222
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
--- End diff --

`non-zero` -> `nonzero`




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749225
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical 
summary of a jointed dataset.
--- End diff --

`a jointed dataset` -> `the corresponding joint dataset`




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749226
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical 
summary of a jointed dataset.
+ *
+ * A numerically stable algorithm is implemented to compute sample mean 
and variance:
+ * Reference: 
[[http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance 
variance-wiki]]
+ * Zero elements (including explicit zero values) are skipped when calling 
add(),
+ * to have time complexity O(nnz) instead of O(n) for each column.
+ */
+@DeveloperApi
+class OnlineSummarizer extends MultivariateStatisticalSummary with 
Serializable {
--- End diff --

Shall we call it `MultivariateOnlineSummarizer`? The name is long but more 
accurate.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749235
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is 
added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting numNonzeros from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.variance
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting variance from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.mean
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting mean from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.max
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting max from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.min
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting min from empty summarizer should throw exception.")
+
+summarizer.add(Vectors.dense(-1.0, 2.0, 6.0))
+summarizer.add(Vectors.sparse(3, Seq((0, -2.0), (1, 6.0
--- End diff --

remove `summarizer.` to use builder pattern
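
A small sketch of the chaining this refers to, assuming a hypothetical `CountingSummarizer` whose `add` returns the receiver, as the `add` in the diff does. The class, its fields, and the dense sample values are illustrative only.

```scala
// Hypothetical minimal class; only the chaining behaviour matters here.
final class CountingSummarizer {
  private var n: Long = 0L
  private var total: Double = 0.0

  /** Returns `this` so that calls can be chained. */
  def add(sample: Array[Double]): CountingSummarizer = {
    n += 1
    total += sample.sum
    this
  }

  def count: Long = n
  def sum: Double = total
}

object ChainingSketch {
  /** Second and later calls drop the repeated receiver name. */
  def build(): CountingSummarizer =
    (new CountingSummarizer)
      .add(Array(-1.0, 2.0, 6.0))
      .add(Array(-2.0, 6.0, 0.0))
}
```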




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749229
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to 
compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or 
dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical 
summary of a jointed dataset.
+ *
+ * A numerically stable algorithm is implemented to compute sample mean 
and variance:
+ * Reference: 
[[http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance 
variance-wiki]]
+ * Zero elements (including explicit zero values) are skipped when calling 
add(),
+ * to have time complexity O(nnz) instead of O(n) for each column.
+ */
+@DeveloperApi
+class OnlineSummarizer extends MultivariateStatisticalSummary with 
Serializable {
+
+  private var n = 0
+  private var currMean: BDV[Double] = _
+  private var currM2n: BDV[Double] = _
+  private var totalCnt: Long = 0
+  private var nnz: BDV[Double] = _
+  private var currMax: BDV[Double] = _
+  private var currMin: BDV[Double] = _
+
+  /**
+   * Add a new sample to this summarizer, and update the statistical 
summary.
+   *
+   * @param sample The sample in dense/sparse vector format to be added 
into this summarizer.
+   * @return This OnlineSummarizer object.
+   */
+  def add(sample: Vector): OnlineSummarizer = {
+if (n == 0) {
+  require(sample.toBreeze.length > 0, s"Vector should have dimension 
larger than zero.")
+  n = sample.toBreeze.length
+
+  currMean = BDV.zeros[Double](n)
+  currM2n = BDV.zeros[Double](n)
+  nnz = BDV.zeros[Double](n)
+  currMax = BDV.fill(n)(Double.MinValue)
+  currMin = BDV.fill(n)(Double.MaxValue)
+}
+
+require(n == sample.toBreeze.length, s"Dimensions mismatch when adding 
new sample." +
+  s" Expecting $n but got ${sample.toBreeze.length}.")
+
+sample.toBreeze.activeIterator.foreach {
+  case (_, 0.0) => // Skip explicit zero elements.
+  case (i, value) =>
+if (currMax(i) < value) {
+  currMax(i) = value
+}
+if (currMin(i) > value) {
+  currMin(i) = value
+}
+
+val tmpPrevMean = currMean(i)
+currMean(i) = (currMean(i) * nnz(i) + value) / (nnz(i) + 1.0)
+currM2n(i) += (value - currMean(i)) * (value - tmpPrevMean)
+
+nnz(i) += 1.0
+}
+
+totalCnt += 1
+this
+  }
+
+  /**
+   * Merge another OnlineSummarizer, and update the statistical summary. 
(Note that it's
+   * in place merging; as a result, this OnlineSummarizer object will be 
modified.)
+   *
+   * @param other The other OnlineSummarizer to be merged.
+   * @return This OnlineSummarizer object.
+   */
+  def add(other: OnlineSummarizer): OnlineSummarizer = {
+if (totalCnt == 0) {
+  other
+} else if (other.totalCnt == 0) {
+  this
+} else {
+  require(n == other.n, s"Dimensions mismatch when merging with 
another summarizer. " +
+s"Expecting $n but got ${other.n}.")
+
+  totalCnt += other.totalCnt
+  val deltaMean: BDV[Double] = currMean - other.currMean
+
+  var i = 0
+  while (i < n) {
+// merge mean together
+if (other.currMean(i) != 0.0) {
+  currMean(i) = (currMean(i) * nnz(i) + other.currMean(i) * 
other.n

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749232
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
--- End diff --

decrease the default value of tol, e.g., to `1e-10`
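
To illustrate why the looser default matters, here is a sketch of the same relative-error check with the default lowered to `1e-10`. The object name `ToleranceSketch` is an assumption; the formula is the one from the diff.

```scala
object ToleranceSketch {
  /** Relative-error comparison in the same form as the helper in the diff. */
  def compareDouble(x: Double, y: Double, tol: Double = 1e-10): Boolean =
    math.abs(x - y) / (math.abs(y) + 1e-15) < tol
}
```

With `tol = 1e-3`, a value such as `1.0005` would compare equal to `1.0`, so an error in the fourth significant digit would pass unnoticed. The tighter default rejects it while still tolerating rounding noise.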




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749234
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is 
added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
--- End diff --

Better to add two spaces of indentation to this line and the next.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749237
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is 
added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting numNonzeros from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.variance
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting variance from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.mean
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting mean from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.max
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting max from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.min
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting min from empty summarizer should throw exception.")
+
+summarizer.add(Vectors.dense(-1.0, 2.0, 6.0))
+summarizer.add(Vectors.sparse(3, Seq((0, -2.0), (1, 6.0
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.dense(3.0, 1.0))
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new dense sample with different array size should throw 
exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.sparse(5, Seq((0, -2.0), (1, 6.0
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new sparse sample with different array size should throw 
exception.")
+
+val summarizer2 = new OnlineSummarizer
+summarizer2.add(Vectors.dense(1.0, -2.0, 0.0, 4.0))
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(summarizer2)
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Merging a new summarizer with different dimensions should throw 
exception.")
+  }
+
+  test("dense vector input") {
+val summarizer = new OnlineSummarizer
+
+// For column 2, the maximum will be 0.0, and it's not explicitly 
added since we ignore all
+// the zeros; it's a case we need to test. For column 3, the minimum 
will be 0.0 which we
+// need to test as well.
+summarizer.add(Vectors.dense(-1.0, 0.0, 6.0))
+summarizer.add(Vectors.dense(3.0, -3.0, 0.0))
+
+assert(summarizer.mean.toArray.corresponds(Vectors.dense(1.0, -1.5, 
3.0).toArray) {
--- End diff --

Could you add a method called `vectorEqual` and use 
`assert(vectorEqual(summarizer.mean, Vectors.dense(...)), "...")` in the code?
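
One possible shape for such a helper, sketched on plain `Array[Double]` so that it runs without Spark on the classpath. The array signature and the object name are assumptions; in the suite it would presumably take MLlib vectors and call `toArray` on them.

```scala
object VectorEqualSketch {
  // Relative-error check for a single pair of entries.
  private def close(x: Double, y: Double, tol: Double = 1e-10): Boolean =
    math.abs(x - y) / (math.abs(y) + 1e-15) < tol

  /** True when both arrays have the same length and every pair of entries is close. */
  def vectorEqual(actual: Array[Double], expected: Array[Double]): Boolean =
    actual.length == expected.length &&
      actual.zip(expected).forall { case (a, e) => close(a, e) }
}
```

The explicit length check makes a dimension mismatch fail the comparison, which `zip` alone would hide by truncating to the shorter array.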




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749243
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is 
added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting numNonzeros from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.variance
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting variance from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.mean
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting mean from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.max
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting max from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.min
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting min from empty summarizer should throw exception.")
+
+summarizer.add(Vectors.dense(-1.0, 2.0, 6.0))
+summarizer.add(Vectors.sparse(3, Seq((0, -2.0), (1, 6.0
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.dense(3.0, 1.0))
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new dense sample with different array size should throw 
exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.sparse(5, Seq((0, -2.0), (1, 6.0
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new sparse sample with different array size should throw 
exception.")
+
+val summarizer2 = new OnlineSummarizer
+summarizer2.add(Vectors.dense(1.0, -2.0, 0.0, 4.0))
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(summarizer2)
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Merging a new summarizer with different dimensions should throw 
exception.")
+  }
+
+  test("dense vector input") {
+val summarizer = new OnlineSummarizer
+
+// For column 2, the maximum will be 0.0, and it's not explicitly added since we ignore all
+// the zeros; it's a case we need to test. For column 3, the minimum will be 0.0 which we
+// need to test as well.
+summarizer.add(Vectors.dense(-1.0, 0.0, 6.0))
+summarizer.add(Vectors.dense(3.0, -3.0, 0.0))
+
+assert(summarizer.mean.toArray.corresponds(Vectors.dense(1.0, -1.5, 3.0).toArray) {
+  compareDouble(_, _)
+}, "mean mismatch")
+
+assert(summarizer.min.toArray.corresponds(Vectors.dense(-1.0, -3, 0.0).toArray) {
+  compareDouble(_, _)
+}, "min mismatch")
+
+assert(summarizer.max.toArray.corresponds(Vectors.dense(3.0, 0.0, 6.0).toArray) {
+  compareDouble(_, _)
+}, "max mismatch")
+
+assert(summarizer.numNonzeros.toArray.corresponds(Vectors.dense(2, 1, 1).toArray) {
+  _.toLong == _.toLong
+}, "numNonzeros mismatch")
+
+  

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749276
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting numNonzeros from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.variance
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting variance from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.mean
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting mean from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.max
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting max from empty summarizer should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.min
+}.getMessage.contains(s"Nothing has been added to this summarizer."),
+  "Getting min from empty summarizer should throw exception.")
+
+summarizer.add(Vectors.dense(-1.0, 2.0, 6.0))
+summarizer.add(Vectors.sparse(3, Seq((0, -2.0), (1, 6.0))))
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.dense(3.0, 1.0))
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new dense sample with different array size should throw exception.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(Vectors.sparse(5, Seq((0, -2.0), (1, 6.0))))
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Adding a new sparse sample with different array size should throw exception.")
+
+val summarizer2 = new OnlineSummarizer
+summarizer2.add(Vectors.dense(1.0, -2.0, 0.0, 4.0))
+assert(intercept[IllegalArgumentException] {
+  summarizer.add(summarizer2)
+}.getMessage.contains(s"Dimensions mismatch"),
+  "Merging a new summarizer with different dimensions should throw exception.")
+  }
+
+  test("dense vector input") {
+val summarizer = new OnlineSummarizer
+
+// For column 2, the maximum will be 0.0, and it's not explicitly added since we ignore all
+// the zeros; it's a case we need to test. For column 3, the minimum will be 0.0 which we
+// need to test as well.
+summarizer.add(Vectors.dense(-1.0, 0.0, 6.0))
+summarizer.add(Vectors.dense(3.0, -3.0, 0.0))
+
+assert(summarizer.mean.toArray.corresponds(Vectors.dense(1.0, -1.5, 3.0).toArray) {
+  compareDouble(_, _)
+}, "mean mismatch")
+
+assert(summarizer.min.toArray.corresponds(Vectors.dense(-1.0, -3, 0.0).toArray) {
+  compareDouble(_, _)
+}, "min mismatch")
+
+assert(summarizer.max.toArray.corresponds(Vectors.dense(3.0, 0.0, 6.0).toArray) {
+  compareDouble(_, _)
+}, "max mismatch")
+
+assert(summarizer.numNonzeros.toArray.corresponds(Vectors.dense(2, 1, 1).toArray) {
+  _.toLong == _.toLong
+}, "numNonzeros mismatch")
+
+  

[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14749377
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/stat/OnlineSummarizerSuite.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+
+class OnlineSummarizerSuite extends FunSuite {
+
+  def compareDouble(x: Double, y: Double, tol: Double = 1E-3): Boolean = {
+math.abs(x - y) / (math.abs(y) + 1e-15) < tol
+  }
+
+  test("basic error handing") {
+val summarizer = new OnlineSummarizer
+
+assert(summarizer.count === 0, "should be zero since nothing is added.")
+
+assert(intercept[IllegalArgumentException] {
+  summarizer.numNonzeros
--- End diff --

btw, it should be sufficient to only test against IllegalArgumentException. 
Asserting on the error message may be error-prone.
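
A minimal sketch of the suggestion above, checking only the exception type and
not the message text. It is written without ScalaTest so that it stands alone;
`EmptyCheckedSummarizer` and `throwsA` are hypothetical names invented for this
illustration and are not part of the PR. In a ScalaTest suite, a bare
`intercept[IllegalArgumentException] { ... }` already fails the test when the
expected exception is not thrown, so no message assertion is needed.

```scala
import scala.reflect.ClassTag
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the summarizer under review; not the MLlib class.
class EmptyCheckedSummarizer {
  private var n = 0L
  private var sum = 0.0

  def add(x: Double): this.type = { n += 1; sum += x; this }

  // `require` throws IllegalArgumentException when the condition is false.
  def mean: Double = {
    require(n > 0, "Nothing has been added to this summarizer.")
    sum / n
  }
}

object ExceptionTypeCheck {
  // Returns true only if evaluating `body` throws an exception of type T.
  // The message is deliberately ignored, so rewording it cannot break the check.
  def throwsA[T <: Throwable](body: => Any)(implicit ct: ClassTag[T]): Boolean =
    Try(body) match {
      case Failure(e) => ct.runtimeClass.isInstance(e)
      case Success(_) => false
    }

  def main(args: Array[String]): Unit = {
    val s = new EmptyCheckedSummarizer
    assert(throwsA[IllegalArgumentException](s.mean), "empty summarizer should throw")
    s.add(2.0)
    assert(!throwsA[IllegalArgumentException](s.mean), "non-empty summarizer should not throw")
  }
}
```

Checking the type alone keeps the test stable if the wording of the error
message changes later.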




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/955#discussion_r14796461
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/OnlineSummarizer.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.mllib.linalg.{Vectors, Vector}
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * OnlineSummarizer implements [[MultivariateStatisticalSummary]] to compute the mean, variance,
+ * minimum, maximum, counts, and non-zero counts for samples in sparse or dense vector format in
+ * a streaming fashion.
+ *
+ * Two OnlineSummarizers can be merged together to have a statistical summary of a jointed dataset.
+ *
+ * A numerically stable algorithm is implemented to compute sample mean and variance:
+ * Reference: [[http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance variance-wiki]]
+ * Zero elements (including explicit zero values) are skipped when calling add(),
+ * to have time complexity O(nnz) instead of O(n) for each column.
+ */
+@DeveloperApi
+class OnlineSummarizer extends MultivariateStatisticalSummary with Serializable {
--- End diff --

I actually wanted to rename MultivariateStatisticalSummary to
StatisticalSummary because the current name is too verbose. For consistency,
though, I will rename this class to MultivariateOnlineSummarizer.
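
The scaladoc quoted in this diff says that zero elements are skipped in add()
to get O(nnz) cost. A rough sketch of how min and max can still come out right
under that scheme follows; it is an assumption about one possible approach, not
the code in this PR, and `SparseMinMax` with all of its members is a
hypothetical name.

```scala
// Hypothetical sketch: per-column min/max that only touches non-zero entries.
class SparseMinMax(numCols: Int) {
  private var count = 0L                       // samples seen so far
  private val nnz = new Array[Long](numCols)   // non-zero entries seen per column
  private val curMin = Array.fill(numCols)(Double.PositiveInfinity)
  private val curMax = Array.fill(numCols)(Double.NegativeInfinity)

  // `entries` holds (columnIndex, value) pairs; columns not listed are zeros.
  def add(entries: Seq[(Int, Double)]): this.type = {
    entries.foreach { case (i, v) =>
      if (v != 0.0) {                          // explicit zeros are skipped too
        nnz(i) += 1
        if (v < curMin(i)) curMin(i) = v
        if (v > curMax(i)) curMax(i) = v
      }
    }
    count += 1
    this
  }

  // A column with fewer non-zeros than samples has seen at least one zero,
  // so 0.0 has to take part in the comparison when the result is read.
  def min: Array[Double] = Array.tabulate(numCols) { i =>
    if (nnz(i) < count) math.min(curMin(i), 0.0) else curMin(i)
  }

  def max: Array[Double] = Array.tabulate(numCols) { i =>
    if (nnz(i) < count) math.max(curMax(i), 0.0) else curMax(i)
  }
}
```

With the two samples from the "dense vector input" test, (-1.0, 0.0, 6.0) and
(3.0, -3.0, 0.0), this sketch gives a maximum of 0.0 for column 2 and a
minimum of 0.0 for column 3, which is the case the test comment calls out.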




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-48701090
  
QA tests have started for PR 955. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16558/consoleFull




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-48701783
  
QA tests have started for PR 955. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16560/consoleFull




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-48702465
  
QA tests have started for PR 955. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16561/consoleFull




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/955

[SPARK-1969][MLlib] Public available online summarizer for mean, variance, 
min, and max

This PR moves the private ColumnStatisticsAggregator class out of RowMatrix
and exposes it as a publicly available DeveloperApi.

Changes:
1) Moved the trait from
org.apache.spark.mllib.stat.MultivariateStatisticalSummary to
org.apache.spark.mllib.stats.Summarizer.
2) Moved the private implementation from
org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to
org.apache.spark.mllib.stats.OnlineSummarizer.
3) The number of columns is no longer needed in the OnlineSummarizer
constructor. It is determined when the first sample is added.
4) Added API documentation for OnlineSummarizer.
5) Added unit tests for OnlineSummarizer.
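
As a rough illustration of changes 2) and 3), here is a minimal sketch of a
summarizer that learns its dimension from the first sample and can be merged
with another one. It is not the OnlineSummarizer from this PR: the class name
`SketchSummarizer`, the use of plain arrays, and the choice of Welford's
update with the pairwise merge formula are assumptions made for this example.

```scala
// Hypothetical sketch; works on plain arrays rather than MLlib vectors.
class SketchSummarizer extends Serializable {
  private var count = 0L
  private var mean: Array[Double] = null   // allocated on the first sample
  private var m2: Array[Double] = null     // sum of squared deviations per column

  def numSamples: Long = count

  def add(sample: Array[Double]): this.type = {
    if (mean == null) {
      // The number of columns is fixed by the first sample, not the constructor.
      mean = new Array[Double](sample.length)
      m2 = new Array[Double](sample.length)
    }
    require(sample.length == mean.length,
      s"Dimensions mismatch: expected ${mean.length}, got ${sample.length}")
    count += 1
    var i = 0
    while (i < sample.length) {
      // Welford's update: numerically stable running mean and variance.
      val delta = sample(i) - mean(i)
      mean(i) += delta / count
      m2(i) += delta * (sample(i) - mean(i))
      i += 1
    }
    this
  }

  def merge(other: SketchSummarizer): this.type = {
    if (other.count > 0L) {
      if (count == 0L) {
        count = other.count
        mean = other.mean.clone()
        m2 = other.m2.clone()
      } else {
        require(other.mean.length == mean.length, "Dimensions mismatch")
        val total = count + other.count
        var i = 0
        while (i < mean.length) {
          // Pairwise combination of two partial means and M2 values.
          val delta = other.mean(i) - mean(i)
          mean(i) += delta * other.count / total
          m2(i) += other.m2(i) + delta * delta * count * other.count / total
          i += 1
        }
        count = total
      }
    }
    this
  }

  def currentMean: Array[Double] = {
    require(count > 0, "Nothing has been added to this summarizer.")
    mean.clone()
  }

  // Sample variance (divides by count - 1); 0.0 when only one sample was seen.
  def currentVariance: Array[Double] = {
    require(count > 0, "Nothing has been added to this summarizer.")
    m2.map(v => if (count > 1) v / (count - 1) else 0.0)
  }
}
```

A distributed aggregation would call add() within each partition and merge()
across partitions, which is the shape of the `rows.aggregate` call in
RowMatrix.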

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dbtsai-summarizer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #955


commit 6d0e596a71b44c21b86ba3407d6dc62b0b684198
Author: DB Tsai 
Date:   2014-06-03T03:01:16Z

First version.

commit 1bd8e0c7ded84049371b29bc47c666957f07d091
Author: DB Tsai 
Date:   2014-06-03T20:53:50Z

Some cleanup.






[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45021851
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45022509
  
MultivariateStatisticalSummary is a public API -- we can't rename it 
arbitrarily. Why does it need to be renamed?




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45021837
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45023171
  
Since the "Statistical" in MultivariateStatisticalSummary is already implied
by the package name "stat", I think it is worth having a more concise name.
Also, most people abbreviate statistics as "stats", so I changed the package
name from "stat" to "stats".

Since it's already a public API, I have no problem changing it back.







[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024074
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024088
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024215
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15407/




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024214
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024583
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45024565
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45026137
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45026138
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15405/




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45026777
  
I don't know why Jenkins is unhappy about removing "private class
ColumnStatisticsAggregator(private val n: Int)". After all, it's a private
class.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45028406
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15408/




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45028404
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45066533
  
Maybe this is a MIMA problem. Found this (from @pwendell):

https://groups.google.com/forum/#!topic/migration-manager-user/5aQ0xxsL2lU




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-04 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45124672
  
@mengxr Got it. It's a false-positive error. Do you have any comments or
feedback on moving it out as a public API? I'm building a feature scaling API
in MLUtils which depends on this. Thanks.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45292634
  
@dbtsai The current workaround is excluding it in 
`project/MimaExcludes.scala`. Please check the examples there. At least, we 
need to make Jenkins happy.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45297366
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45297370
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45297396
  
OK. It would be better to have MIMA exclude private classes automatically, or
we could add an annotation for private classes.




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45299828
  
Merged build finished. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45299830
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15492/




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45309135
  
 Build triggered. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45322839
  
Build started. 




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45327623
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15504/




[GitHub] spark pull request: [SPARK-1969][MLlib] Public available online su...

2014-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/955#issuecomment-45327622
  
Build finished. All automated tests passed.

