[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...

2018-08-20 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/22136
  
@huaxingao thank you for your pull request. Can you please add a test to 
make sure this does not regress?


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150650409
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala 
---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the spark recursive flag and then restores it.
+   *
+   * @param value Value to set
+   * @param spark Existing spark session
+   * @param f The function to evaluate after setting the flag
+   * @return Returns the evaluation result T of the function
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): 
T = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+hadoopConf.set(flagName, value.toString)
+try f finally {
+  old match {
+case Some(v) => hadoopConf.set(flagName, v)
+case None => hadoopConf.unset(flagName)
+  }
+}
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff --

Ok, if we think that duplicate file names are not an issue, then I would prefer using the deterministic hashing-based scheme. I am also happy to make this into another PR since this is a pretty small matter.

@imatiach-msft a good hashing function is certainly a concern, and we will need to make a trade-off between performance and correctness. If we really want to make sure that it works as expected, the best option is probably to use a cryptographic hash such as `SHA-256` (which has strong guarantees on the distribution of its output values):

https://stackoverflow.com/questions/5531455/how-to-hash-some-string-with-sha256-in-java
We have `murmur3` in the Spark source code, but it is not cryptographic and does not come with guarantees that are as strong. For the sake of performance, we may want to use it eventually.

Once you have the digest, which is a byte array, it can first be converted to a long (by taking 8 bytes from the whole digest):

https://stackoverflow.com/questions/4485128/how-do-i-convert-long-to-byte-and-back-in-java
and then you can convert this long to a double and compare it to the 
requested fraction:

```java
long l = ...; // the first 8 bytes of the digest, interpreted as a long
double d = Math.abs((double) l);
boolean shouldKeep = d / Math.pow(2, 63) <= requestedFraction;
```
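
Putting the two steps together, a minimal Scala sketch of this scheme could look like the following (the digest-to-long conversion via `java.nio.ByteBuffer` and the normalization by 2^63 are assumptions on my side, not something already in this PR):

```scala
import java.nio.ByteBuffer
import java.security.MessageDigest

// Hash the path name with SHA-256, take the first 8 bytes of the digest as a
// long, map its absolute value into [0, 1], and keep the file if it falls
// under the requested fraction.
def shouldKeep(pathname: String, requestedFraction: Double): Boolean = {
  val digest = MessageDigest.getInstance("SHA-256").digest(pathname.getBytes("UTF-8"))
  val l = ByteBuffer.wrap(digest, 0, 8).getLong
  math.abs(l.toDouble) / math.pow(2, 63) <= requestedFraction
}
```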


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150250118
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala 
---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the spark recursive flag and then restores it.
+   *
+   * @param value Value to set
+   * @param spark Existing spark session
+   * @param f The function to evaluate after setting the flag
+   * @return Returns the evaluation result T of the function
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): 
T = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+hadoopConf.set(flagName, value.toString)
+try f finally {
+  old match {
+case Some(v) => hadoopConf.set(flagName, v)
+case None => hadoopConf.unset(flagName)
+  }
+}
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff --

Actually, I take this comment back, since it would break on some 
pathological cases such as all the names being the same. When users want some 
samples, they most probably want a result that is a fraction of the original, 
whatever it may contain.

@jkbradley do you prefer something that may not be deterministic (using random numbers), or something deterministic that does not respect the sampling ratio in pathological cases? The only way I can think of to do both is deduplicating, which requires a shuffle.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150247999
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala 
---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the spark recursive flag and then restores it.
+   *
+   * @param value Value to set
+   * @param spark Existing spark session
+   * @param f The function to evaluate after setting the flag
+   * @return Returns the evaluation result T of the function
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): 
T = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+hadoopConf.set(flagName, value.toString)
+try f finally {
+  old match {
+case Some(v) => hadoopConf.set(flagName, v)
+case None => hadoopConf.unset(flagName)
+  }
+}
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff --

Oh, it would be pretty simple to work it out this way (pseudocode). Note that a hash is a random variable between -2^31 and 2^31-1, so:
```
val fraction: Double = ??? // Something between 0 and 1
val pathname: String = ???
val hash = pathname.hashCode // Could use some other, more robust method that returns a Long.
val shouldKeep: Boolean = math.abs(hash) < (math.pow(2, 31) * fraction)
```


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147661505
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields = Array("origin", "height", "width", "nChannels", 
"mode", "data")
+
+  val ocvTypes = Map(
--- End diff --

Same thing here (and even more important): `Map[String, Int]` if I am not 
mistaken.
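
For illustration only (the actual values are not part of the snippet quoted above), the annotated declaration would look something like:

```scala
val ocvTypes: Map[String, Int] = Map(
  "Undefined" -> -1,
  "CV_8U" -> 0,
  "CV_8UC3" -> 16
)
```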


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147661396
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields = Array("origin", "height", "width", "nChannels", 
"mode", "data")
--- End diff --

nit: since this is public, put the explicit type: `Array[String]`. Scala can otherwise have some surprising behavior here.
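
Concretely, the suggestion amounts to something like this (field names copied from the diff above):

```scala
val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data")
```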


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147661078
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala 
---
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+
+  /**
+   * Sets a value of spark recursive flag.
+   * If value is a None, it unsets the flag.
+   *
+   * @param value value to set
+   * @param spark existing spark session
+   * @return previous value of this flag
+   */
+  def setRecursiveFlag(value: Option[String], spark: SparkSession): 
Option[String] = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+
+value match {
+  case Some(v) => hadoopConf.set(flagName, v)
+  case None => hadoopConf.unset(flagName)
+}
+
+old
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
+  val random = new Random()
+
+  // Ratio of files to be read from disk
+  var sampleRatio: Double = 1
+
+  override def setConf(conf: Configuration): Unit = {
+if (conf != null) {
+  sampleRatio = conf.getDouble(SamplePathFilter.ratioParam, 1)
+}
+  }
+
+  override def accept(path: Path): Boolean = {
+// Note: checking fileSystem.isDirectory is very slow here, so we use 
basic rules instead
+!SamplePathFilter.isFile(path) || random.nextDouble() < sampleRatio
--- End diff --

I would prefer that we do not use a seed and that the result is deterministic, based for example on some hash of the file name, to make it more robust to future code changes. That being said, there are no fundamental issues with the current implementation and other developers may have differing opinions, so the current implementation is fine as far as I am concerned.
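
For illustration, a seedless, hash-based variant of the filter could look roughly like the sketch below. This is only an assumption of what such a change might look like (including the hypothetical `sampleRatio` configuration key), not code from this PR:

```scala
import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.fs.{Path, PathFilter}

// Deterministic variant: the decision depends only on the file name, so the
// same files are sampled on every read of the same directory.
private class DeterministicSamplePathFilter extends Configured with PathFilter {
  var sampleRatio: Double = 1.0

  override def setConf(conf: Configuration): Unit = {
    if (conf != null) {
      sampleRatio = conf.getDouble("sampleRatio", 1.0) // hypothetical config key
    }
  }

  override def accept(path: Path): Boolean = {
    // Directory handling is omitted for brevity; only the sampling decision is
    // sketched. hashCode is an Int in [-2^31, 2^31 - 1], so keep the path if
    // its hash falls in the first `sampleRatio` slice of that range.
    val hash = path.getName.hashCode
    math.abs(hash.toDouble) < math.pow(2, 31) * sampleRatio
  }
}
```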


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-26 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/19439
  
@hhbyyh I recall now the reason for an extra `origin` field, which is to 
get around the standard issue of many small image files in S3 or other 
distributed file systems. It is standard to compact many small images into 
larger zip files, and the original `readImages` implementation could 
recursively traverse zip files to deal with that. This is a feature that we 
would like to add again at some point.

When you compact multiple images into a single zip file, though, the filename is that of the zip file, so having an extra `origin` field is convenient for naming the image correctly. This field is optional and this format is still experimental, so I do not think it is going to be an issue to deprecate this field if it is deemed to be too much trouble.

Here, for example, is a relevant issue that we have in Deep Learning Pipelines, which is very representative of normal scenarios:
https://github.com/databricks/spark-deep-learning/issues/67
The current workaround is suboptimal in terms of performance and user 
experience.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-24 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/19439
  
@hhbyyh regarding the data representation, one could indeed have each of the representations encoded with the proper array information. This brings some additional complexity for complex UDFs, though, because they need to select the proper field, and the target implementations in C++ or TensorFlow can already cast the field to the proper type. I suggest we keep `byte[]` for now and see if there is a need for a more refined representation.

For the `origin` field, @dakirsa  or @imatiach-msft should have more 
context.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-18 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/19439
  
@hhbyyh thank you for bringing up these questions. In response:

> Does the current schema support or plan to support image feature data in 
Floats[] or Doubles[]?

It does, indirectly: this is what the `CV_32FXX` field types are for. You need to do some low-level casting to convert the byte array to an array of numbers, but that should not be a limitation for most applications.
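
As a sketch of that low-level casting (the byte order and the helper itself are assumptions for illustration, not part of the schema):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret the image `data` bytes as 32-bit floats, e.g. when the mode is
// one of the CV_32F* types.
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out)
  out
}
```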

> Correct me if I'm wrong, but I don't see ocvTypes playing any role in the code. If the field is for future extension, maybe it's better to keep only the supported types, not all the OpenCV types.

Indeed. These fields are added so that users know what values are allowed for these fields. A Scala-friendly choice would have been sealed traits or enumerations, but the consensus in Spark has been for a low-level representation. Nothing precludes adding a case class to represent this dataset in the future, with more type-safety information.

> In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224 or smaller). Maybe add an extra parameter "smallSideSize" to the readImages method, which is more convenient for the users and lets us avoid caching the image at its original size (which could be 100 times larger than the scaled image). This can be done in a follow-up PR.

This is a good point that we have all hit. The issue here is that there is no unique definition of rescaling (what interpolation? crop and scale? scale and crop?) and each library has made different choices. This is certainly something that would be good for a future PR.

> Not sure about the reason to include "origin" info into the image data. 
Based on my experience, path info serves better as a separate column in the 
DataFrame. (E.g. prediction)

Yes, this feature has been debated. Some developers have had a compelling need for accessing some information about the origin of the image directly inside the image data.

> IMO the parameter "recursive" may not be necessary. Existing wildcard matching can provide more functionality.

Indeed, this feature may not seem that useful at first glance. For some Hadoop file systems, though, in which images are accessed in a batched manner, it is useful to traverse these batches. This is important for performance reasons, and it is why the flag is marked as experimental for the time being.

> For scala API, ImageSchema should be in a separate file but not to be 
mixed with image reading.

I do not have a strong opinion about this point; I will let other developers decide.


---




[GitHub] spark pull request #19156: [SPARK-19634][FOLLOW-UP][ML] Improve interface of...

2017-09-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19156#discussion_r137603986
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -109,31 +108,47 @@ object Summarizer extends Logging {
   }
 
   @Since("2.3.0")
-  def mean(col: Column): Column = getSingleMetric(col, "mean")
+  def mean(col: Column, weightCol: Column = lit(1.0)): Column = {
--- End diff --

I am not a fan of default parameters; they tend to cause issues with binary compatibility. Unless you have a good reason, you should have two different functions:

```scala
def mean(col: Column): Column = mean(col, lit(1.0))
def mean(col: Column, weightCol: Column): Column = ...
```


---




[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-15 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/18798
  
Thank you @yanboliang.


---



[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-10 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/18798
  
@yanboliang do you feel comfortable merging this PR? I think that all the questions have been addressed.


---



[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r131971123
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,587 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.3.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.3.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.3.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
--- End diff --

Put a comment about performance here.


---



[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-08 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r131970836
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,587 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.3.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.3.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.3.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.3.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.3.0")
--- End diff --

Let's put a comment about performance to indicate that it is about 3x 
slower than using the RDD interface. 


---

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130742319
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130742836
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130741880
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130742759
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130742524
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that it turns computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector tha contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficients
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidian norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130742933
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that in turn computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector that contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficient.
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidean norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130743131
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
@@ -0,0 +1,619 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.{SparkException, SparkFunSuite}
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => 
OldVectors}
+import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, 
Statistics}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
+
+class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  import testImplicits._
+  import Summarizer._
+  import SummaryBuilderImpl._
+
+  private case class ExpectedMetrics(
+  mean: Seq[Double],
+  variance: Seq[Double],
+  count: Long,
+  numNonZeros: Seq[Long],
+  max: Seq[Double],
+  min: Seq[Double],
+  normL2: Seq[Double],
+  normL1: Seq[Double])
+
+  // The input is expected to be either a sparse vector, a dense vector or 
an array of doubles
+  // (which will be converted to a dense vector)
+  // The expected result is the list of all the known metrics.
+  //
+  // The tests take a list of input vectors and a list of all the summary 
values that
+  // are expected for this input. They currently test against some fixed 
subset of the
+  // metrics, but should be made fuzzy in the future.
+
+  private def testExample(name: String, input: Seq[Any], exp: 
ExpectedMetrics): Unit = {
+def inputVec: Seq[Vector] = input.map {
+  case x: Array[Double @unchecked] => Vectors.dense(x)
+  case x: Seq[Double @unchecked] => Vectors.dense(x.toArray)
+  case x: Vector => x
+  case x => throw new Exception(x.toString)
+}
+
+val s = {
+  val s2 = new MultivariateOnlineSummarizer
+  inputVec.foreach(v => s2.add(OldVectors.fromML(v)))
+  s2
+}
+
+// Because the Spark context is reset between tests, we cannot hold a 
reference to it.
+def wrapped() = {
+  val df = sc.parallelize(inputVec).map(Tuple1.apply).toDF("features")
+  val c = df.col("features")
+  (df, c)
+}
+
+registerTest(s"$name - mean only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("mean").summary(c), mean(c)), 
Seq(Row(exp.mean), s.mean))
+}
+
+registerTest(s"$name - mean only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(mean(c)), Seq(exp.mean))
+}
+
+registerTest(s"$name - variance only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("variance").summary(c), variance(c)),
+Seq(Row(exp.variance), s.variance))
+}
+
+registerTest(s"$name - variance only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(variance(c)), Seq(s.variance))
+}
+
+registerTest(s"$name - count only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("count").summary(c), count(c)),
+Seq(Row(exp.count), exp.count))
+}
+
+registerTest(s"$name - count only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(count(c)),
+Seq(exp.count))
+}
+
+registerTest(s"$name - numNonZeros only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("numNonZ

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130741348
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
--- End diff --

this is not going to be 2.2 anymore


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17419
  
I am going to close this PR, since this is being taken over by 
@WeichenXu123 in #18798.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread thunterdb
Github user thunterdb closed the pull request at:

https://github.com/apache/spark/pull/17419


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/18798
  
cc @hvanhovell as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/18798
  
Thank you for the performance numbers @WeichenXu123 , I have a couple of 
comments:
 - you say that SQL uses adaptive compaction. How bad is that? I assume it 
adds some overhead.
 - did you just run each experiment once? I would be interested in error 
bars on these numbers, as it can take up to 30 seconds for the JVM to warm up 
and optimize the byte code. You should report the geometric mean or the median 
time of running these experiments to make sure that you are not skewed by outliers. 
Some others will probably have some good advice as well.
 - from the performance numbers, there are 2 different regimes: small 
vectors, and big vectors (for which even the DataFrame -> RDD conversion is 
faster than working straight with DataFrames). I would be curious to know the 
bottlenecks for each case.

If we trust these numbers, the overall conclusion is that the SQL interface 
adds a 2x-3x performance overhead over RDDs for the time being. @cloud-fan 
@liancheng are there still some low-hanging fruit that could be merged into 
SQL? 

This state of affairs is of course far from great, but I am in favor of 
merging this piece and improving it iteratively with the help of the SQL team, as 
this code is easy to benchmark and representative of the rest of MLlib, once we 
start to rely more on DataFrames and Catalyst, and less on RDDs.

@yanboliang @viirya @kiszk what are your thoughts?
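
For what it is worth, here is a minimal sketch of the measurement loop I have 
in mind (a hypothetical helper, not the benchmark code in this PR): warm the 
JVM up first, then report the median over several trials rather than a single 
run.

```scala
// Hypothetical micro-benchmark helper: discard warm-up runs so the JIT can
// optimize the hot path, then report the median to limit the effect of outliers.
object MedianTimer {
  def medianTimeMs(warmups: Int, trials: Int)(f: => Unit): Double = {
    (0 until warmups).foreach(_ => f)
    val timesMs = (0 until trials).map { _ =>
      val start = System.nanoTime()
      f
      (System.nanoTime() - start) / 1e6
    }.sorted
    timesMs(timesMs.length / 2)
  }
}

// Dummy workload, only to show the call pattern.
val ms = MedianTimer.medianTimeMs(warmups = 5, trials = 20) {
  (1 to 1000000).map(math.sqrt(_)).sum
}
println(f"median: $ms%.2f ms")
```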


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-08-01 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/18798
  
@WeichenXu123 thanks! Can you post some performance numbers as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18281: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-06-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r121790182
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -325,8 +343,13 @@ final class OneVsRest @Since("1.4.0") (
   multiclassLabeled.persist(StorageLevel.MEMORY_AND_DISK)
 }
 
+val iters = Range(0, numClasses).par
--- End diff --

@jkbradley thanks for calling this out. Indeed, the code as it stands may 
cause some non-deterministic issues in complex environments. I put a comment 
about that in that PR:
https://github.com/apache/spark/pull/16774
See how it is done in this file:

https://github.com/apache/spark/pull/16774/files#diff-d14539cadce0fba9d8b7d970adaf8b26

It should be a quick change.
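
To make the concern concrete, here is a rough sketch (all names and values are 
placeholders, not the code from either PR) of driving the per-class trainings 
through an explicit, bounded thread pool instead of the default parallel 
collections pool, so that the degree of parallelism is deterministic:

```scala
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BoundedOneVsRestSketch {
  def main(args: Array[String]): Unit = {
    val numClasses = 10   // placeholder for the real number of classes
    val parallelism = 4   // placeholder for a user-tunable parallelism param
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Stand-in for the real per-class training call.
    def trainBinaryModel(label: Int): String = s"model-$label"

    // All trainings are submitted, but at most `parallelism` run concurrently.
    val futures = (0 until numClasses).map(label => Future(trainBinaryModel(label)))
    val models = futures.map(f => Await.result(f, Duration.Inf))
    println(models.mkString(", "))

    pool.shutdown()
  }
}
```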


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-03-30 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17419#discussion_r109063248
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,746 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import breeze.{linalg => la}
+import breeze.linalg.{Vector => BV}
+import breeze.numerics
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, 
Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData, UnsafeProjection, UnsafeRow}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.types._
+
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param column a column that contains Vector object.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(column: Column): Column
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that in turn computes 
metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector that contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each 
coefficient.
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidean norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute 
values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric) ++ metrics)
+new SummaryBuilderImpl(typedMetrics, computeMetrics)
+  }
+
+  def mean(col: Column): Column = ge

[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-30 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17419
  
I looked a bit deeper into the performance aspect. Here are some quick 
insights:
 - there was an immediate bottleneck in `VectorUDT`; fixing it already boosts 
performance by 3x
 - it is not clear if switching to pure Breeze operations helps given the 
overhead for tiny vectors. I will need to do more analysis on larger vectors.
 - now, most of the time is roughly split between 
`ObjectAggregationIterator.processInputs` (40%), some codegen'ed expression 
(20%) and our own `MetricsAggregate.update` (35%)

That benchmark focuses on the overhead of catalyst. I will do another 
benchmark with dense vectors to see how it fares in practice with more real 
data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17419: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-03-29 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17419#discussion_r108743634
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
@@ -335,4 +335,65 @@ class SummarizerSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(Buffer.totalCount(summarizer) === 6)
   }
 
+  // TODO: this test should not be committed. It is here to isolate some 
performance hotspots.
+  test("perf test") {
+val n = 1000
+val rdd1 = sc.parallelize(1 to n).map { idx =>
+  OldVectors.dense(idx.toDouble)
+}
+val trieouts = 10
--- End diff --

I did not know about that class, thanks. It should stabilize the results.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17419
  
I have added a small perf test to find the performance bottlenecks. Note 
that this test works on the worst case (vectors of size 1) from the perspective 
of overhead. Here are the numbers I currently get. I will profile the code to 
see if there are some obvious targets for optimization:

```
[min ~ median ~ max], higher is better:

RDD = [2482 ~ 46150 ~ 48354] records / milli
dataframe (variance only) = [4217 ~ 4557 ~ 4848] records / milli
dataframe = [2887 ~ 4420 ~ 4717] records / milli
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-03-27 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17419
  
@sethah it would have been nice, but I do not think we should merge it this 
late into the release cycle.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17419: [SPARK-19634][ML][WIP] Multivariate summarizer - ...

2017-03-24 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/17419

[SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API

## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, 
variance, etc.). In addition to all the features of 
`MultivariateOnlineSummarizer`, it also allows the user to select a subset of 
the metrics. This should resolve some performance issues related to computing 
unrequested metrics.

Furthermore, it uses the BLAS API to the extent possible, so that the given 
code should be efficient for the dense case.
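
To make the intended usage concrete, here is a small sketch (the column name, 
the metric set and the local SparkSession are illustrative only):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("summarizer-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))
  .map(Tuple1.apply).toDF("features")

// Only the requested metrics are computed; the result is a single struct column.
val summary = df
  .select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .first()
println(summary)

spark.stop()
```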

## How was this patch tested?

This patch includes most of the tests of the RDD-based implementation. It 
compares results against the existing `MultivariateOnlineSummarizer` and adds 
more tests.

This patch also includes some documentation for some low-level constructs 
such as `TypedImperativeAggregate`.

## Performance

I have not run tests against the existing implementation. However, this 
patch uses the recommended low-level SQL APIs, so it should be interesting to 
compare both implementations in that respect.

## WIP

Marked as WIP because some debugging comments are still present in the code.

Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark 19634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17419


commit f3fa6580bca70f3307d70e938ef8531c928d958b
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-03T18:36:02Z

work

commit 7539835dad863a6b73d88d79983342f9ddb7fb9d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-06T22:38:41Z

work on the test suite

commit 673943f334b94e5d1ecd8874cb82bbc875d739c6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-07T00:01:30Z

last work

commit 202b672afec127f4e0885cf3a58f4dfc97031fc6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-13T22:48:47Z

work on using imperative aggregators

commit be019813f241d0ad3559b4d84339f1bb1055cbc4
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-17T21:44:40Z

Merge remote-tracking branch 'upstream/master' into 19634

commit a983284cfeddabd017792e3991cf99a7d3ab1e16
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-18T00:14:40Z

more work on summarizer

commit 647a4fecb17d478c3c8cd68d40f2a9456eb10c66
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T17:47:30Z

work

commit 3c4bef772a3cbc759e43223af658a357c5ca6bc2
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:16Z

changes

commit 56390ccc456c67b2f7a08c1271fa50408518da0f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:19Z

Merge remote-tracking branch 'upstream/master' into 19634

commit c3f236c4422031ae818cb6bbec2415b3f1bf7b70
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T19:03:07Z

cleanup

commit ef955c00275705f14342f3e4ed970a78f0f3c141
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T22:42:42Z

debugging

commit a04f923913ca1118a61d66bd53b8514af62594d7
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T23:14:23Z

work

commit 946d490c8b29e55ec0e6d40785122269063894ad
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:14:29Z

Merge remote-tracking branch 'upstream/master' into 19634

commit 201eb7712054967cd5093d3a908f4ebbd73f30a8
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:19:57Z

debug

commit f4dec88a49d0a20e1b328617fd721633fd8c201a
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T18:27:19Z

trying to debug serialization issue

commit 4af0f47d326ef91d7cf9ccaf6a45ee3f904b191f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T23:16:10Z

better tests

commit 9f29030f75089884156bdc4ee634857b3730114d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T00:12:28Z

changes

commit e9877dc2f08d393f079bdf6fbbf1b9b9abaa21da
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T21:04:32Z

debugging

commit 3a11d0265ef665a63cd070eeb1ae4ac25bc89908
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T22:14:06Z

more tests and debugging

commit 6d26c17d0bd4ab18d564ee7f37916780211702d5
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T23:12:19Z

fix

[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...

2017-03-23 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17108
  
Tickets created:
 - https://issues.apache.org/jira/browse/SPARK-20076
 - https://issues.apache.org/jira/browse/SPARK-20077


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505718
  
--- Diff: 
mllib-local/src/test/scala/org/apache/spark/ml/util/TestingUtils.scala ---
@@ -32,6 +32,10 @@ object TestingUtils {
* the relative tolerance is meaningless, so the exception will be 
raised to warn users.
*/
   private def RelativeErrorComparison(x: Double, y: Double, eps: Double): 
Boolean = {
+// Special case for NaNs
--- End diff --

@jkbradley I do not think this change is going to be controversial, but I 
want to point out that from now on, matrix/vector checks will not _always_ 
throw errors when comparing `NaN`: the previous code would throw whenever a NaN 
was found.
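
To spell out the new behavior, here is a simplified stand-in (not the actual 
`TestingUtils` code) for the comparison semantics being described:

```scala
// Sketch only: two NaNs now count as approximately equal, a single NaN makes the
// comparison fail, and everything else goes through a plain relative-tolerance
// check instead of throwing as soon as a NaN is seen.
def relativeApproxEqual(x: Double, y: Double, eps: Double): Boolean = {
  if (x.isNaN && y.isNaN) true
  else if (x.isNaN || y.isNaN) false
  else if (x == y) true
  else math.abs(x - y) / math.max(math.abs(x), math.abs(y)) < eps
}

assert(relativeApproxEqual(Double.NaN, Double.NaN, 1e-6))
assert(!relativeApproxEqual(Double.NaN, 1.0, 1e-6))
assert(relativeApproxEqual(1.0, 1.0 + 1e-9, 1e-6))
```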


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505201
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala 
---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to spark.ml's Vector types.
+ */
+@Since("2.2.0")
+@Experimental
+object Correlations {
--- End diff --

sure, I do not know if there is a convention for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505212
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala 
---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505215
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala 
---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
--- End diff --

done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505185
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala 
---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to spark.ml's Vector types.
+ */
+@Since("2.2.0")
+@Experimental
+object Correlations {
+
+  /**
+   * Compute the correlation matrix for the input RDD of Vectors using the 
specified method.
+   * Methods currently supported: `pearson` (default), `spearman`.
+   *
+   * @param dataset A dataset or a dataframe
+   * @param column The name of the column of vectors for which the 
correlation coefficient needs
+   *   to be computed. This must be a column of the dataset, 
and it must contain
+   *   Vector objects.
+   * @param method String specifying the method to use for computing 
correlation.
+   *   Supported: `pearson` (default), `spearman`
+   * @return A dataframe that contains the correlation matrix of the 
column of vectors. This
+   * dataframe contains a single row and a single column of name
+   * '$METHODNAME($COLUMN)'.
+   * @throws IllegalArgumentException if the column is not a valid column 
in the dataset, or if
+   *  the content of this column is not of 
type Vector.
+   *
+   *  Here is how to access the correlation coefficient:
+   *  {{{
+   *val data: Dataset[Vector] = ...
+   *val Row(coeff: Matrix) = Statistics.corr(data, "value").head
+   *// coeff now contains the Pearson correlation matrix.
+   *  }}}
+   *
+   * @note For Spearman, a rank correlation, we need to create an 
RDD[Double] for each column
+   * and sort it in order to retrieve the ranks and then join the columns 
back into an RDD[Vector],
+   * which is fairly costly. Cache the input RDD before calling corr with 
`method = "spearman"` to
+   * avoid recomputing the common lineage.
+   */
+  @Since("2.2.0")
+  def corr(dataset: Dataset[_], column: String, method: String): DataFrame 
= {
+val rdd = dataset.select(column).rdd.map {
+  case Row(v: Vector) => OldVectors.fromML(v)
+}
+val oldM = OldStatistics.corr(rdd, method)
+val name = s"$method($column)"
+val schema = StructType(Array(StructField(name, 
SQLDataTypes.MatrixType, nullable = true)))
--- End diff --

Good point. It seems that Spark can be quite liberal with the nullability.
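
For reference, the stricter variant is just a matter of flipping the flag when 
the schema is built; a hypothetical sketch (the column name is a placeholder):

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

// corr() always produces a matrix, so the result column could be declared non-nullable.
val name = "pearson(features)"  // placeholder for s"$method($column)"
val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = false)))
```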


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r107505180
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/LinalgUtils.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.util
+
+import breeze.linalg.{Matrix => BM}
+
+import org.apache.spark.internal.Logging
+
+/**
+ * Utility test methods for linear algebra.
+ */
+object LinalgUtils extends Logging {
--- End diff --

You are right, I had missed that file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16483
  
It looks good to me.

cc @jkbradley or @mengxr for final approval


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-17 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106746316
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else 
newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

I agree that the new code is easier to follow in that respect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16483
  
In addition, this introduces an extra reduction step at each iteration. I 
am fine with that since it is for correctness, but @jkbradley may want to 
comment as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106529377
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -353,9 +365,19 @@ object PageRank extends Logging {
 vertexProgram(id, attr, msgSum)
 }
 
-Pregel(pagerankGraph, initialMessage, activeDirection = 
EdgeDirection.Out)(
+val rankGraph = Pregel(pagerankGraph, initialMessage, activeDirection 
= EdgeDirection.Out)(
   vp, sendMessage, messageCombiner)
   .mapVertices((vid, attr) => attr._1)
-  } // end of deltaPageRank
+
+// If the graph has sinks (vertices with no outgoing edges) the sum of 
ranks will not be correct
--- End diff --

This is the same code as above, please factor it into a function.
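
Something along these lines is what I have in mind, as a rough sketch only (the 
name is made up, and the normalization target should match whatever the 
duplicated block computes):

```scala
import org.apache.spark.graphx.Graph

// Hypothetical helper: rescale the ranks so that their sum matches the expected
// total, compensating for the rank mass absorbed by sink vertices.
def normalizeRankSum(rankGraph: Graph[Double, Double], expectedSum: Double): Graph[Double, Double] = {
  val rankSum = rankGraph.vertices.values.sum()
  val correction = expectedSum / rankSum
  rankGraph.mapVertices((_, rank) => rank * correction)
}
```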


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106532078
  
--- Diff: 
graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with 
LocalSparkContext {
   val nVertices = 100
   val starGraph = GraphGenerators.starGraph(sc, nVertices).cache()
   val resetProb = 0.15
+  val tol = 0.0001
+  val numIter = 2
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 3, 
resetProb).vertices.cache()
+  val staticRanks = starGraph.staticPageRank(numIter, 
resetProb).vertices.cache()
+  val staticRanks2 = starGraph.staticPageRank(numIter + 1, 
resetProb).vertices
 
-  // Static PageRank should only take 3 iterations to converge
-  val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, 
pr1, pr2) =>
+  // Static PageRank should only take 2 iterations to converge
--- End diff --

Why does it take only two iterations to converge now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106535595
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else 
newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

My memory of the algorithm is a bit rusty. Why don't you need to check for 
self-loops here anymore?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106528007
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -162,7 +162,15 @@ object PageRank extends Logging {
   iteration += 1
 }
 
-rankGraph
+// If the graph has sinks (vertices with no outgoing edges) the sum of 
ranks will not be correct
--- End diff --

Please put the name of the ticket in the comment as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r106309333
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
 ---
@@ -245,7 +245,7 @@ object ApproximatePercentile {
 val result = new Array[Double](percentages.length)
 var i = 0
 while (i < percentages.length) {
-  result(i) = summaries.query(percentages(i))
+  result(i) = summaries.query(percentages(i)).get
--- End diff --

Yes, this is correct. But you should leave a comment, since it is not 
obvious.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17108: [SPARK-19636][ML] Feature parity for correlation statist...

2017-03-15 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17108
  
I moved the `Correlations` code as suggested. @imatiach-msft, I addressed 
your comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106307446
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import breeze.linalg.{DenseMatrix => BDM, Matrix => BM}
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.Matrix
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Row}
+
+
+class StatisticsSuite extends SparkFunSuite with MLlibTestSparkContext 
with Logging {
+
+  import StatisticsSuite._
+
+  val xData = Array(1.0, 0.0, -2.0)
+  val yData = Array(4.0, 5.0, 3.0)
+  val zeros = new Array[Double](3)
+  val data = Seq(
+Vectors.dense(1.0, 0.0, 0.0, -2.0),
+Vectors.dense(4.0, 5.0, 0.0, 3.0),
+Vectors.dense(6.0, 7.0, 0.0, 8.0),
+Vectors.dense(9.0, 0.0, 0.0, 1.0)
+  )
+
+  private def X = 
spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+
+  private def extract(df: DataFrame): BDM[Double] = {
+val Array(Row(mat: Matrix)) = df.collect()
+mat.asBreeze.toDenseMatrix
+  }
+
+
+  test("corr(X) default, pearson") {
+val defaultMat = Statistics.corr(X, "features")
+val pearsonMat = Statistics.corr(X, "features", "pearson")
+// scalastyle:off
+val expected = BDM(
+  (1., 0.05564149, Double.NaN, 0.4004714),
+  (0.05564149, 1., Double.NaN, 0.9135959),
+  (Double.NaN, Double.NaN, 1., Double.NaN),
+  (0.40047142, 0.91359586, Double.NaN, 1.000))
+// scalastyle:on
+
+assert(matrixApproxEqual(extract(defaultMat), expected))
+assert(matrixApproxEqual(extract(pearsonMat), expected))
+  }
+
+  test("corr(X) spearman") {
+val spearmanMat = Statistics.corr(X, "features", "spearman")
+// scalastyle:off
+val expected = BDM(
+  (1.000,  0.1054093,  Double.NaN, 0.400),
+  (0.1054093,  1.000,  Double.NaN, 0.9486833),
+  (Double.NaN, Double.NaN, 1., Double.NaN),
+  (0.400,  0.9486833,  Double.NaN, 1.000))
+// scalastyle:on
+assert(matrixApproxEqual(extract(spearmanMat), expected))
+  }
+
+}
+
+
+object StatisticsSuite extends Logging {
+
+  def approxEqual(v1: Double, v2: Double, threshold: Double = 1e-6): 
Boolean = {
--- End diff --

Moved





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106307517
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to MLlib's Vector types.
+ */
+@Since("2.2.0")
--- End diff --

Good point, thanks





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106306502
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/stat/StatisticsSuite.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import breeze.linalg.{DenseMatrix => BDM, Matrix => BM}
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.Matrix
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Row}
+
+
+class StatisticsSuite extends SparkFunSuite with MLlibTestSparkContext 
with Logging {
+
+  import StatisticsSuite._
+
+  val xData = Array(1.0, 0.0, -2.0)
+  val yData = Array(4.0, 5.0, 3.0)
+  val zeros = new Array[Double](3)
+  val data = Seq(
+Vectors.dense(1.0, 0.0, 0.0, -2.0),
+Vectors.dense(4.0, 5.0, 0.0, 3.0),
+Vectors.dense(6.0, 7.0, 0.0, 8.0),
+Vectors.dense(9.0, 0.0, 0.0, 1.0)
+  )
+
+  private def X = 
spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+
+  private def extract(df: DataFrame): BDM[Double] = {
+val Array(Row(mat: Matrix)) = df.collect()
+mat.asBreeze.toDenseMatrix
+  }
+
+
+  test("corr(X) default, pearson") {
+val defaultMat = Statistics.corr(X, "features")
+val pearsonMat = Statistics.corr(X, "features", "pearson")
+// scalastyle:off
--- End diff --

The problem is the alignment of the values, which we achieve by padding with `0`'s.





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106306385
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to MLlib's Vector types.
+ */
+@Since("2.2.0")
+object Statistics {
+
+  /**
+   * Compute the correlation matrix for the input RDD of Vectors using the 
specified method.
+   * Methods currently supported: `pearson` (default), `spearman`.
+   *
+   * @param dataset a dataset or a dataframe
+   * @param column the name of the column of vectors for which the 
correlation coefficient needs
+   *   to be computed. This must be a column of the dataset, 
and it must contain
+   *   Vector objects.
+   * @param method String specifying the method to use for computing 
correlation.
+   *   Supported: `pearson` (default), `spearman`
+   * @return A dataframe that contains the correlation matrix of the 
column of vectors. This
+   * dataframe contains a single row and a single column of name
+   * '$METHODNAME($COLUMN)'.
+   * @throws IllegalArgumentException if the column is not a valid column 
in the dataset, or if
+   *  the content of this column is not of 
type Vector.
+   *
+   *  Here is how to access the correlation coefficient:
+   *  {{{
+   *val data: Dataset[Vector] = ...
+   *val Row(coeff: Matrix) = Statistics.corr(data, "value").head
+   *// coeff now contains the Pearson correlation matrix.
+   *  }}}
+   *
+   * @note For Spearman, a rank correlation, we need to create an 
RDD[Double] for each column
+   * and sort it in order to retrieve the ranks and then join the columns 
back into an RDD[Vector],
+   * which is fairly costly. Cache the input RDD before calling corr with 
`method = "spearman"` to
+   * avoid recomputing the common lineage.
+   */
+  // TODO: how do we handle missing values?
--- End diff --

Good point. I will remove the comment for now, since this should be decided in JIRA rather than during the implementation.





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106306111
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to MLlib's Vector types.
+ */
+@Since("2.2.0")
+object Statistics {
+
+  /**
+   * Compute the correlation matrix for the input RDD of Vectors using the 
specified method.
+   * Methods currently supported: `pearson` (default), `spearman`.
+   *
+   * @param dataset a dataset or a dataframe
--- End diff --

oh yes, thank you. I am correcting the other instances of course.





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-03-15 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r106305822
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Statistics.scala ---
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and 
Datasets.
+ *
+ * The functions in this package generalize the functions in 
[[org.apache.spark.sql.Dataset.stat]]
+ * to MLlib's Vector types.
--- End diff --

I will use `spark.ml`, which is the most correct terminology.





[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

2017-03-14 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17110
  
@jkbradley LGTM, thanks!





[GitHub] spark issue #17215: [MINOR][ML] Improve MLWriter overwrite error message

2017-03-13 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/17215
  
LGTM





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104813864
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test 
chi-squared]]
+ * during binary classification.
+ */
+@Since("2.0.0")
+@Experimental
+object ChiSquared extends Impurity {
+  private object CSTest extends 
org.apache.commons.math3.stat.inference.ChiSquareTest()
--- End diff --

why not a private `val`?
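i.e. something along these lines (a sketch):

```scala
// A private val is enough here and avoids the extra singleton class that
// `private object CSTest extends ChiSquareTest()` generates.
private val chiSqTest = new org.apache.commons.math3.stat.inference.ChiSquareTest()
```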





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104813302
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite
 compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], 
numClasses)
   }
 
+  test("split quality using chi-squared and minimum gain") {
+// Generate a data set where the 1st feature is useful and the others 
are noise
+val features = Vector.fill(200) {
+  Array.fill(3) { scala.util.Random.nextInt(2).toDouble }
+}
+val labels = features.map { fv =>
+  LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv))
+}
+val rdd = sc.parallelize(labels)
+
+// two-class learning problem
+val numClasses = 2
+// all binary features
+val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 
2) } : _*)
+
+// Chi-squared split quality with a p-value threshold of 0.01 should 
allow
+// only the first feature to be used since the others are uncorrelated 
noise
+val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, 
numClasses)
+val dt = new DecisionTreeClassifier()
+  .setImpurity("chisquared")
+  .setMaxDepth(5)
+  .setMinInfoGain(0.01)
+val treeModel = dt.fit(train)
+
+// The tree should use exactly one of the 3 features: featue(0)
--- End diff --

nit: feature





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812803
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

You cannot guarantee compatibility with existing code here, since adding the method breaks binary compatibility either way.





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812596
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
+throw new UnsupportedOperationException("Impurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false
+}
+
+/**
+ * :: DeveloperApi ::
+ * Utility functions for Impurity measures
+ */
+@Since("2.0.0")
+@DeveloperApi
--- End diff --

there is no need for this object to be publicly exposed?
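For instance, something like this (the object name is only illustrative) would keep it out of the public API:

```scala
// Package-private: these helpers are an implementation detail of the impurity
// measures, so no @DeveloperApi annotation is needed.
private[spark] object ImpurityUtils {
  // utility functions for impurity measures go here
}
```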





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812468
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

Scala: do not add a default implementation; it causes issues with Java compatibility (see the sketch below).
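That is, keep the new method abstract in the trait (a minimal sketch, omitting the annotations):

```scala
// No default body: Java implementers must then provide the method explicitly,
// and we avoid the trait default-method encoding that varies across Scala versions.
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double
```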





[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812484
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
+throw new UnsupportedOperationException("Impurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false
--- End diff --

Scala: do not add a default implementation; it causes issues with Java compatibility.





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-02-28 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17108#discussion_r103596760
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala 
---
@@ -0,0 +1,25 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+/**
+ *
--- End diff --

oops, sorry, removing this file.





[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...

2017-02-28 Thread thunterdb
GitHub user thunterdb opened a pull request:

https://github.com/apache/spark/pull/17108

[SPARK-19636][ML] Feature parity for correlation statistics in MLlib

## What changes were proposed in this pull request?

This patch adds the Dataframes-based support for the correlation statistics 
found in the `org.apache.spark.mllib.stat.correlation.Statistics`, following 
the design doc discussed in the JIRA ticket.

The current implementation is a simple wrapper around the `spark.mllib` 
implementation. Future optimizations can be implemented at a later stage.

## How was this patch tested?

```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thunterdb/spark 19636

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17108.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17108


commit 42c26bdd5a3ac24bbe3a4101ee252c072373efe6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-02-23T23:44:55Z

changes

commit d9f6a6c6c2457920fbf1002da5da62ff7d29b46c
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-01T00:31:19Z

commit

commit 7d4ccfef4e6d3a7b65c3cca149afb250414aea4c
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-01T00:31:34Z

Cleanup







[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-27 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15770
  
Note that any of these formats would cause trouble for a graph with a high-centrality vertex (Lady Gaga in the Twitter graph). That being said, I do not have a strong opinion as to which option we pick, in order to move things along.





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-02-22 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16971
  
@zhengruifeng thanks for looking into this issue. I have one comment above.





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102592174
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala 
---
@@ -78,7 +80,12 @@ object StatFunctions extends Logging {
 def apply(summaries: Array[QuantileSummaries], row: Row): 
Array[QuantileSummaries] = {
   var i = 0
   while (i < summaries.length) {
-summaries(i) = summaries(i).insert(row.getDouble(i))
+if (!row.isNullAt(i)) {
--- End diff --

Thank you for fixing this issue, it was an oversight in my original implementation.

The current exception being thrown depends on an implementation detail (calling `sampled.head`). Can you modify the function `def query` below to explicitly throw an exception if `sampled` is empty, and document this behavior in that function? This way, we will not forget it if we decide to change the semantics of that class.

As @MLnick was mentioning above, it would be preferable to eventually return an Option, null, or NaN, but this can wait for more consensus.
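To illustrate the first point, here is a minimal stand-in sketch (not the real `QuantileSummaries` code) of the behavior I am asking for: make the empty case an explicit, documented failure instead of an accidental one from `sampled.head`.

```scala
// Stand-in class, for illustration only.
class SummarySketch(sampled: Array[Double]) {
  /** @throws IllegalArgumentException if no values have been inserted. */
  def query(quantile: Double): Double = {
    require(quantile >= 0.0 && quantile <= 1.0, "quantile must be in [0, 1]")
    require(sampled.nonEmpty, "cannot query an empty summary")
    // Crude nearest-rank lookup, just to keep the sketch runnable.
    val sorted = sampled.sorted
    sorted(math.min(sorted.length - 1, (quantile * sorted.length).toInt))
  }
}
```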





[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...

2017-02-22 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16971#discussion_r102589719
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -89,18 +89,17 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities of each 
column
*
-   * @note Rows containing any null or NaN values will be removed before 
calculation. If
-   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty, or all rows in some column contain null or 
NaN, null is returned.
*
* @since 2.2.0
*/
   def approxQuantile(
   cols: Array[String],
   probabilities: Array[Double],
   relativeError: Double): Array[Array[Double]] = {
-// TODO: Update NaN/null handling to keep consistent with the 
single-column version
 try {
-  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): 
_*).na.drop(), cols,
+  StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*), 
cols,
 probabilities, relativeError).map(_.toArray).toArray
 } catch {
   case e: NoSuchElementException => null
--- End diff --

+1. I tend to think that the result should be `NaN` (following the IEEE convention) or `null` (following the Scala Option convention). But pending a resolution, I am fine with throwing an exception because it is the most conservative behavior (stopping the computation).





[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15770
  
@wangmiao1981 yes I had seen the discussions there. I believe that 
eventually PIC should be moved into graphframes, but we can have a simple API 
in `spark.ml` for the time being.





[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15770
  
You are right, I had forgotten that for this algorithm, the input is the edges, and the output is the label for each of the vertices.

This is a tricky algorithm to expose as a transformer, since it does not follow the usual convention that data should only be appended to the dataframe. I suggest we follow the same pattern as ALS rather than the mllib implementation of PIC (see the sketch below):
 - let's make it an estimator that returns a model: the model contains the labels for each of the points in a dataframe (the current output of transform)
 - the model's transform method now takes points with an id, and joins them with the model to append a column of labels. This is the same as ALS.

If we do not follow this pattern, then the model selection algorithms are not going to work. What do you think?
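Concretely, usage would look roughly like this (the dataframe names are only illustrative, and `fit`/`transform` describe the proposed shape, not what the PR currently does):

```scala
// Hypothetical sketch of the estimator/model pattern described above.
val pic = new PowerIterationClustering()
  .setK(2)
  .setMaxIter(20)

// fit() runs the expensive clustering once, over the whole similarity graph.
val model = pic.fit(similarityDf)

// transform() is now a cheap join on the id column, exactly like ALS:
// it appends the cluster label as the prediction column.
val clustered = model.transform(pointsDf)
```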





[GitHub] spark issue #14299: Ensure broadcasted variables are destroyed even in case ...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/14299
  
@AnthonyTruchet thank you for the PR. This is definitely worth fixing for large deployments. Now, as you noticed, this portion of code does not quite abide by the best engineering practices... Instead of adding an extra layer of nesting, would you mind making the following changes?

```scala
  def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
    ...
    val expTable = sc.broadcast(createExpTable())
    val bcVocab = sc.broadcast(vocab)
    val bcVocabHash = sc.broadcast(vocabHash)
    try { fit0(expTable, ...) } finally {
      ...
    }
  }

  private final def fit0(...) {
    // Put all the content here.
    // Note that the inner code also includes some broadcasts; you may want to fix these as well if you can.
  }
```

I personally agree about resource management and scala-arm. Unfortunately, we try to keep Scala dependencies to a minimum, because they can be very tedious to move from one Scala version to another.
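For completeness, here is a sketch of how the `finally` block above could be filled in, assuming those three broadcasts are the only ones created before `fit0` (the broadcasts created inside `fit0` would get the same treatment):

```scala
try {
  fit0(expTable, bcVocab, bcVocabHash /* ...plus whatever else fit0 needs */)
} finally {
  // Destroy the broadcasts even if training throws, so executors release the memory.
  expTable.destroy()
  bcVocab.destroy()
  bcVocabHash.destroy()
}
```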





[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16774
  
Thanks for working on this task, this is a much-requested feature. While it will work for simple cases in its current shape, it is going to cause some issues for any complex deployment (Apache Toree, Livy, Databricks, etc.) because the thread pool that controls the computations is not managed. The default assumption with `.par` is a lot of quick tasks. With the current implementation, because the same thread pool is shared across all the parallel collections, users are going to encounter some mysterious freezes in other places while the ML models finish training (I am talking from experience here).

While the situation with `.par` has notably improved with Scala 2.10, it is better to (see the sketch at the end of this message):
 - create a dedicated thread pool for each `.fit`, which users can replace.
 - use futures whose execution context is tied to the thread pool above.
 - not use semaphores, but instead rely on the thread limit at the thread pool level to cap the number of concurrent executions.

If you do not do that, users in a shared environment like any of the above will experience some mysterious freezes depending on what other users are doing. Ideally the default resources should be tied to `SparkSession`, but we can start with a default static pool marked as an experimental API.

More concretely, the API should look like this, I believe:
```
  // Creates an execution context with the given max number of threads.
  def setNumParallelEval(num: Int)
  // Uses the given executor service instead of an executor service shared by all the ML calculations.
  def setExecutionExecutorService(exec: ExecutorService)

See the doc in:
https://twitter.github.io/scala_school/concurrency.html#executor

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html

More about parallel collections:

http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism

http://stackoverflow.com/questions/14214757/what-is-the-benefit-of-using-futures-over-parallel-collections-in-scala
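Finally, a rough sketch of the pattern I have in mind. `est`, `epm`, `trainingDataset` and `numParallelEval` stand for the values already available inside `CrossValidator.fit`; the names are illustrative and none of this is meant as the final code:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Dedicated, bounded pool for this fit() call: its size caps the number of models
// trained concurrently (no semaphore needed), and a caller could substitute their
// own ExecutorService here.
val pool = Executors.newFixedThreadPool(numParallelEval)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Each fit runs on the dedicated pool rather than the global parallel-collection pool.
val futures = epm.map { params => Future { est.fit(trainingDataset, params) } }
val models = futures.map(Await.result(_, Duration.Inf))
pool.shutdown()
```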





[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...

2017-02-17 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16774#discussion_r101834675
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -121,6 +121,33 @@ class CrossValidatorSuite
 }
   }
 
+  test("cross validation with parallel evaluation") {
+val lr = new LogisticRegression
+val lrParamMaps = new ParamGridBuilder()
+  .addGrid(lr.regParam, Array(0.001, 1000.0))
+  .addGrid(lr.maxIter, Array(0, 3))
+  .build()
+val eval = new BinaryClassificationEvaluator
+val cv = new CrossValidator()
+  .setEstimator(lr)
+  .setEstimatorParamMaps(lrParamMaps)
+  .setEvaluator(eval)
+  .setNumFolds(2)
+  .setNumParallelEval(1)
+val cvSerialModel = cv.fit(dataset)
+cv.setNumParallelEval(2)
--- End diff --

This test is not deterministic now. @MLnick, do you know if we have some utilities to retry flaky tests a number of times?
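If there is none, a tiny helper along these lines would do (purely illustrative, not an existing utility):

```scala
// Re-run a flaky block up to `attempts` times, rethrowing the last failure.
def retry[T](attempts: Int)(block: => T): T =
  try block catch {
    case e: Throwable if attempts > 1 => retry(attempts - 1)(block)
  }
```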





[GitHub] spark issue #16973: [SPARKR][EXAMPLES] update examples to stop spark session

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16973
  
These changes look good to me, but my knowledge of R is very limited. 
@mengxr should confirm.





[GitHub] spark issue #16557: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16557
  
I agree, let's break this PR up. It will go faster, and some changes may require longer discussions.





[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/16776
  
Sorry I missed the conversation here. LGTM.





[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-17 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15770
  
@wangmiao1981 thanks a lot! I would be very happy to see that PR in Spark 2.2, and I will gladly help you with that.





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101666251
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala
 ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import scala.collection.mutable
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+
+class PowerIterationClusteringSuite extends SparkFunSuite
+  with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  @transient var data: Dataset[_] = _
+  final val r1 = 1.0
+  final val n1 = 10
+  final val r2 = 4.0
+  final val n2 = 40
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, 
n1, n2)
+  }
+
+  test("default parameters") {
+val pic = new PowerIterationClustering()
+
+assert(pic.getK === 2)
+assert(pic.getMaxIter === 20)
+assert(pic.getInitMode === "random")
+assert(pic.getFeaturesCol === "features")
+assert(pic.getPredictionCol === "prediction")
+assert(pic.getIdCol === "id")
+  }
+
+  test("set parameters") {
+val pic = new PowerIterationClustering()
+  .setK(9)
+  .setMaxIter(33)
+  .setInitMode("degree")
+  .setFeaturesCol("test_feature")
+  .setPredictionCol("test_prediction")
+  .setIdCol("test_id")
+
+assert(pic.getK === 9)
+assert(pic.getMaxIter === 33)
+assert(pic.getInitMode === "degree")
+assert(pic.getFeaturesCol === "test_feature")
+assert(pic.getPredictionCol === "test_prediction")
+assert(pic.getIdCol === "test_id")
+  }
+
+  test("parameters validation") {
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setInitMode("no_such_a_mode")
+}
+  }
+
+  test("power iteration clustering") {
--- End diff --

can you also add a test with a dataframe that has some extra data in it?





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101665899
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+@Since(&

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101664268
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
--- End diff --

Instead of making an 'id' column, which does not convey much information, 
we should follow the example of `K-Means` and call it `prediction`. You already 
include the trait for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101663790
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
--- End diff --

Instead of just validating the schema, we should validate and transform. 
You can follow the example in 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L92
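
For reference, a minimal sketch of that pattern applied here (the method name and checks
follow the linked KMeans code; treat the details as illustrative rather than final):

    /**
     * Validates the input schema and appends the prediction column to it,
     * mirroring KMeans.validateAndTransformSchema.
     */
    protected def validateAndTransformSchema(schema: StructType): StructType = {
      SchemaUtils.checkColumnType(schema, $(idCol), LongType)
      SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
    }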


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662332
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+@Sinc

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662298
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
--- End diff --

No need for this, given the comment above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
--- End diff --

You do not need to write a function as you do below; validating inline here will also
allow more user-friendly error messages in the future.
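
Concretely, a sketch of the inline alternative using the stock `ParamValidators` helper
instead of a hand-written `validateInitMode`:

    @Since("2.2.0")
    final val initMode = new Param[String](this, "initMode",
      "The initialization algorithm. Supported options: 'random' and 'degree'.",
      ParamValidators.inArray(Array("random", "degree")))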


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662038
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
--- End diff --

same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101662018
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
--- End diff --

no need for brackets


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup ...

2016-11-11 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15826
  
@yanboliang that looks great, thank you. LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87267275
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

can you document on line 465 the layout of the data, to help future
developers (all the coefficients, in column-major order, maybe followed by a
last column that contains the intercept)?
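
Something along these lines, for example (the layout described here is an assumption for
the author to confirm, not a statement of what the code currently does):

    // Assumed layout of initialCoefWithInterceptArray: the coefficients are stored
    // in column-major order, one entry per (coefficient set, feature) pair, with the
    // intercepts appended as one extra column at the end when fitIntercept is set.
    // Under that assumption, entry (classIdx, featureIdx) lives at:
    def coefIndex(classIdx: Int, featureIdx: Int, numCoefficientSets: Int): Int =
      featureIdx * numCoefficientSets + classIdx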


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-09 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15683
  
+1 for trying to get it into 2.1 (modulo tests)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15809: [SPARK-18268][ML][MLLib] ALS fail with better message if...

2016-11-09 Thread thunterdb
Github user thunterdb commented on the issue:

https://github.com/apache/spark/pull/15809
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15826#discussion_r87256910
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -110,21 +110,20 @@ class NaiveBayes @Since("1.5.0") (
   @Since("2.1.0")
   def setWeightCol(value: String): this.type = set(weightCol, value)
 
+  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
+trainWithLabelCheck(dataset)
+  }
+
   /**
* ml assumes input labels in range [0, numClasses). But this 
implementation
* is also called by mllib NaiveBayes which allows other kinds of input 
labels
-   * such as {-1, +1}. Here we use this parameter to switch between 
different processing logic.
-   * It should be removed when we remove mllib NaiveBayes.
+   * such as {-1, +1}. `positiveLabel` is used to determine whether the 
label
+   * should be checked and it should be removed when we remove mllib 
NaiveBayes.
*/
-  private[spark] var isML: Boolean = true
-
-  private[spark] def setIsML(isML: Boolean): this.type = {
-this.isML = isML
-this
-  }
-
-  override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
-if (isML) {
+  private[spark] def trainWithLabelCheck(
+  dataset: Dataset[_],
+  positiveLabel: Boolean = true): NaiveBayesModel = {
--- End diff --

Since it is an internal method, I suggest not using a default argument and instead
specifying the value explicitly at the call site above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15826#discussion_r87256702
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") (
 @Since("1.6.0")
 object NaiveBayes extends DefaultParamsReadable[NaiveBayes] {
   /** String name for multinomial model type. */
-  private[spark] val Multinomial: String = "multinomial"
+  private[classification] val Multinomial: String = "multinomial"
 
   /** String name for Bernoulli model type. */
-  private[spark] val Bernoulli: String = "bernoulli"
+  private[classification] val Bernoulli: String = "bernoulli"
 
   /* Set of modelTypes that NaiveBayes supports */
-  private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
+  private[classification] val supportedModelTypes = Set(Multinomial, 
Bernoulli)
+
+  private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: 
Vector) => {
+val values = v match {
+  case sv: SparseVector => sv.values
+  case dv: DenseVector => dv.values
+}
+
+require(values.forall(_ >= 0.0),
+  s"Naive Bayes requires nonnegative feature values but found $v.")
+  }
+
+  private[NaiveBayes] val requireZeroOneBernoulliValues: Vector => Unit = 
(v: Vector) => {
--- End diff --

same thing here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15826: [SPARK-14077][ML][FOLLOW-UP] Minor refactor and c...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15826#discussion_r87256589
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
@@ -226,13 +206,33 @@ class NaiveBayes @Since("1.5.0") (
 @Since("1.6.0")
 object NaiveBayes extends DefaultParamsReadable[NaiveBayes] {
   /** String name for multinomial model type. */
-  private[spark] val Multinomial: String = "multinomial"
+  private[classification] val Multinomial: String = "multinomial"
 
   /** String name for Bernoulli model type. */
-  private[spark] val Bernoulli: String = "bernoulli"
+  private[classification] val Bernoulli: String = "bernoulli"
 
   /* Set of modelTypes that NaiveBayes supports */
-  private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
+  private[classification] val supportedModelTypes = Set(Multinomial, 
Bernoulli)
+
+  private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: 
Vector) => {
--- End diff --

@yanboliang indeed it makes a difference when defining methods inside
another method. In the current case, though, you have made the function a value of
the companion object, so you can safely turn it into a method.
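
A sketch of that change, keeping the body of the current value verbatim:

    private[NaiveBayes] def requireNonnegativeValues(v: Vector): Unit = {
      val values = v match {
        case sv: SparseVector => sv.values
        case dv: DenseVector => dv.values
      }
      require(values.forall(_ >= 0.0),
        s"Naive Bayes requires nonnegative feature values but found $v.")
    }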


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87116504
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExampl

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113033
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
--- End diff --

p2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113839
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

Actually, I would at least mention that `d` is a distance function. It is 
important in the context of LSH.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113728
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

You should either link to the definition of a metric space, or explain what 
M and d are.
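
One possible wording, just as a sketch (not final text for the guide):

    In a metric space `(M, d)`, where `M` is a set of points and `d` is a distance
    function on `M` (for example Euclidean or Jaccard distance), an LSH family is a
    family of functions `h` that satisfy the following properties: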


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


