[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-19 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145845879
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType(StructField("image", StructType([
+StructField(imageFields[0], StringType(),  True),
+StructField(imageFields[1], IntegerType(), False),
+StructField(imageFields[2], IntegerType(), False),
+StructField(imageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields[4], StringType(), False),
--- End diff --

I believe this was changed to IntegerType in scala. Is it possible to 
import this from scala so we don't need to define it in two places?
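A minimal sketch of what that could look like, assuming the Scala object also exposes the full `imageSchema` and is on the JVM classpath (`_parse_datatype_json_string` is pyspark's existing JSON-schema deserializer):

```
from pyspark import SparkContext
from pyspark.sql.types import _parse_datatype_json_string

def _image_schema(sc=None):
    # Fetch the schema defined once in Scala instead of redefining it here;
    # a StructType round-trips across the JVM boundary via its JSON form.
    sc = sc or SparkContext.getOrCreate()
    jschema = sc._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
    return _parse_datatype_json_string(jschema.json())
```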


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-18 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145531800
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCol, HasOutputCols, HasInputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"How to handle invalid data " +
+"Options are 'skip' (filter out rows with invalid data) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
+ *
+ * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
+ * The output vectors are sparse.
+ *
+ * @see `StringIndexer` for converting categorical values into category indices
+ */
+@Since("2.3.0")
+class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
+extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("oneHotEncoder"))
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setInputCols(values: Array[String]): this.type = set(inputCols, values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setOutputCols(values: Array[String]): this.type = set(outputCols, values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setDropLast(value: Boolean): this.type = set(dropLast, value)
+
+  /** @group setParam

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-18 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145501280
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
+   */
+  private val columnSchema = StructType(
--- End diff --

I think we should make this public. I've been working on some code that creates dataframes of images, and it's really useful to have direct access to this schema. Also, the intention here is to encourage the community to standardize on this representation, so I think we should explicitly make it easy to use this schema.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-18 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145499084
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
--- End diff --

I believe you're missing one `Int` above.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-12 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r144365856
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField("origin", StringType, true) ::
+StructField("height", IntegerType, false) ::
+StructField("width", IntegerType, false) ::
+StructField("nChannels", IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField("mode", StringType, false) ::
--- End diff --

After some more thought and conversation I actually think we should use an IntegerType here. There is one issue I had not noticed before that could bite us down the road: the OpenCV string representation for some types is not unique, e.g. "CV_16U" and "CV_16UC1" both map to type 2 (1 channel, 16 bit, unsigned). Having more than one identifier for each type is a potential minefield I think we should avoid.

Alternatively, I think we could stick to using strings if we restrict the supported types and pick only one representation to be valid when there are duplicates.
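A two-line illustration of the aliasing problem in plain Python (entries trimmed to the colliding ones):

```
ocvTypes = {"CV_16U": 2, "CV_16UC1": 2, "CV_16S": 3, "CV_16SC1": 3}
# Inverting the map keeps only one (arbitrary) name per numeric code, so a
# string "mode" read back from data can't be normalized to a unique spelling.
byCode = {code: name for name, code in ocvTypes.items()}
print(byCode)  # {2: 'CV_16UC1', 3: 'CV_16SC1'}
```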


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-12 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r144363062
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+ImageSchema = StructType([
+StructField(ImageFields[0], StringType(),  True),
+StructField(ImageFields[1], IntegerType(), False),
+StructField(ImageFields[2], IntegerType(), False),
+StructField(ImageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(ImageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(ImageFields[5], BinaryType(), False)])
+
+
+# TODO: generalize to other datatypes and number of channels
+def toNDArray(image):
+"""
+Converts an image to a 1-dimensional array
+
+Args:
+image (object): The image to be converted
+
+Returns:
+array: The image as a 1-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+return np.asarray(image.data, dtype=np.uint8) \
+ .reshape((height, width, 3))[:, :, (2, 1, 0)]
+
+
+# TODO: generalize to other datatypes and number of channels
+def toImage(array, origin="", mode="CV_8UC3"):
+"""
+
+Converts a one-dimensional array to a 2 dimensional image
+
+Args:
+array (array):
+origin (str):
+mode (int):
+
+Returns:
+object: 2 dimensional image
+
+.. versionadded:: 2.3.0
+"""
+length = np.prod(array.shape)
+
+data = bytearray(array.astype(dtype=np.int8)[:, :, (2, 1, 0)]
+  .reshape(length))
+height = array.shape[0]
+width = array.shape[1]
+nChannels = array.shape[2]
+# Creating new Row with _create_row(), because Row(name = value, ... )
--- End diff --

@imatiach-msft If the bug I've filed is the only issue, I'd prefer to use 
the public API here and make sure that we fix the bug for 2.3.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-11 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r144128955
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField("origin", StringType, true) ::
+  StructField("height", IntegerType, false) ::
+  StructField("width", IntegerType, false) ::
+  StructField("nChannels", IntegerType, false) ::
+  // OpenCV-compatible type: CV_8UC3 in most cases
+  StructField("mode", StringType, false) ::
--- End diff --

String is great, sorry for the noise.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-11 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r144127767
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+ImageSchema = StructType([
+StructField(ImageFields[0], StringType(),  True),
+StructField(ImageFields[1], IntegerType(), False),
+StructField(ImageFields[2], IntegerType(), False),
+StructField(ImageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(ImageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(ImageFields[5], BinaryType(), False)])
+
+
+# TODO: generalize to other datatypes and number of channels
+def toNDArray(image):
+"""
+Converts an image to a 1-dimensional array
+
+Args:
+image (object): The image to be converted
+
+Returns:
+array: The image as a 1-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+return np.asarray(image.data, dtype=np.uint8) \
+ .reshape((height, width, 3))[:, :, (2, 1, 0)]
--- End diff --

I believe that OpenCV in python reads images using a BGR channel order by 
default. This conflicts with the RGB default used by some other python 
libraries (eg matplotlib) but since there doesn't seem to be a clear standard, 
I suggest we don't re-order anything by default. That way spark will at least 
maintain consistent behaviour between scala and python.


http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_core/py_basic_ops/py_basic_ops.html
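To make the trade-off concrete, the reorder is a cheap view change users can apply themselves; a small numpy sketch (shapes made up):

```
import numpy as np

bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[:, :, 0] = 255           # channel 0 is blue in OpenCV's BGR layout
rgb = bgr[:, :, ::-1]        # reversed channel axis gives an RGB view, no copy
assert rgb[0, 0, 2] == 255   # blue now sits in the last channel
```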


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r143853274
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+ImageSchema = StructType([
+StructField(ImageFields[0], StringType(),  True),
+StructField(ImageFields[1], IntegerType(), False),
+StructField(ImageFields[2], IntegerType(), False),
+StructField(ImageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(ImageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(ImageFields[5], BinaryType(), False)])
+
+
+# TODO: generalize to other datatypes and number of channels
+def toNDArray(image):
+"""
+Converts an image to a 1-dimensional array
+
+Args:
+image (object): The image to be converted
+
+Returns:
+array: The image as a 1-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+return np.asarray(image.data, dtype=np.uint8) \
+ .reshape((height, width, 3))[:, :, (2, 1, 0)]
+
+
+# TODO: generalize to other datatypes and number of channels
+def toImage(array, origin="", mode="CV_8UC3"):
+"""
+
+Converts a one-dimensional array to a 2 dimensional image
+
+Args:
+array (array):
+origin (str):
+mode (int):
+
+Returns:
+object: 2 dimensional image
+
+.. versionadded:: 2.3.0
+"""
+length = np.prod(array.shape)
+
+data = bytearray(array.astype(dtype=np.int8)[:, :, (2, 1, 0)]
+  .reshape(length))
+height = array.shape[0]
+width = array.shape[1]
+nChannels = array.shape[2]
+# Creating new Row with _create_row(), because Row(name = value, ... )
--- End diff --

@holdenk I believe the ordered-by-name schema works in general. There is a serialization bug that I'm aware of; I filed it here: https://issues.apache.org/jira/browse/SPARK-22232, and I hope we can fix that for 2.3 (I can help with that).


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r143799505
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField("origin", StringType, true) ::
+  StructField("height", IntegerType, false) ::
--- End diff --

Why is the first StructField less indented than the rest? Is this a 
convention for `::`?


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r143794449
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField("origin", StringType, true) ::
+  StructField("height", IntegerType, false) ::
+  StructField("width", IntegerType, false) ::
+  StructField("nChannels", IntegerType, false) ::
+  // OpenCV-compatible type: CV_8UC3 in most cases
+  StructField("mode", StringType, false) ::
--- End diff --

I thought the intention was to use the int values from the ocvTypes map here. If we're using the string names directly in the schema, do we need the map for anything?


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r143799083
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+ImageSchema = StructType([
+StructField(ImageFields[0], StringType(),  True),
+StructField(ImageFields[1], IntegerType(), False),
+StructField(ImageFields[2], IntegerType(), False),
+StructField(ImageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(ImageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(ImageFields[5], BinaryType(), False)])
+
+
+# TODO: generalize to other datatypes and number of channels
+def toNDArray(image):
+"""
+Converts an image to a 1-dimensional array
+
+Args:
+image (object): The image to be converted
+
+Returns:
+array: The image as a 1-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+return np.asarray(image.data, dtype=np.uint8) \
+ .reshape((height, width, 3))[:, :, (2, 1, 0)]
+
+
+# TODO: generalize to other datatypes and number of channels
+def toImage(array, origin="", mode="CV_8UC3"):
+"""
+
+Converts a one-dimensional array to a 2 dimensional image
+
+Args:
+array (array):
+origin (str):
+mode (int):
+
+Returns:
+object: 2 dimensional image
+
+.. versionadded:: 2.3.0
+"""
+length = np.prod(array.shape)
+
+data = bytearray(array.astype(dtype=np.int8)[:, :, (2, 1, 0)]
+  .reshape(length))
--- End diff --

nit: do you mind using `dtype=np.uint8`? I know it's technically the same in this case but it's more consistent with how python thinks of bytes.

Also `ndarray.ravel()` is shorthand for `ndarray.reshape(length)` :).
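A quick check of the equivalence for a contiguous array:

```
import numpy as np

a = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
assert (a.reshape(np.prod(a.shape)) == a.ravel()).all()
```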


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r143798662
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+ImageSchema = StructType([
+StructField(ImageFields[0], StringType(),  True),
+StructField(ImageFields[1], IntegerType(), False),
+StructField(ImageFields[2], IntegerType(), False),
+StructField(ImageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(ImageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(ImageFields[5], BinaryType(), False)])
+
+
+# TODO: generalize to other datatypes and number of channels
+def toNDArray(image):
+"""
+Converts an image to a 1-dimensional array
+
+Args:
+image (object): The image to be converted
+
+Returns:
+array: The image as a 1-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+return np.asarray(image.data, dtype=np.uint8) \
+ .reshape((height, width, 3))[:, :, (2, 1, 0)]
--- End diff --

This code assumes `image` is a 3-channel BGR image and the user wants an RGB image. I think we should support at least the 3 image types that `readImages` can produce (CV_8UC1, CV_8UC3, and CV_8UC4), but it would be nice to also support 1, 3, and 4 channel float images.

The ndarray constructor is quite flexible and might be easier to work with than calling `asarray` in this case, because you want to treat the bytearray as a buffer, not as a sequence of ints.

```
np.ndarray(
  shape=(height, width, nChannels),
  dtype=np.uint8,
  buffer=image.data,
  strides=(width * nChannels, nChannels, 1))
```

Also, I have mixed feelings about re-ordering the channels. I think it's probably useful in the most common use-case, but (if my understanding is correct) the OpenCV types don't require a specific channel order, so we can't just assume the input is RGB or RGBA. Maybe we should avoid re-ordering, document the ordering we use wherever appropriate, and leave it up to the user to do any necessary re-ordering themselves.
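Putting those two points together, a rough channel-agnostic sketch (attribute names follow the image row fields in this file; dtype handling beyond uint8 left open):

```
import numpy as np

def to_ndarray(image):
    # Interpret the raw bytes as a (height, width, nChannels) uint8 block in
    # whatever channel order the source used; callers reorder if they need to.
    h, w, c = image.height, image.width, image.nChannels
    return np.ndarray(shape=(h, w, c), dtype=np.uint8,
                      buffer=bytes(image.data),
                      strides=(w * c, c, 1))
```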


---




[GitHub] spark issue #18896: [SPARK-21681][ML] fix bug of MLOR do not work correctly ...

2017-08-15 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/18896
  
@jkbradley please take a look when you get a chance.


---



[GitHub] spark pull request #18896: [SPARK-21681][ML] fix bug of MLOR do not work cor...

2017-08-15 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18896#discussion_r133301688
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
 assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
   }
 
+  test("test SPARK-21681") {
--- End diff --

I would include a description of the test in addition to the ticket #.


---



[GitHub] spark pull request #18888: [Spark-17025][ML][Python] Persistence for Pipelin...

2017-08-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18888#discussion_r132575143
  
--- Diff: python/pyspark/ml/pipeline.py ---
@@ -242,3 +327,65 @@ def _to_java(self):
 JavaParams._new_java_obj("org.apache.spark.ml.PipelineModel", 
self.uid, java_stages)
 
 return _java_obj
+
+
+@inherit_doc
+class SharedReadWrite():
+"""
+Functions for :py:class:`MLReader` and :py:class:`MLWriter` shared between
+:py:class:`Pipeline` and :py:class:`PipelineModel`
+
+.. versionadded:: 2.3.0
+"""
+
+@staticmethod
+def validateStages(stages):
+"""
+Check that all stages are Writable
+"""
+for stage in stages:
+if not isinstance(stage, MLWritable):
+raise ValueError("Pipeline write will fail on this pipeline " +
+ "because stage %s of type %s is not MLWritable",
+ stage.uid, type(stage))
+
+@staticmethod
+def saveImpl(instance, stages, sc, path):
+"""
+Save metadata and stages for a :py:class:`Pipeline` or :py:class:`PipelineModel`
+- save metadata to path/metadata
+- save stages to stages/IDX_UID
+"""
+stageUids = [stage.uid for stage in stages]
+jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
+DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams)
+stagesDir = os.path.join(path, "stages")
--- End diff --

@jkbradley, what's the right way to handle Paths in pyspark? Scala has 
`org.apache.hadoop.fs.Path`, is there something similar in pyspark?


---



[GitHub] spark pull request #18888: [Spark-17025][ML][Python] Persistence for Pipelin...

2017-08-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18888#discussion_r132561786
  
--- Diff: python/pyspark/ml/pipeline.py ---
@@ -242,3 +327,65 @@ def _to_java(self):
 JavaParams._new_java_obj("org.apache.spark.ml.PipelineModel", 
self.uid, java_stages)
 
 return _java_obj
+
+
+@inherit_doc
+class SharedReadWrite():
+"""
+Functions for :py:class:`MLReader` and :py:class:`MLWriter` shared between
+:py:class:`Pipeline` and :py:class:`PipelineModel`
+
+.. versionadded:: 2.3.0
+"""
+
+@staticmethod
+def validateStages(stages):
+"""
+Check that all stages are Writable
+"""
+for stage in stages:
+if not isinstance(stage, MLWritable):
+raise ValueError("Pipeline write will fail on this pipeline " +
+ "because stage %s of type %s is not MLWritable",
+ stage.uid, type(stage))
+
+@staticmethod
+def saveImpl(instance, stages, sc, path):
+"""
+Save metadata and stages for a :py:class:`Pipeline` or :py:class:`PipelineModel`
+- save metadata to path/metadata
+- save stages to stages/IDX_UID
+"""
+stageUids = [stage.uid for stage in stages]
+jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
+DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams)
+stagesDir = os.path.join(path, "stages")
+for index, stage in enumerate(stages):
+stage.write().save(SharedReadWrite
+   .getStagePath(stage.uid, index, len(stages), stagesDir))
+
+@staticmethod
+def load(metadata, sc, path):
+"""
+Load metadata and stages for a :py:class:`Pipeline` or :py:class:`PipelineModel`
+
+:return:  (UID, list of stages)
+"""
+stagesDir = os.path.join(path, "stages")
+stageUids = metadata['paramMap']['stageUids']
+stages = []
+for index, stageUid in enumerate(stageUids):
+stagePath = SharedReadWrite.getStagePath(stageUid, index, len(stageUids), stagesDir)
+stage = DefaultParamsReader.loadParamsInstance(stagePath, sc)
+stages.append(stage)
+return (metadata['uid'], stages)
+
+@staticmethod
+def getStagePath(stageUid, stageIdx, numStages, stagesDir):
--- End diff --

`stageIdx` isn't used by this method, is that intentional?
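If it isn't, I'd expect the index to feed the file name, mirroring the Scala `Pipeline.SharedReadWrite` layout of `stages/IDX_UID`; a sketch:

```
import os

def getStagePath(stageUid, stageIdx, numStages, stagesDir):
    # Zero-pad the index to the width of numStages so paths sort correctly.
    stageIdxDigits = len(str(numStages))
    return os.path.join(stagesDir, str(stageIdx).zfill(stageIdxDigits) + "_" + stageUid)

print(getStagePath("binarizer_ab12", 1, 12, "stages"))  # stages/01_binarizer_ab12
```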


---



[GitHub] spark pull request #18888: [Spark-17025][ML][Python] Persistence for Pipelin...

2017-08-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18888#discussion_r132531424
  
--- Diff: python/pyspark/ml/pipeline.py ---
@@ -16,14 +16,15 @@
 #
 
 import sys
+import os
 
 if sys.version > '3':
 basestring = str
 
 from pyspark import since, keyword_only, SparkContext
 from pyspark.ml.base import Estimator, Model, Transformer
 from pyspark.ml.param import Param, Params
-from pyspark.ml.util import JavaMLWriter, JavaMLReader, MLReadable, MLWritable
+from pyspark.ml.util import *
--- End diff --

can we do `import pyspark.ml.util as mlutil`? 


---



[GitHub] spark pull request #18888: [Spark-17025][ML][Python] Persistence for Pipelin...

2017-08-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18888#discussion_r132560721
  
--- Diff: python/pyspark/ml/pipeline.py ---
@@ -130,13 +131,20 @@ def copy(self, extra=None):
 @since("2.0.0")
 def write(self):
 """Returns an MLWriter instance for this ML instance."""
-return JavaMLWriter(self)
+allStagesAreJava = True
+stages = self.getStages()
+for stage in stages:
+if not isinstance(stage, JavaMLWritable):
+allStagesAreJava = False
--- End diff --

How about `allStagesAreJava = all(isinstance(stage, JavaMLWritable) for stage in self.getStages())`


---



[GitHub] spark pull request #18888: [Spark-17025][ML][Python] Persistence for Pipelin...

2017-08-10 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/18888#discussion_r132562392
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1142,6 +1142,35 @@ def test_nested_pipeline_persistence(self):
 except OSError:
 pass
 
+def test_python_transformer_pipeline_persistence(self):
+"""
+Pipeline[MockUnaryTransformer, Binarizer]
+"""
+temp_path = tempfile.mkdtemp()
+
+try:
+df = self.spark.range(0, 10).toDF('input')
+tf = MockUnaryTransformer(shiftVal=2)\
+.setInputCol("input").setOutputCol("shiftedInput")
+tf2 = Binarizer(threshold=6, inputCol="shiftedInput", outputCol="binarized")
+pl = Pipeline(stages=[tf, tf2])
+model = pl.fit(df)
+
+pipeline_path = temp_path + "/pipeline"
+pl.save(pipeline_path)
+loaded_pipeline = Pipeline.load(pipeline_path)
+self._compare_pipelines(pl, loaded_pipeline)
+
+model_path = temp_path + "/pipeline-model"
+model.save(model_path)
+loaded_model = PipelineModel.load(model_path)
+self._compare_pipelines(model, loaded_model)
+finally:
+try:
+rmtree(temp_path)
--- End diff --

Why do we need this in a `try` block? I worry about silencing errors in 
tests because it's a good way to miss issues. 
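I'd rather let a failing cleanup surface; a minimal pattern (not the test itself):

```
import os
import shutil
import tempfile

temp_path = tempfile.mkdtemp()
try:
    open(os.path.join(temp_path, "marker"), "w").close()  # stand-in for the test body
finally:
    shutil.rmtree(temp_path)  # if cleanup fails, the error should be visible
```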


---



[GitHub] spark issue #17373: [SPARK-12664] Expose probability in mlp model

2017-08-08 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/17373
  
Thanks for the changes @WeichenXu123. I don't have any other comments. @jkbradley, do you want to have a look?


---



[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130746928
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
// Initialize output arrays for all layers. Special treatment for InPlace
--- End diff --

Could you add the above comment in the code? It could be useful for folks reading/editing this in the future.

Also, it seems like the last layer could also be a SigmoidLayerWithSquaredError or a SigmoidFunction; do we need to handle those cases any differently?


---



[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130747030
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -363,7 +363,7 @@ private[ann] trait TopologyModel extends Serializable {
* @param data input data
* @return array of outputs for each of the layers
*/
-  def forward(data: BDM[Double]): Array[BDM[Double]]
+  def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]]
--- End diff --

Can you update the docstring for this method to add the argument?





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130746665
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala ---
@@ -463,7 +479,7 @@ private[ml] class FeedForwardModel private(
   private var outputs: Array[BDM[Double]] = null
   private var deltas: Array[BDM[Double]] = null
 
-  override def forward(data: BDM[Double]): Array[BDM[Double]] = {
+  override def forward(data: BDM[Double], containsLastLayer: Boolean): Array[BDM[Double]] = {
--- End diff --

Could we use a variable name like `includeLastLayer` here? 
`containsLastLayer` sounds like a property of the model instead of an 
instruction to the method.





[GitHub] spark pull request #17373: [SPARK-12664] Expose probability in mlp model

2017-08-01 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17373#discussion_r130747996
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
@@ -82,6 +83,23 @@ class MultilayerPerceptronClassifierSuite
 }
   }
 
+  test("test model probability") {
+val layers = Array[Int](2, 5, 2)
+val trainer = new MultilayerPerceptronClassifier()
+  .setLayers(layers)
+  .setBlockSize(1)
+  .setSeed(123L)
+  .setMaxIter(100)
+  .setSolver("l-bfgs")
+val model = trainer.fit(dataset)
+model.setProbabilityCol("probability")
+val result = model.transform(dataset)
+val features2prob = udf { features: Vector => model.mlpModel.predict(features) }
+val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 }
+assert(result.select(cmpVec(features2prob(col("features")), col("probability")))
+  .rdd.map(_.getBoolean(0)).reduce(_ && _))
+  }
+
--- End diff --

I think we should include a stronger test for this. I did a quick search 
and couldn't find a strong test for `mlpModel.predict`; it might be good to add 
one. Also, I believe this XOR dataset only produces probability predictions 
approximately equal to 0 or 1.
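
As a hedged sketch of what a stronger test might look like (assuming the `probability` column this PR exposes; the data and thresholds are illustrative, not from the Spark test suite):

```python
import numpy as np
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two overlapping Gaussian clusters: unlike XOR, many points sit near the
# decision boundary, so predicted probabilities are not saturated at 0/1.
rng = np.random.RandomState(42)
rows = []
for label, center in [(0.0, (0.0, 0.0)), (1.0, (1.0, 1.0))]:
    for _ in range(100):
        x, y = rng.normal(center, 1.0)
        rows.append((label, Vectors.dense(float(x), float(y))))
df = spark.createDataFrame(rows, ["label", "features"])

model = MultilayerPerceptronClassifier(
    layers=[2, 5, 2], seed=123, maxIter=100).fit(df)
probs = [r["probability"] for r in model.transform(df).collect()]
# Each row should be a valid distribution, and not every row saturated.
assert all(abs(sum(p) - 1.0) < 1e-6 for p in probs)
assert any(0.1 < p[0] < 0.9 for p in probs)
```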







[GitHub] spark issue #18081: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndar...

2017-05-24 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/18081
  
@viirya I was running the Python tests for pyspark-ml and pyspark-mllib, and 
this was the only place where the Python 3/NumPy interaction caused a test 
failure. There might be other places where floor (integer) division would be 
preferable to float division, but if they exist, they're not causing any of the 
tests to fail. 





[GitHub] spark issue #18081: [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndar...

2017-05-24 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/18081
  
@srowen floor division (`//`) has been in Python since 2.2 
(https://www.python.org/download/releases/2.2/).
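
A quick demonstration of the semantics (plain Python, nothing Spark-specific):

```python
# `//` floors and keeps ints as ints in both Python 2 and Python 3.
assert 7 // 2 == 3
assert isinstance(7 // 2, int)
# In Python 3, `/` is true division and yields a float for int operands.
assert 7 / 2 == 3.5
```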





[GitHub] spark pull request #18081: [SPARK-20862] BugFix - avoid passing float to nda...

2017-05-23 Thread MrBago
GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/18081

[SPARK-20862] BugFix - avoid passing float to ndarray.reshape in 
LogisticRegressionModel

## What changes were proposed in this pull request?

Fixed a TypeError with Python 3 and NumPy 1.12.1. NumPy's `reshape` no longer 
accepts floats as arguments as of 1.12. Also, Python 3 uses true (float) 
division for `/`; we should use `//` to ensure that `_dataWithBiasSize` doesn't 
get set to a float.
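
A minimal reproduction sketch of the failure mode (the variable name mirrors the description above; the sizes are made up for illustration):

```python
import numpy as np

n_features = 7
weights = np.zeros(16)

# Python 3: `/` is true division, so this is 2.0 (a float), not 2.
data_with_bias_size = weights.size / (n_features + 1)

# NumPy >= 1.12 rejects float shape arguments:
#   TypeError: 'float' object cannot be interpreted as an integer
try:
    weights.reshape(data_with_bias_size, n_features + 1)
except TypeError as e:
    print(e)

# The fix: floor division keeps the computed dimension an int.
data_with_bias_size = weights.size // (n_features + 1)
reshaped = weights.reshape(data_with_bias_size, n_features + 1)
```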

## How was this patch tested?

Existing tests run using python3 and numpy 1.12.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark BF-py3floatbug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18081


commit b53616026eac18d4e49bf5a60e077f7226afc951
Author: Bago Amirbekian <b...@databricks.com>
Date:   2017-05-24T01:01:45Z

BugFix - avoid passing float to ndarray.reshape in LogisticRegressionModel.







[GitHub] spark issue #18077: Delegate looping over paramMaps to estimators

2017-05-23 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/18077
  
@jkbradley can you take a look at this.





[GitHub] spark pull request #18077: Delegate looping over paramMaps to estimators

2017-05-23 Thread MrBago
GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/18077

Delegate looping over paramMaps to estimators 


Changes:

pyspark.ml Estimators can take either a list of param maps or a dict of 
params. This change allows the CrossValidator and TrainValidationSplit 
estimators to pass lists of param maps through to the underlying estimators, so 
that those estimators can handle parallelization when appropriate (e.g. 
distributed hyperparameter tuning). 
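
As a concrete sketch of the pass-through (the list-of-param-maps form of `Estimator.fit` already returns one model per map; `train_df` is an assumed DataFrame):

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
param_maps = [{lr.regParam: r, lr.maxIter: 10} for r in (0.0, 0.01, 0.1)]

# fit() with a list of param maps returns one model per map; with this
# change, CrossValidator hands the whole list to the estimator in a single
# call so the estimator can parallelize across maps itself.
models = lr.fit(train_df, params=param_maps)
assert len(models) == 3
```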

Testing:

Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark delegate_params

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18077.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18077


commit 312030df1c3b4bfa08ee266f3d463e409da0b42f
Author: Bago Amirbekian <b...@databricks.com>
Date:   2017-05-23T23:04:04Z

Delegate looping over paramMaps to estimator inside CrossValidator & 
TrainValidationSplit.







[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108299791
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

Thanks @jkbradley, I reverted setup.py.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-27 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/17421#discussion_r108286662
  
--- Diff: dev/sparktestsupport/modules.py ---
@@ -431,6 +431,7 @@ def __hash__(self):
 "pyspark.ml.linalg.__init__",
 "pyspark.ml.recommendation",
 "pyspark.ml.regression",
+"pyspark.ml.stat",
--- End diff --

@holdenk thanks for catching that, should be fixed now.





[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-24 Thread MrBago
GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/17421

[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest

## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest
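
For reference, a minimal usage sketch of the wrapper (mirroring the style of the doctests; the data values are illustrative):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.5, 10.0)),
     (0.0, Vectors.dense(1.5, 20.0)),
     (1.0, Vectors.dense(1.5, 30.0)),
     (0.0, Vectors.dense(3.5, 30.0)),
     (0.0, Vectors.dense(3.5, 40.0)),
     (1.0, Vectors.dense(3.5, 40.0))],
    ["label", "features"])
result = ChiSquareTest.test(df, "features", "label").head()
print(result.pValues)          # per-feature p-values
print(result.degreesOfFreedom) # per-feature degrees of freedom
print(result.statistics)       # per-feature chi-square statistics
```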

## How was this patch tested?

unit tests
doctests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark chiSquareTestWrapper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17421


commit a6bc10c9aa9166e7274d9c9ca3959a15b70e87ec
Author: Bago Amirbekian <b...@databricks.com>
Date:   2017-03-24T23:58:21Z

Added pyspark wrapper for ChiSquareTest and associated tests.






