[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19077
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83715/
Test PASSed.


---




[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19077
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19077
  
**[Test build #83715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83715/testReport)** for PR 19077 at commit [`967c73a`](https://github.com/apache/spark/commit/967c73a2462f0948029a37a84f6912983f03c126).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150376997
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,196 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+An attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+import numpy as np
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+
+
+class _ImageSchema(object):
+"""
+Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to 
be private and
+not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to 
access the
+APIs of this class.
+"""
+
+def __init__(self):
+self._imageSchema = None
+self._ocvTypes = None
+self._imageFields = None
+self._undefinedImageType = None
+
+@property
+def imageSchema(self):
+"""
+Returns the image schema.
+
+:rtype StructType: a DataFrame with a single column of images
+   named "image" (nullable)
+
+.. versionadded:: 2.3.0
+"""
+
+if self._imageSchema is None:
+ctx = SparkContext._active_spark_context
+jschema = 
ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+self._imageSchema = _parse_datatype_json_string(jschema.json())
+return self._imageSchema
+
+@property
+def ocvTypes(self):
+"""
+Returns the OpenCV type mapping supported
+
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+
+if self._ocvTypes is None:
+ctx = SparkContext._active_spark_context
+self._ocvTypes = 
dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
+return self._ocvTypes
+
+@property
+def imageFields(self):
+"""
+Returns field names of image columns.
+
+:rtype list: a list of field names.
+
+.. versionadded:: 2.3.0
+"""
+
+if self._imageFields is None:
+ctx = SparkContext._active_spark_context
+self._imageFields = 
list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+return self._imageFields
+
+@property
+def undefinedImageType(self):
+"""
+Returns the name of undefined image type for the invalid image.
+
+.. versionadded:: 2.3.0
+"""
+
+if self._undefinedImageType is None:
+ctx = SparkContext._active_spark_context
+self._undefinedImageType = \
+
ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+return self._undefinedImageType
+
+def toNDArray(self, image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image: The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+def toImage(self, array, origin=""):
+"""
+Converts an array with metadata to a two-dimensional image.
+
+:param array array: The array to convert to image
+:param str origin: Path to the image, optional
+:rtype object: Two dimensional image
+
 

[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19543
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83718/
Test PASSed.


---




[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19543
  
**[Test build #83718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83718/testReport)** for PR 19543 at commit [`158b6b9`](https://github.com/apache/spark/commit/158b6b998bd774f52430d42791e6e21a0537a110).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19543
  
**[Test build #83718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83718/testReport)** for PR 19543 at commit [`158b6b9`](https://github.com/apache/spark/commit/158b6b998bd774f52430d42791e6e21a0537a110).


---




[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19543
  
add to white list


---




[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-11-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19543
  
Jenkins, retest this please


---




[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19370
  
@jsnowacki, I happened to create another JIRA, https://issues.apache.org/jira/browse/SPARK-22495, for this PR. Would you mind linking this PR to that JIRA?


---




[GitHub] spark pull request #19543: [SPARK-19606][MESOS] Support constraints in spark...

2017-11-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19543#discussion_r150376571
  
--- Diff: 
resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala
 ---
@@ -122,4 +122,11 @@ package object config {
 "Example: key1:val1,key2:val2")
   .stringConf
   .createOptional
+
+  private[spark] val DRIVER_CONSTRAINTS =
+ConfigBuilder("spark.mesos.driver.constraints")
+  .doc("Attribute based constraints on mesos resource offers. Applied 
by the dispatcher " +
+"when launching drivers. Default is to accept all offers with 
sufficient resources.")
+  .stringConf
+  .createWithDefault("")
--- End diff --

Should this default to ""? It looks like an empty string might still match something (see the illustrative sketch below the link):

https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L267
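
To make the concern concrete, here is a purely illustrative sketch (this is not Spark's `parseConstraintString`; the naive parser below is a hypothetical stand-in) of how an empty default string can still produce a non-empty constraint map:

```scala
object EmptyConstraintCheck {
  // Hypothetical, simplified parser used only to illustrate the concern:
  // a naive split of "" yields Map("" -> Set()), i.e. a constraint on an
  // attribute named "", rather than "no constraints".
  def naiveParse(s: String): Map[String, Set[String]] =
    s.split(";").map { kv =>
      val parts = kv.split(":")
      parts(0) -> parts.drop(1).toSet
    }.toMap

  def main(args: Array[String]): Unit = {
    println(naiveParse(""))                // Map("" -> Set())
    println(naiveParse("zone:us-east-1"))  // Map(zone -> Set(us-east-1))
  }
}
```

Whether the real utility behaves this way for "" is exactly what the linked line should confirm.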


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150376396
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,196 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+An attribute of :class:`_ImageSchema` in this module.
--- End diff --

As written, this sounds like a class attribute of `_ImageSchema`. How about something like:

```
An attribute of this module that contains the instance of :class:`_ImageSchema`
```


---




[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19713
  
**[Test build #83717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83717/testReport)** for PR 19713 at commit [`c491974`](https://github.com/apache/spark/commit/c49197428168685ec0c3491ad7f16dd10f7e09e1).


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150375535
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Defines the image schema and methods to read and manipulate images.
+ */
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
--- End diff --

Let's add `(Java-specific)`, for example:

```
(Java-specific) OpenCV type mapping supported
``` 


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150375473
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,196 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+An attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+import numpy as np
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+
+
+class _ImageSchema(object):
+"""
+Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to 
be private and
+not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to 
access the
+APIs of this class.
+"""
+
+def __init__(self):
+self._imageSchema = None
+self._ocvTypes = None
+self._imageFields = None
+self._undefinedImageType = None
+
+@property
+def imageSchema(self):
+"""
+Returns the image schema.
+
+:rtype StructType: a DataFrame with a single column of images
+   named "image" (nullable)
+
+.. versionadded:: 2.3.0
+"""
+
+if self._imageSchema is None:
+ctx = SparkContext._active_spark_context
+jschema = 
ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+self._imageSchema = _parse_datatype_json_string(jschema.json())
+return self._imageSchema
+
+@property
+def ocvTypes(self):
+"""
+Returns the OpenCV type mapping supported
+
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+
+if self._ocvTypes is None:
+ctx = SparkContext._active_spark_context
+self._ocvTypes = 
dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
+return self._ocvTypes
+
+@property
+def imageFields(self):
+"""
+Returns field names of image columns.
+
+:rtype list: a list of field names.
+
+.. versionadded:: 2.3.0
+"""
+
+if self._imageFields is None:
+ctx = SparkContext._active_spark_context
+self._imageFields = 
list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+return self._imageFields
+
+@property
+def undefinedImageType(self):
+"""
+Returns the name of undefined image type for the invalid image.
+
+.. versionadded:: 2.3.0
+"""
+
+if self._undefinedImageType is None:
+ctx = SparkContext._active_spark_context
+self._undefinedImageType = \
+
ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+return self._undefinedImageType
+
+def toNDArray(self, image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image: The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+def toImage(self, array, origin=""):
+"""
+Converts an array with metadata to a two-dimensional image.
+
+:param array array: The array to convert to image
+:param str origin: Path to the image, optional
+:rtype object: Two dimensional image
+
 

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150375592
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Defines the image schema and methods to read and manipulate images.
+ */
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val javaOcvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField("origin", StringType, true) ::
+StructField("height", IntegerType, false) ::
+StructField("width", IntegerType, false) ::
+StructField("nChannels", IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField("mode", IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField("data", BinaryType, false) :: Nil)
+
+  val imageFields: Array[String] = columnSchema.fieldNames
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private[spark] def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
+val img = ImageIO.read(new ByteArrayInputStream(bytes))
+
+if (img == null) {
+  None
+} 

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150375543
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Defines the image schema and methods to read and manipulate images.
+ */
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val ocvTypes: Map[String, Int] = Map(
--- End diff --

Similarly, 

```
(Scala-specific) OpenCV type mapping supported
```


---




[GitHub] spark issue #19690: [SPARK-22467]Added a switch to support whether `stdout_s...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19690
  
**[Test build #83716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83716/testReport)** for PR 19690 at commit [`37fba60`](https://github.com/apache/spark/commit/37fba60570925be29388a828e9f66caf545ec39e).


---




[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19077
  
**[Test build #83715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83715/testReport)** for PR 19077 at commit [`967c73a`](https://github.com/apache/spark/commit/967c73a2462f0948029a37a84f6912983f03c126).


---




[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19714
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83714/
Test PASSed.


---




[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19714
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19714
  
**[Test build #83714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83714/testReport)** for PR 19714 at commit [`c78951b`](https://github.com/apache/spark/commit/c78951b818128024da351412f8a867a66bc2c441).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19208#discussion_r150373516
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") 
override val uid: String)
 instr.logParams(numFolds, seed, parallelism)
 logTuningParams(instr)
 
+val collectSubModelsParam = $(collectSubModels)
+
+var subModels: Option[Array[Array[Model[_]]]] = if (collectSubModelsParam) {
--- End diff --

Sorry I didn't follow up on this before. I think @WeichenXu123's argument is valid, but please say if there are issues I'm missing, @holdenk.


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19689
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83711/
Test PASSed.


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19689
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19689
  
**[Test build #83711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83711/testReport)** for PR 19689 at commit [`20dc217`](https://github.com/apache/spark/commit/20dc21741a5219dd19bb44806e640b7ea4eed012).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83710/
Test FAILed.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #83710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83710/testReport)** for PR 19651 at commit [`4b09faa`](https://github.com/apache/spark/commit/4b09faa5538a0949494693d02b8b95065a616a8c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150369899
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, 
SearchArgumentFactory}
+import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types._
+
+/**
+ * Utility functions to convert Spark data source filters to ORC filters.
+ */
+private[orc] object OrcFilters {
--- End diff --

Yep. [This](https://github.com/apache/spark/pull/18953/files#diff-6cac9bc2656e3782b0312dceb8c55d47) shows the difference explicitly. Mainly, it uses the new APIs:
```
-Some(builder.startAnd().equals(attribute, value).end())
+val castedValue = castLiteralValue(value, dataTypeMap(attribute))
+Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end())
```
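
For readers unfamiliar with the newer builder, a minimal sketch of the call shape (treat the exact overloads and the shaded `org.apache.orc.storage` package as assumptions based on the snippet above, not a verified reference):

```scala
import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgumentFactory}

object SargSketch {
  def main(args: Array[String]): Unit = {
    // Newer API: the column's PredicateLeaf.Type and a pre-casted literal are
    // passed explicitly instead of being inferred from the raw value.
    val sarg = SearchArgumentFactory.newBuilder()
      .startAnd()
      .equals("a", PredicateLeaf.Type.LONG, java.lang.Long.valueOf(1L))
      .end()
      .build()
    println(sarg)
  }
}
```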


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150368818
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends 
HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.metastore.version") === 
Some("1.2.1"))
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
+}
+  }
+
+  test("Checks Hive version via SET") {
--- End diff --

It might be a session-level conf.


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83713/
Test FAILed.


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83713/testReport)** for PR 19719 at commit [`42d05a7`](https://github.com/apache/spark/commit/42d05a7f052d9a4f53030b35799c2ac479fb68f0).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150368744
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
+  val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version")
--- End diff --

Or maybe a more meaningful name.


---




[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19714
  
**[Test build #83714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83714/testReport)** for PR 19714 at commit [`c78951b`](https://github.com/apache/spark/commit/c78951b818128024da351412f8a867a66bc2c441).


---




[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150368620
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
 ---
@@ -55,6 +55,7 @@ private[hive] object SparkSQLEnv extends Logging {
   metadataHive.setOut(new PrintStream(System.out, true, "UTF-8"))
   metadataHive.setInfo(new PrintStream(System.err, true, "UTF-8"))
   metadataHive.setError(new PrintStream(System.err, true, "UTF-8"))
+  sparkSession.conf.set(HiveUtils.HIVE_EXECUTION_VERSION, 
HiveUtils.builtinHiveVersion)
--- End diff --

HiveUtils.builtinHiveVersion => HiveUtils.HIVE_EXECUTION_VERSION?


---




[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19630
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83709/
Test FAILed.


---




[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19630
  
**[Test build #83709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83709/testReport)** for PR 19630 at commit [`94cded6`](https://github.com/apache/spark/commit/94cded64333956933f82fdc298d486a9aa49347c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83713/testReport)** for PR 19719 at commit [`42d05a7`](https://github.com/apache/spark/commit/42d05a7f052d9a4f53030b35799c2ac479fb68f0).


---




[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...

2017-11-10 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19714#discussion_r150368276
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -154,12 +158,12 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   // --- BroadcastHashJoin 

 
   case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, 
left, right)
-if canBuildRight(joinType) && canBroadcast(right) =>
+if canBuildRight(joinType) && canBroadcast(right, 
left.stats.hints.broadcast) =>
--- End diff --

@mgaido91 Thanks for your suggestion.


---




[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150368111
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
--- End diff --

If the Thrift Server only works with 1.2.1, I guess we don't need an isolated classloader for its Hive client, like SparkSQLCLIDriver.


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150368108
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
 ---
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.io._
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.util._
+import 
org.apache.spark.sql.execution.datasources.orc.OrcUtils.getTypeDescription
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+private[orc] class OrcSerializer(dataSchema: StructType) {
+
+  private[this] lazy val orcStruct: OrcStruct =
+createOrcValue(dataSchema).asInstanceOf[OrcStruct]
+
+  private[this] val writableWrappers =
+dataSchema.fields.map(f => getWritableWrapper(f.dataType))
+
+  def serialize(row: InternalRow): OrcStruct = {
+convertInternalRowToOrcStruct(row, dataSchema, Some(writableWrappers), 
Some(orcStruct))
+  }
+
+  /**
+   * Return a Orc value object for the given Spark schema.
+   */
+  private[this] def createOrcValue(dataType: DataType) =
+OrcStruct.createValue(getTypeDescription(dataType))
+
+  /**
+   * Convert Apache Spark InternalRow to Apache ORC OrcStruct.
+   */
+  private[this] def convertInternalRowToOrcStruct(
+  row: InternalRow,
+  schema: StructType,
+  valueWrappers: Option[Seq[Any => Any]] = None,
+  struct: Option[OrcStruct] = None): OrcStruct = {
+val wrappers =
+  
valueWrappers.getOrElse(schema.fields.map(_.dataType).map(getWritableWrapper).toSeq)
+val orcStruct = 
struct.getOrElse(createOrcValue(schema).asInstanceOf[OrcStruct])
+
+for (schemaIndex <- 0 until schema.length) {
+  val fieldType = schema(schemaIndex).dataType
+  if (row.isNullAt(schemaIndex)) {
+orcStruct.setFieldValue(schemaIndex, null)
+  } else {
+val field = row.get(schemaIndex, fieldType)
+val fieldValue = 
wrappers(schemaIndex)(field).asInstanceOf[WritableComparable[_]]
+orcStruct.setFieldValue(schemaIndex, fieldValue)
+  }
+}
+orcStruct
+  }
+
+  private[this] def withNullSafe(f: Any => Any): Any => Any = {
+input => if (input == null) null else f(input)
+  }
+
+  /**
+   * Builds a WritableComparable-return function ahead of time according 
to DataType
+   * to avoid pattern matching and branching costs per row.
+   */
+  private[this] def getWritableWrapper(dataType: DataType): Any => Any = 
dataType match {
+case NullType => _ => null
+
+case BooleanType => withNullSafe(o => new 
BooleanWritable(o.asInstanceOf[Boolean]))
+
+case ByteType => withNullSafe(o => new 
ByteWritable(o.asInstanceOf[Byte]))
--- End diff --

We can apply the same technique: pass `SpecializedGetters` as a parameter to avoid boxing (a minimal sketch follows below).
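
A minimal sketch of what that could look like (the `Converter` alias and the converter cases are assumptions for illustration, not the final code):

```scala
import org.apache.hadoop.io.{BooleanWritable, ByteWritable, IntWritable, WritableComparable}
import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.types._

object WritableConverterSketch {
  // The converter reads the primitive directly from the row at the given
  // ordinal, so no java.lang boxing happens per value.
  type Converter = (SpecializedGetters, Int) => WritableComparable[_]

  def makeConverter(dataType: DataType): Converter = dataType match {
    case BooleanType => (row, ordinal) => new BooleanWritable(row.getBoolean(ordinal))
    case ByteType    => (row, ordinal) => new ByteWritable(row.getByte(ordinal))
    case IntegerType => (row, ordinal) => new IntWritable(row.getInt(ordinal))
    case other       => throw new UnsupportedOperationException(s"not sketched: $other")
  }
}
```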


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-10 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19439
  
Thanks for the updates! My only remaining comments are about:
* Default arguments for `readImages` in Scala not being Java-friendly (I'd still recommend taking the easy route: one simple method taking just a path, plus one complex method taking all arguments; see the sketch after this list.)
* Determinism for sampling (commented above)
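
A hedged sketch of that two-method shape (parameter names and defaults are assumptions drawn from this discussion, not the merged signature):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadImagesSketch {
  // Simple entry point: path only, so Java callers never have to deal with
  // Scala default arguments.
  def readImages(path: String): DataFrame =
    readImages(path, SparkSession.builder().getOrCreate(),
      recursive = false, numPartitions = -1,
      dropImageFailures = false, sampleRatio = 1.0, seed = 0L)

  // Full entry point: every option is explicit.
  def readImages(
      path: String,
      spark: SparkSession,
      recursive: Boolean,
      numPartitions: Int,
      dropImageFailures: Boolean,
      sampleRatio: Double,
      seed: Long): DataFrame = {
    // The actual reading and decoding logic is elided in this sketch.
    ???
  }
}
```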


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83712/testReport)** for PR 19719 at commit [`28fab02`](https://github.com/apache/spark/commit/28fab02358120f396f3272ca66ae3ac8b0534e4b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150367926
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.orc.{OrcFile, TypeDescription}
+
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.types._
+
+object OrcUtils extends Logging {
+
+  // The extensions for ORC compression codecs
+  val extensionsForCompressionCodecNames = Map(
+"NONE" -> "",
+"SNAPPY" -> ".snappy",
+"ZLIB" -> ".zlib",
+"LZO" -> ".lzo")
+
+  def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
+val origPath = new Path(pathStr)
+val fs = origPath.getFileSystem(conf)
+val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
+  .filterNot(_.isDirectory)
+  .map(_.getPath)
+  .filterNot(_.getName.startsWith("_"))
+  .filterNot(_.getName.startsWith("."))
+paths
+  }
+
+  private[orc] def readSchema(file: Path, conf: Configuration): 
Option[TypeDescription] = {
+try {
+  val fs = file.getFileSystem(conf)
+  val readerOptions = OrcFile.readerOptions(conf).filesystem(fs)
+  val reader = OrcFile.createReader(file, readerOptions)
+  val schema = reader.getSchema
+  if (schema.getFieldNames.size == 0) {
+None
+  } else {
+Some(schema)
+  }
+} catch {
+  case _: IOException => None
+}
+  }
+
+  private[orc] def readSchema(sparkSession: SparkSession, files: 
Seq[FileStatus])
+  : Option[StructType] = {
+val conf = sparkSession.sessionState.newHadoopConf()
+files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { 
schema =>
+  logDebug(s"Reading schema from file $files, got Hive schema string: 
$schema")
+  
CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]
+}
+  }
+
+  private[orc] def getSchemaString(schema: StructType): String = {
+schema.fields.map(f => 
s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">")
+  }
+
+  private[orc] def getTypeDescription(dataType: DataType) = dataType match 
{
+case st: StructType => TypeDescription.fromString(getSchemaString(st))
+case _ => TypeDescription.fromString(dataType.catalogString)
+  }
+
+  /**
+   * Return a missing schema in a give ORC file.
+   */
+  private[orc] def getMissingSchema(
--- End diff --

Will we need the schema in the future? If not, I'd like to change it to return `Seq[String]`.


---




[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83712/
Test FAILed.


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150367867
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
 ---
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.hadoop.io._
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.common.`type`.HiveDecimal
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.util._
+import 
org.apache.spark.sql.execution.datasources.orc.OrcUtils.getTypeDescription
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+private[orc] class OrcSerializer(dataSchema: StructType) {
+
+  private[this] lazy val orcStruct: OrcStruct =
+createOrcValue(dataSchema).asInstanceOf[OrcStruct]
+
+  private[this] val writableWrappers =
+dataSchema.fields.map(f => getWritableWrapper(f.dataType))
+
+  def serialize(row: InternalRow): OrcStruct = {
+convertInternalRowToOrcStruct(row, dataSchema, Some(writableWrappers), 
Some(orcStruct))
+  }
+
+  /**
+   * Return a Orc value object for the given Spark schema.
+   */
+  private[this] def createOrcValue(dataType: DataType) =
+OrcStruct.createValue(getTypeDescription(dataType))
+
+  /**
+   * Convert Apache Spark InternalRow to Apache ORC OrcStruct.
+   */
+  private[this] def convertInternalRowToOrcStruct(
--- End diff --

The old `OrcSerializer` doesn't have such a method, and caches the 
`OrcStruct` for better performance. Can we try to follow it?
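
A rough sketch of that cached-struct shape (an assumption for illustration, not the final implementation; only two data types are wired up, and the PR's own helpers would replace the inline ones):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, WritableComparable}
import org.apache.orc.TypeDescription
import org.apache.orc.mapred.OrcStruct
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

class CachedOrcSerializerSketch(dataSchema: StructType) {
  private type Converter = (InternalRow, Int) => WritableComparable[_]

  // Created once and reused for every row, instead of flowing through a
  // generic convertInternalRowToOrcStruct(...) call.
  private val result: OrcStruct =
    OrcStruct.createValue(TypeDescription.fromString(dataSchema.catalogString))
      .asInstanceOf[OrcStruct]

  private def makeConverter(dataType: DataType): Converter = dataType match {
    case IntegerType => (row, i) => new IntWritable(row.getInt(i))
    case LongType    => (row, i) => new LongWritable(row.getLong(i))
    case other       => throw new UnsupportedOperationException(s"not sketched: $other")
  }

  private val converters: Array[Converter] =
    dataSchema.fields.map(f => makeConverter(f.dataType))

  def serialize(row: InternalRow): OrcStruct = {
    var i = 0
    while (i < converters.length) {
      if (row.isNullAt(i)) result.setFieldValue(i, null)
      else result.setFieldValue(i, converters(i)(row, i))
      i += 1
    }
    result
  }
}
```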


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150367580
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, 
SearchArgumentFactory}
+import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types._
+
+/**
+ * Utility functions to convert Spark data source filters to ORC filters.
+ */
+private[orc] object OrcFilters {
--- End diff --

This class looks very similar to the one in the hive module; what's the difference?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150367382
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
 ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.io._
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+private[orc] class OrcDeserializer(
+dataSchema: StructType,
+requiredSchema: StructType,
+maybeMissingSchema: Option[StructType]) {
+
+  private[this] val mutableRow = new 
SpecificInternalRow(requiredSchema.map(_.dataType))
+
+  private[this] val unwrappers = requiredSchema.fields.map(f => 
unwrapperFor(f.dataType))
+
+  def deserialize(writable: OrcStruct): InternalRow = {
+convertOrcStructToInternalRow(writable, dataSchema, requiredSchema,
+  maybeMissingSchema, Some(unwrappers), Some(mutableRow))
+  }
+
+  /**
+   * Convert Apache ORC OrcStruct to Apache Spark InternalRow.
+   * If internalRow is not None, fill into it. Otherwise, create a 
SpecificInternalRow and use it.
+   */
+  private[this] def convertOrcStructToInternalRow(
+  orcStruct: OrcStruct,
+  dataSchema: StructType,
+  requiredSchema: StructType,
+  missingSchema: Option[StructType] = None,
+  valueUnwrappers: Option[Seq[(Any, InternalRow, Int) => Unit]] = None,
+  internalRow: Option[InternalRow] = None): InternalRow = {
+val mutableRow = internalRow.getOrElse(new 
SpecificInternalRow(requiredSchema.map(_.dataType)))
+val unwrappers =
+  
valueUnwrappers.getOrElse(requiredSchema.fields.map(_.dataType).map(unwrapperFor).toSeq)
+var i = 0
+val len = requiredSchema.length
+val names = orcStruct.getSchema.getFieldNames
+while (i < len) {
+  val name = requiredSchema(i).name
+  val writable = if (missingSchema.isEmpty || 
missingSchema.get.getFieldIndex(name).isEmpty) {
--- End diff --

Can we strictly follow the style in `OrcFileFormat.unwrapOrcStructs`? I.e., no method like `convertOrcStructToInternalRow`; top-level columns and struct fields are handled with separate while loops.
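
A minimal sketch of that flat style, assuming the member names from the quoted diff (`requiredSchema`, `mutableRow`, `unwrappers` with signature `(Any, InternalRow, Int) => Unit`) and `OrcStruct.getFieldValue(name)`:

```scala
// Sketch only: one while loop over top-level columns, writing straight into the
// shared mutableRow; nested struct fields would get their own loop elsewhere.
def deserialize(orcStruct: OrcStruct): InternalRow = {
  var i = 0
  while (i < requiredSchema.length) {
    val writable = orcStruct.getFieldValue(requiredSchema(i).name)
    if (writable == null) {
      mutableRow.setNullAt(i)
    } else {
      unwrappers(i)(writable, mutableRow, i)
    }
    i += 1
  }
  mutableRow
}
```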


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150367190
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala 
---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the spark recursive flag and then restores it.
+   *
+   * @param value Value to set
+   * @param spark Existing spark session
+   * @param f The function to evaluate after setting the flag
+   * @return Returns the evaluation result T of the function
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): 
T = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+hadoopConf.set(flagName, value.toString)
+try f finally {
+  old match {
+case Some(v) => hadoopConf.set(flagName, v)
+case None => hadoopConf.unset(flagName)
+  }
+}
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff --

I would prefer determinism since that's a pretty important standard in 
Spark.  I could imagine either (a) using a file hash with a *global* random 
number or (b) using random numbers if we are certain about how PathFilters work.

For (a):
* Why is there a worry about duplicate filenames?  Is the full path not 
available?
* If you do hash filenames, then I wouldn't generate a random number for 
each row.  (If you're generating a random number per row, then why not just use 
that for sampling and skip the hash?)  You could generate a single random 
number on the driver and use that in the comparison with each hash.

For (b):
* If we knew how PathFilters were consumed, then we could presumably figure 
out a way to make this deterministic just by setting a random seed here. E.g., 
if a new PathFilter instance were instantiated to read each partition, then 
that would work.  But if PathFilters are shared across reads of multiple 
partitions, then partition ordering could cause problems with determinism.
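
A rough sketch of option (a), purely illustrative: the class name and constructor parameters are hypothetical, and it ignores how a real PathFilter would receive its parameters through the Hadoop configuration.

```scala
import org.apache.hadoop.fs.{Path, PathFilter}
import scala.util.hashing.MurmurHash3

// Sketch only: keep a file iff a hash of its full path, mixed with one
// driver-chosen seed, falls below the sampling ratio. The decision depends only
// on (path, seed), so it is deterministic across partitions and re-reads.
class DeterministicSamplePathFilter(sampleRatio: Double, seed: Int) extends PathFilter {
  override def accept(path: Path): Boolean = {
    val h = MurmurHash3.stringHash(path.toString, seed) & Int.MaxValue  // non-negative
    h.toDouble / Int.MaxValue < sampleRatio
  }
}
```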


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r150367100
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
 ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.hadoop.io._
+import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp}
+import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+private[orc] class OrcDeserializer(
+dataSchema: StructType,
+requiredSchema: StructType,
+maybeMissingSchema: Option[StructType]) {
+
+  private[this] val mutableRow = new 
SpecificInternalRow(requiredSchema.map(_.dataType))
+
+  private[this] val unwrappers = requiredSchema.fields.map(f => 
unwrapperFor(f.dataType))
+
+  def deserialize(writable: OrcStruct): InternalRow = {
+convertOrcStructToInternalRow(writable, dataSchema, requiredSchema,
+  maybeMissingSchema, Some(unwrappers), Some(mutableRow))
+  }
+
+  /**
+   * Convert Apache ORC OrcStruct to Apache Spark InternalRow.
+   * If internalRow is not None, fill into it. Otherwise, create a 
SpecificInternalRow and use it.
+   */
+  private[this] def convertOrcStructToInternalRow(
+  orcStruct: OrcStruct,
+  dataSchema: StructType,
+  requiredSchema: StructType,
+  missingSchema: Option[StructType] = None,
+  valueUnwrappers: Option[Seq[(Any, InternalRow, Int) => Unit]] = None,
+  internalRow: Option[InternalRow] = None): InternalRow = {
+val mutableRow = internalRow.getOrElse(new 
SpecificInternalRow(requiredSchema.map(_.dataType)))
+val unwrappers =
+  
valueUnwrappers.getOrElse(requiredSchema.fields.map(_.dataType).map(unwrapperFor).toSeq)
+var i = 0
+val len = requiredSchema.length
+val names = orcStruct.getSchema.getFieldNames
+while (i < len) {
+  val name = requiredSchema(i).name
+  val writable = if (missingSchema.isEmpty || 
missingSchema.get.getFieldIndex(name).isEmpty) {
--- End diff --

I think this check is only needed for top-level columns.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19640: [SPARK-16986][CORE][WEB-UI] Support configure history se...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19640
  
I think it makes sense, but the implementation is too hacky. We should extract the time epoch and convert it to a timestamp string using the local timezone.
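
A minimal sketch of that direction (the method and parameter names are illustrative): keep the epoch millis as the source of truth and only render it in the local timezone at display time.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Sketch only: convert an epoch in milliseconds to a timestamp string in the
// JVM's default (local) timezone.
def formatEpoch(epochMillis: Long): String = {
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneId.systemDefault())
    .format(Instant.ofEpochMilli(epochMillis))
}
```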


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150366571
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
+  val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version")
--- End diff --

This config is kind of a fake one; I think comments are enough for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150366512
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
--- End diff --

It's ODBC. Actually, it's something out of Spark's control, so we must be careful and avoid behavior changes; for example, `conf.get("spark.sql.hive.version")` should still return `1.2.1` instead of null.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83712 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83712/testReport)**
 for PR 19719 at commit 
[`28fab02`](https://github.com/apache/spark/commit/28fab02358120f396f3272ca66ae3ac8b0534e4b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150366200
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends 
HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.metastore.version") === 
Some("1.2.1"))
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
+}
+  }
+
+  test("Checks Hive version via SET") {
+withJdbcStatement() { statement =>
+  val resultSet = statement.executeQuery("SET")
+
+  val conf = mutable.Map.empty[String, String]
+  while (resultSet.next()) {
+conf += resultSet.getString(1) -> resultSet.getString(2)
+  }
+
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

In Hive, "SET" returns all changed properties while "SET -v" returns all 
properties. In Spark, `SET` ueries all key-value pairs that are **set** in the 
SQLConf of the sparkSession.
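
A hedged illustration of the distinction (the config key and value are only examples):

```scala
// Sketch only: SET shows key-value pairs explicitly set in the session's SQLConf,
// while SET -v lists every defined SQL config together with its default and doc.
spark.sql("SET spark.sql.shuffle.partitions=10")
spark.sql("SET").show(truncate = false)
spark.sql("SET -v").show(truncate = false)
```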


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19640: [SPARK-16986][CORE][WEB-UI] Support configure history se...

2017-11-10 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19640
  
@cloud-fan For the UI part, how about this PR: 
https://github.com/apache/spark/pull/14577


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/19719
  
Making it an AlternativeConfig might be better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150364666
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
--- End diff --

I guess the JDBC driver does not depend on it. It used to only return `1.2.1`, but I think this server may work fine with other Hive versions.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150364544
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala 
---
@@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging {
 .stringConf
 .createWithDefault(builtinHiveVersion)
 
+  // A deprecated config which is only used to provide a default value, in 
case some existing
+  // applications depend on this config, e.g. Spark SQL ODBC driver.
+  val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version")
--- End diff --

A doc entry may be needed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18134
  
Yeah, you can always implement such a UDF.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread sergiobilello-eb
Github user sergiobilello-eb commented on the issue:

https://github.com/apache/spark/pull/18134
  
Thanks @gatorsmile! Do you suggest any workaround until then? I mean, without rebuilding Spark with that patch. Can I register my own UDF that contains that logic?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18134
  
https://issues.apache.org/jira/browse/SPARK-20909 

This is not part of 2.2. Based on the JIRA, it will be included in 2.3.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19689
  
**[Test build #83711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83711/testReport)**
 for PR 19689 at commit 
[`20dc217`](https://github.com/apache/spark/commit/20dc21741a5219dd19bb44806e640b7ea4eed012).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread sergiobilello-eb
Github user sergiobilello-eb commented on the issue:

https://github.com/apache/spark/pull/18134
  
```
spark-sql> select dayofweek('2007-02-03'), dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15');
17/11/10 16:08:23 INFO SparkSqlParser: Parsing command: select dayofweek('2007-02-03'), dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15')
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_database: default
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_database: default
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_database: default
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_database: default
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_function: default.dayofweek
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_function: default.dayofweek
Error in query: Undefined function: 'dayofweek'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread sergiobilello-eb
Github user sergiobilello-eb commented on the issue:

https://github.com/apache/spark/pull/18134
  
thanks @gatorsmile :) 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150363309
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends 
HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.metastore.version") === 
Some("1.2.1"))
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
+}
+  }
+
+  test("Checks Hive version via SET") {
+withJdbcStatement() { statement =>
+  val resultSet = statement.executeQuery("SET")
+
+  val conf = mutable.Map.empty[String, String]
+  while (resultSet.next()) {
+conf += resultSet.getString(1) -> resultSet.getString(2)
+  }
+
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

Nope, I don't think so. Try it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18134
  
@sergiobilello-eb This is just a SQL function. You can call it in the SQL interface or by using `df.select(expr("dayofweek('2009-07-30')"))`. It is not part of the DataFrame functions. You can submit a PR or report it as an issue to add such an API.
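
A minimal sketch of such a workaround UDF for Spark 2.1/2.2 (the UDF name and formula are illustrative, not the built-in implementation added by this PR):

```scala
import java.time.LocalDate

// Sketch only: java.time uses MONDAY=1..SUNDAY=7; shift so SUNDAY=1..SATURDAY=7,
// matching the Hive/MySQL DAYOFWEEK convention.
spark.udf.register("dayofweek_udf", (d: String) =>
  LocalDate.parse(d).getDayOfWeek.getValue % 7 + 1)

spark.sql("SELECT dayofweek_udf('2009-07-30')").show()
```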


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #83710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83710/testReport)**
 for PR 19651 at commit 
[`4b09faa`](https://github.com/apache/spark/commit/4b09faa5538a0949494693d02b8b95065a616a8c).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...

2017-11-10 Thread sergiobilello-eb
Github user sergiobilello-eb commented on the issue:

https://github.com/apache/spark/pull/18134
  
This function is not listed in the API documentation:
https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/functions.html
or
https://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/sql/functions.html
How is that possible? I am trying to use it from spark-sql on a Spark 2.1.0 cluster.
Thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150362312
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends 
HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.metastore.version") === 
Some("1.2.1"))
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
+}
+  }
+
+  test("Checks Hive version via SET") {
+withJdbcStatement() { statement =>
+  val resultSet = statement.executeQuery("SET")
+
+  val conf = mutable.Map.empty[String, String]
+  while (resultSet.next()) {
+conf += resultSet.getString(1) -> resultSet.getString(2)
+  }
+
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

a default value will be returned, won't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150361985
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
+  

[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-10 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19601
  
@cloud-fan could you please review this again? Now, this PR does not apply any changes to `ColumnVector` and `WritableColumnVector`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-11-10 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19077
  
According to [this material](http://s3-eu-west-1.amazonaws.com/presentations2013/10_presentation.pdf#pages=42), `Unsafe.allocateMemory` uses an 8-byte alignment boundary.
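
A small illustration of the implication for reusing pooled heap buffers (pure illustration, not code from this PR):

```scala
// Sketch only: if allocations are aligned to 8-byte boundaries anyway, comparing
// sizes rounded up to the next multiple of 8 is a reasonable reuse criterion.
def alignedSize(size: Long): Long = (size + 7L) & ~7L

assert(alignedSize(13L) == 16L)
assert(alignedSize(16L) == 16L)
```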


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...

2017-11-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19407


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...

2017-11-10 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19719#discussion_r150358375
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala
 ---
@@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends 
HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.metastore.version") === 
Some("1.2.1"))
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
+}
+  }
+
+  test("Checks Hive version via SET") {
+withJdbcStatement() { statement =>
+  val resultSet = statement.executeQuery("SET")
+
+  val conf = mutable.Map.empty[String, String]
+  while (resultSet.next()) {
+conf += resultSet.getString(1) -> resultSet.getString(2)
+  }
+
+  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

If we do not set it explicitly, this will not be returned.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-11-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19407
  
Thanks! Merging to master and 2.2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19407
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83704/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19407
  
**[Test build #83704 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83704/testReport)**
 for PR 19407 at commit 
[`788fbf3`](https://github.com/apache/spark/commit/788fbf309261f1b003d5047ad4c86039de2fe16e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83703/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19459
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19459
  
**[Test build #83703 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83703/testReport)**
 for PR 19459 at commit 
[`6c72e37`](https://github.com/apache/spark/commit/6c72e37b0ca520d2756722ce2f18fae3ea32c39e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83705/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83705 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83705/testReport)**
 for PR 19719 at commit 
[`7b767bf`](https://github.com/apache/spark/commit/7b767bfb010f7daeecbc8475c92bc1137e7e019a).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83708/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19719
  
**[Test build #83708 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83708/testReport)**
 for PR 19719 at commit 
[`0ce241d`](https://github.com/apache/spark/commit/0ce241da78c9a59fe255a76a6271313b5bc7d039).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19719
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19720
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...

2017-11-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19720
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83707/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...

2017-11-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19720
  
**[Test build #83707 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83707/testReport)**
 for PR 19720 at commit 
[`f0edd7e`](https://github.com/apache/spark/commit/f0edd7e077c84b6a890f7ed9cff2eefadf5eee33).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r150349566
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
+   * Find the best split for a node.
+   *
+   * @param binAggregates Bin statistics.
+   * @return tuple for best split: (Split, information gain, prediction at 
node)
+   */
+  private[tree] def binsToBestSplit(
+  binAggregates: DTStatsAggregator,
+  splits: Array[Array[Split]],
+  featuresForNode: Option[Array[Int]],
+  node: LearningNode): (Split, ImpurityStats) = {
+val validFeatureSplits = 
getNonConstantFeatures(binAggregates.metadata, featuresForNode)
+// For each (feature, split), calculate the gain, and select the best 
(feature, split).
+val parentImpurityCalc = if (node.stats == null) None else 
Some(node.stats.impurityCalculator)
--- End diff --

Note to check: will node.stats definitely be null for the top-level node?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r150159113
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.{CategoricalSplit, Split}
+import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Utility methods for choosing splits during local & distributed tree 
training. */
+private[impl] object SplitUtils {
+
+  /** Sorts ordered feature categories by label centroid, returning an 
ordered list of categories */
+  private def sortByCentroid(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int): List[Int] = {
+/* Each bin is one category (feature value).
+ * The bins are ordered based on centroidForCategories, and this 
ordering determines which
+ * splits are considered.  (With K categories, we consider K - 1 
possible splits.)
+ *
+ * centroidForCategories is a list: (category, centroid)
+ */
+val numCategories = binAggregates.metadata.numBins(featureIndex)
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+val centroidForCategories = Range(0, numCategories).map { featureValue 
=>
+  val categoryStats =
+binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
+  val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, 
categoryStats)
+  (featureValue, centroid)
+}
+// TODO(smurching): How to handle logging statements like these?
+// logDebug("Centroids for categorical variable: " + 
centroidForCategories.mkString(","))
+// bins sorted by centroids
+val categoriesSortedByCentroid = 
centroidForCategories.toList.sortBy(_._2).map(_._1)
+// logDebug("Sorted centroids for categorical variable = " +
+//   categoriesSortedByCentroid.mkString(","))
+categoriesSortedByCentroid
+  }
+
+  /**
+   * Find the best split for an unordered categorical feature at a single 
node.
+   *
+   * Algorithm:
+   *  - Considers all possible subsets (exponentially many)
+   *
+   * @param featureIndex  Global index of feature being split.
+   * @param featureIndexIdx Index of feature being split within subset of 
features for current node.
+   * @param featureSplits Array of splits for the current feature
+   * @param parentCalculator Optional: ImpurityCalculator containing 
impurity stats for current node
+   * @return  (best split, statistics for split)  If no valid split was 
found, the returned
+   *  ImpurityStats instance will be invalid (have member valid = 
false).
+   */
+  private[impl] def chooseUnorderedCategoricalSplit(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int,
+  featureSplits: Array[Split],
+  parentCalculator: Option[ImpurityCalculator] = None): (Split, 
ImpurityStats) = {
+// Unordered categorical feature
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+val numSplits = binAggregates.metadata.numSplits(featureIndex)
+var parentCalc = parentCalculator
+val (bestFeatureSplitIndex, bestFeatureGainStats) =
+  Range(0, numSplits).map { splitIndex =>
+val leftChildStats = 
binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
+val rightChildStats = binAggregates.getParentImpurityCalculator()
+  .subtract(leftChildStats)
+val gainAndImpurityStats = 
ImpurityUtils.calculateImpurityStats(parentCalc,
+  leftChildStats, rightChildStats, binAggregates.metadata)
+// Compute parent stats once, when considering first split for 
current feature
+if 

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r150159513
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.{CategoricalSplit, Split}
+import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Utility methods for choosing splits during local & distributed tree 
training. */
+private[impl] object SplitUtils {
+
+  /** Sorts ordered feature categories by label centroid, returning an 
ordered list of categories */
+  private def sortByCentroid(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int): List[Int] = {
+/* Each bin is one category (feature value).
+ * The bins are ordered based on centroidForCategories, and this 
ordering determines which
+ * splits are considered.  (With K categories, we consider K - 1 
possible splits.)
+ *
+ * centroidForCategories is a list: (category, centroid)
+ */
+val numCategories = binAggregates.metadata.numBins(featureIndex)
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+val centroidForCategories = Range(0, numCategories).map { featureValue 
=>
+  val categoryStats =
+binAggregates.getImpurityCalculator(nodeFeatureOffset, 
featureValue)
+  val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, 
categoryStats)
+  (featureValue, centroid)
+}
+// TODO(smurching): How to handle logging statements like these?
+// logDebug("Centroids for categorical variable: " + 
centroidForCategories.mkString(","))
+// bins sorted by centroids
+val categoriesSortedByCentroid = 
centroidForCategories.toList.sortBy(_._2).map(_._1)
+// logDebug("Sorted centroids for categorical variable = " +
+//   categoriesSortedByCentroid.mkString(","))
+categoriesSortedByCentroid
+  }
+
+  /**
+   * Find the best split for an unordered categorical feature at a single 
node.
+   *
+   * Algorithm:
+   *  - Considers all possible subsets (exponentially many)
+   *
+   * @param featureIndex  Global index of feature being split.
+   * @param featureIndexIdx Index of feature being split within subset of 
features for current node.
+   * @param featureSplits Array of splits for the current feature
+   * @param parentCalculator Optional: ImpurityCalculator containing 
impurity stats for current node
+   * @return  (best split, statistics for split)  If no valid split was 
found, the returned
+   *  ImpurityStats instance will be invalid (have member valid = 
false).
+   */
+  private[impl] def chooseUnorderedCategoricalSplit(
+  binAggregates: DTStatsAggregator,
+  featureIndex: Int,
+  featureIndexIdx: Int,
+  featureSplits: Array[Split],
+  parentCalculator: Option[ImpurityCalculator] = None): (Split, 
ImpurityStats) = {
+// Unordered categorical feature
+val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+val numSplits = binAggregates.metadata.numSplits(featureIndex)
+var parentCalc = parentCalculator
+val (bestFeatureSplitIndex, bestFeatureGainStats) =
+  Range(0, numSplits).map { splitIndex =>
+val leftChildStats = 
binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
+val rightChildStats = binAggregates.getParentImpurityCalculator()
+  .subtract(leftChildStats)
+val gainAndImpurityStats = 
ImpurityUtils.calculateImpurityStats(parentCalc,
+  leftChildStats, rightChildStats, binAggregates.metadata)
+// Compute parent stats once, when considering first split for 
current feature
+if 

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r150158027
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
-   * Calculate the impurity statistics for a given (feature, split) based 
upon left/right
-   * aggregates.
-   *
-   * @param stats the recycle impurity statistics for this feature's all 
splits,
-   *  only 'impurity' and 'impurityCalculator' are valid 
between each iteration
-   * @param leftImpurityCalculator left node aggregates for this (feature, 
split)
-   * @param rightImpurityCalculator right node aggregate for this 
(feature, split)
-   * @param metadata learning and dataset metadata for DecisionTree
-   * @return Impurity statistics for this (feature, split)
+   * Return a list of pairs (featureIndexIdx, featureIndex) where 
featureIndex is the global
+   * (across all trees) index of a feature and featureIndexIdx is the 
index of a feature within the
+   * list of features for a given node. Filters out constant features 
(features with 0 splits)
*/
-  private def calculateImpurityStats(
-  stats: ImpurityStats,
-  leftImpurityCalculator: ImpurityCalculator,
-  rightImpurityCalculator: ImpurityCalculator,
-  metadata: DecisionTreeMetadata): ImpurityStats = {
-
-val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
-  leftImpurityCalculator.copy.add(rightImpurityCalculator)
-} else {
-  stats.impurityCalculator
-}
-
-val impurity: Double = if (stats == null) {
-  parentImpurityCalculator.calculate()
-} else {
-  stats.impurity
-}
-
-val leftCount = leftImpurityCalculator.count
-val rightCount = rightImpurityCalculator.count
-
-val totalCount = leftCount + rightCount
-
-// If left child or right child doesn't satisfy minimum instances per 
node,
-// then this split is invalid, return invalid information gain stats.
-if ((leftCount < metadata.minInstancesPerNode) ||
-  (rightCount < metadata.minInstancesPerNode)) {
-  return 
ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
-}
-
-val leftImpurity = leftImpurityCalculator.calculate() // Note: This 
equals 0 if count = 0
-val rightImpurity = rightImpurityCalculator.calculate()
-
-val leftWeight = leftCount / totalCount.toDouble
-val rightWeight = rightCount / totalCount.toDouble
-
-val gain = impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
-
-// if information gain doesn't satisfy minimum information gain,
-// then this split is invalid, return invalid information gain stats.
-if (gain < metadata.minInfoGain) {
-  return 
ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
+  private[impl] def getNonConstantFeatures(
+  metadata: DecisionTreeMetadata,
+  featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
+Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
--- End diff --

Was there a reason to remove the use of view and withFilter here?  With the 
output of this method going through further Seq operations, I would expect the 
previous implementation to be more efficient.
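
For reference, a sketch of the lazier shape the comment refers to (reconstructed from the surrounding diff, not copied from a specific revision):

```scala
// Sketch only: .view defers the map and .withFilter avoids materializing an
// intermediate filtered collection before the caller's subsequent operations.
val validFeatureSplits =
  Range(0, metadata.numFeaturesPerNode).view.map { featureIndexIdx =>
    featuresForNode.map(features => (featureIndexIdx, features(featureIndexIdx)))
      .getOrElse((featureIndexIdx, featureIndexIdx))
  }.withFilter { case (_, featureIndex) =>
    metadata.numSplits(featureIndex) != 0
  }
```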


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


