[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83715/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19077 Merged build finished. Test PASSed.
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19077 **[Test build #83715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83715/testReport)** for PR 19077 at commit [`967c73a`](https://github.com/apache/spark/commit/967c73a2462f0948029a37a84f6912983f03c126). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150376997 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,196 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +An attribute of :class:`_ImageSchema` in this module. + +.. autoclass:: _ImageSchema + :members: +""" + +import numpy as np +from pyspark import SparkContext +from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string +from pyspark.sql import DataFrame, SparkSession + + +class _ImageSchema(object): +""" +Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and +not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the +APIs of this class. +""" + +def __init__(self): +self._imageSchema = None +self._ocvTypes = None +self._imageFields = None +self._undefinedImageType = None + +@property +def imageSchema(self): +""" +Returns the image schema. + +:rtype StructType: a DataFrame with a single column of images + named "image" (nullable) + +.. 
versionadded:: 2.3.0 +""" + +if self._imageSchema is None: +ctx = SparkContext._active_spark_context +jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() +self._imageSchema = _parse_datatype_json_string(jschema.json()) +return self._imageSchema + +@property +def ocvTypes(self): +""" +Returns the OpenCV type mapping supported + +:rtype dict: The OpenCV type mapping supported + +.. versionadded:: 2.3.0 +""" + +if self._ocvTypes is None: +ctx = SparkContext._active_spark_context +self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) +return self._ocvTypes + +@property +def imageFields(self): +""" +Returns field names of image columns. + +:rtype list: a list of field names. + +.. versionadded:: 2.3.0 +""" + +if self._imageFields is None: +ctx = SparkContext._active_spark_context +self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields()) +return self._imageFields + +@property +def undefinedImageType(self): +""" +Returns the name of undefined image type for the invalid image. + +.. versionadded:: 2.3.0 +""" + +if self._undefinedImageType is None: +ctx = SparkContext._active_spark_context +self._undefinedImageType = \ + ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType() +return self._undefinedImageType + +def toNDArray(self, image): +""" +Converts an image to a one-dimensional array. + +:param image: The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" + +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + +def toImage(self, array, origin=""): +""" +Converts an array with metadata to a two-dimensional image. + +:param array array: The array to convert to image +:param str origin: Path to the image, optional +:rtype object: Two dimensional image +
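[Editor's note] The `toNDArray` helper quoted above reinterprets the flat, OpenCV-ordered image bytes as a `(height, width, nChannels)` array purely via explicit strides, with no copy. A minimal standalone sketch of the same trick (the sample data here is illustrative, not from the PR):

```python
import numpy as np

# Flat row-major bytes for a 2x2 image with 3 channels, as the image schema stores them.
height, width, n_channels = 2, 2, 3
flat = bytes(range(height * width * n_channels))  # 12 bytes: 0..11

# Reinterpret without copying: each row advances width*nChannels bytes,
# each pixel advances nChannels bytes, each channel 1 byte.
img = np.ndarray(
    shape=(height, width, n_channels),
    dtype=np.uint8,
    buffer=flat,
    strides=(width * n_channels, n_channels, 1))

print(img[1, 0].tolist())  # second row, first pixel -> [6, 7, 8]
```

Because the buffer is a `bytes` object, the resulting array is read-only; the PR's version works the same way on the row's binary `data` field.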
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19543 Merged build finished. Test PASSed.
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83718/ Test PASSed.
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19543 **[Test build #83718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83718/testReport)** for PR 19543 at commit [`158b6b9`](https://github.com/apache/spark/commit/158b6b998bd774f52430d42791e6e21a0537a110). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19543 **[Test build #83718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83718/testReport)** for PR 19543 at commit [`158b6b9`](https://github.com/apache/spark/commit/158b6b998bd774f52430d42791e6e21a0537a110).
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19543 add to white list
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19543 Jenkins, retest this please
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19370 @jsnowacki, I happened to make another JIRA - https://issues.apache.org/jira/browse/SPARK-22495 for this PR. Mind linking this PR to the JIRA?
[GitHub] spark pull request #19543: [SPARK-19606][MESOS] Support constraints in spark...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19543#discussion_r150376571 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala --- @@ -122,4 +122,11 @@ package object config { "Example: key1:val1,key2:val2") .stringConf .createOptional + + private[spark] val DRIVER_CONSTRAINTS = +ConfigBuilder("spark.mesos.driver.constraints") + .doc("Attribute based constraints on mesos resource offers. Applied by the dispatcher " + +"when launching drivers. Default is to accept all offers with sufficient resources.") + .stringConf + .createWithDefault("") --- End diff -- should this default to ""? It looks like it might still match something: https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L267
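[Editor's note] The reviewer's question is about the semantics of an empty constraint string. Spark's Mesos constraint strings take roughly the form `attr1:v1,v2;attr2:v3`, and an empty string parses to an empty constraint map that trivially matches every offer. A hedged Python sketch of that logic (an approximation of the Scala helper in MesosSchedulerUtils, not the actual implementation; function names are illustrative):

```python
def parse_constraint_string(s):
    """Parse 'attr1:v1,v2;attr2:v3' into {attribute: set_of_values}.
    An empty string yields an empty map, i.e. no constraints."""
    if not s:
        return {}
    constraints = {}
    for clause in s.split(";"):
        attr, _, values = clause.partition(":")
        constraints[attr] = set(values.split(",")) if values else set()
    return constraints

def matches(constraints, offer_attrs):
    """An offer matches when every constrained attribute is present in the
    offer and, if values were given, the offer's value is among them."""
    return all(
        attr in offer_attrs and (not vals or offer_attrs[attr] in vals)
        for attr, vals in constraints.items())

print(matches(parse_constraint_string(""), {"zone": "us-east"}))         # empty constraints match
print(matches(parse_constraint_string("zone:us-east"), {"zone": "eu"}))  # value mismatch
```

Under this reading, defaulting to `""` means "accept all offers with sufficient resources", consistent with the config's doc string.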
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150376396 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,196 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +An attribute of :class:`_ImageSchema` in this module. --- End diff -- This actually sounds like a class attribute in `_ImageSchema`. How about ... something like .. ``` An attribute of this module that contains the instance of :class:`_ImageSchema` ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19713 **[Test build #83717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83717/testReport)** for PR 19713 at commit [`c491974`](https://github.com/apache/spark/commit/c49197428168685ec0c3491ad7f16dd10f7e09e1).
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150375535 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,239 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Defines the image schema and methods to read and manipulate images. 
+ */ +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python --- End diff -- Let's add `(Java-specific)`, for example, ``` (Java-specific) OpenCV type mapping supported ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150375473 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,196 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +.. attribute:: ImageSchema + +An attribute of :class:`_ImageSchema` in this module. + +.. autoclass:: _ImageSchema + :members: +""" + +import numpy as np +from pyspark import SparkContext +from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string +from pyspark.sql import DataFrame, SparkSession + + +class _ImageSchema(object): +""" +Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and +not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the +APIs of this class. +""" + +def __init__(self): +self._imageSchema = None +self._ocvTypes = None +self._imageFields = None +self._undefinedImageType = None + +@property +def imageSchema(self): +""" +Returns the image schema. + +:rtype StructType: a DataFrame with a single column of images + named "image" (nullable) + +.. 
versionadded:: 2.3.0 +""" + +if self._imageSchema is None: +ctx = SparkContext._active_spark_context +jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() +self._imageSchema = _parse_datatype_json_string(jschema.json()) +return self._imageSchema + +@property +def ocvTypes(self): +""" +Returns the OpenCV type mapping supported + +:rtype dict: The OpenCV type mapping supported + +.. versionadded:: 2.3.0 +""" + +if self._ocvTypes is None: +ctx = SparkContext._active_spark_context +self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) +return self._ocvTypes + +@property +def imageFields(self): +""" +Returns field names of image columns. + +:rtype list: a list of field names. + +.. versionadded:: 2.3.0 +""" + +if self._imageFields is None: +ctx = SparkContext._active_spark_context +self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields()) +return self._imageFields + +@property +def undefinedImageType(self): +""" +Returns the name of undefined image type for the invalid image. + +.. versionadded:: 2.3.0 +""" + +if self._undefinedImageType is None: +ctx = SparkContext._active_spark_context +self._undefinedImageType = \ + ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType() +return self._undefinedImageType + +def toNDArray(self, image): +""" +Converts an image to a one-dimensional array. + +:param image: The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" + +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + +def toImage(self, array, origin=""): +""" +Converts an array with metadata to a two-dimensional image. + +:param array array: The array to convert to image +:param str origin: Path to the image, optional +:rtype object: Two dimensional image +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150375592 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,239 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Defines the image schema and methods to read and manipulate images. 
+ */ +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val javaOcvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField("origin", StringType, true) :: +StructField("height", IntegerType, false) :: +StructField("width", IntegerType, false) :: +StructField("nChannels", IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField("mode", IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField("data", BinaryType, false) :: Nil) + + val imageFields: Array[String] = columnSchema.fieldNames + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * Gets the width of the image + * + * @return The width of the image + */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * 
@param origin Origin of the invalid image + * @return Row with the default values + */ + private[spark] def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +val img = ImageIO.read(new ByteArrayInputStream(bytes)) + +if (img == null) { + None +}
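[Editor's note] The `ocvTypes` map quoted above (`"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24`) follows OpenCV's type encoding, where a Mat type id packs the element depth in the low 3 bits and the channel count minus one above them. A short sketch verifying the mapping:

```python
CV_DEPTH_8U = 0  # OpenCV depth code for unsigned 8-bit elements

def cv_maketype(depth, channels):
    # Mirrors OpenCV's CV_MAKETYPE macro: low 3 bits hold the depth,
    # the bits above hold (channels - 1).
    return depth + ((channels - 1) << 3)

ocv_types = {
    "CV_8UC1": cv_maketype(CV_DEPTH_8U, 1),
    "CV_8UC3": cv_maketype(CV_DEPTH_8U, 3),
    "CV_8UC4": cv_maketype(CV_DEPTH_8U, 4),
}
print(ocv_types)  # {'CV_8UC1': 0, 'CV_8UC3': 16, 'CV_8UC4': 24}
```

This is why `CV_8U` and `CV_8UC1` share the value 0: single-channel 8-bit contributes nothing above the depth bits.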
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150375543 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,239 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Defines the image schema and methods to read and manipulate images. + */ +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val ocvTypes: Map[String, Int] = Map( --- End diff -- Similarly, ``` (Scala-specific) OpenCV type mapping supported ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19690: [SPARK-22467]Added a switch to support whether `stdout_s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19690 **[Test build #83716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83716/testReport)** for PR 19690 at commit [`37fba60`](https://github.com/apache/spark/commit/37fba60570925be29388a828e9f66caf545ec39e).
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19077 **[Test build #83715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83715/testReport)** for PR 19077 at commit [`967c73a`](https://github.com/apache/spark/commit/967c73a2462f0948029a37a84f6912983f03c126).
[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19714 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83714/ Test PASSed.
[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19714 Merged build finished. Test PASSed.
[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19714 **[Test build #83714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83714/testReport)** for PR 19714 at commit [`c78951b`](https://github.com/apache/spark/commit/c78951b818128024da351412f8a867a66bc2c441). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r150373516 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) instr.logParams(numFolds, seed, parallelism) logTuningParams(instr) +val collectSubModelsParam = $(collectSubModels) + +var subModels: Option[Array[Array[Model[_]]]] = if (collectSubModelsParam) { --- End diff -- Sorry I didn't follow up on this before. I think that @WeichenXu123 's argument is valid, but please say if there are issues I'm missing @holdenk
[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19689 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83711/ Test PASSed.
[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19689 Merged build finished. Test PASSed.
[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19689 **[Test build #83711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83711/testReport)** for PR 19689 at commit [`20dc217`](https://github.com/apache/spark/commit/20dc21741a5219dd19bb44806e640b7ea4eed012). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83710/ Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83710/testReport)** for PR 19651 at commit [`4b09faa`](https://github.com/apache/spark/commit/4b09faa5538a0949494693d02b8b95065a616a8c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150369899 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory} +import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types._ + +/** + * Utility functions to convert Spark data source filters to ORC filters. + */ +private[orc] object OrcFilters { --- End diff -- Yep. [This](https://github.com/apache/spark/pull/18953/files#diff-6cac9bc2656e3782b0312dceb8c55d47) will show you that explicitly. Mainly, it uses new APIs. 
``` -Some(builder.startAnd().equals(attribute, value).end()) +val castedValue = castLiteralValue(value, dataTypeMap(attribute)) +Some(builder.startAnd().equals(attribute, getType(attribute), castedValue).end()) ```
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Merged build finished. Test FAILed.
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150368818 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { conf += resultSet.getString(1) -> resultSet.getString(2) } - assert(conf.get("spark.sql.hive.metastore.version") === Some("1.2.1")) + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) +} + } + + test("Checks Hive version via SET") { --- End diff -- it might be a session-level conf
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83713/ Test FAILed.
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83713/testReport)** for PR 19719 at commit [`42d05a7`](https://github.com/apache/spark/commit/42d05a7f052d9a4f53030b35799c2ac479fb68f0). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150368744 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. + val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version") --- End diff -- or maybe a more meaningful name
[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19714 **[Test build #83714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83714/testReport)** for PR 19714 at commit [`c78951b`](https://github.com/apache/spark/commit/c78951b818128024da351412f8a867a66bc2c441).
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150368620 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala --- @@ -55,6 +55,7 @@ private[hive] object SparkSQLEnv extends Logging { metadataHive.setOut(new PrintStream(System.out, true, "UTF-8")) metadataHive.setInfo(new PrintStream(System.err, true, "UTF-8")) metadataHive.setError(new PrintStream(System.err, true, "UTF-8")) + sparkSession.conf.set(HiveUtils.HIVE_EXECUTION_VERSION, HiveUtils.builtinHiveVersion) --- End diff -- HiveUtils.builtinHiveVersion => HiveUtils.HIVE_EXECUTION_VERSION?
[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Merged build finished. Test FAILed.
[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83709/ Test FAILed.
[GitHub] spark issue #19630: wip: [SPARK-22409] Introduce function type argument in p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83709/testReport)** for PR 19630 at commit [`94cded6`](https://github.com/apache/spark/commit/94cded64333956933f82fdc298d486a9aa49347c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83713/testReport)** for PR 19719 at commit [`42d05a7`](https://github.com/apache/spark/commit/42d05a7f052d9a4f53030b35799c2ac479fb68f0).
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r150368276 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -154,12 +158,12 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { // --- BroadcastHashJoin case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) -if canBuildRight(joinType) && canBroadcast(right) => +if canBuildRight(joinType) && canBroadcast(right, left.stats.hints.broadcast) => --- End diff -- @mgaido91 Thanks for your suggestion.
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150368111 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. --- End diff -- if Thrift Server only works for 1.2.1, I guess we don't need an isolated classloader for its hive client like SparkSQLCLIDriver
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150368108 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala --- @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.common.`type`.HiveDecimal +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.datasources.orc.OrcUtils.getTypeDescription +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcSerializer(dataSchema: StructType) { + + private[this] lazy val orcStruct: OrcStruct = +createOrcValue(dataSchema).asInstanceOf[OrcStruct] + + private[this] val writableWrappers = +dataSchema.fields.map(f => getWritableWrapper(f.dataType)) + + def serialize(row: InternalRow): OrcStruct = { +convertInternalRowToOrcStruct(row, dataSchema, Some(writableWrappers), Some(orcStruct)) + } + + /** + * Return a Orc value object for the given Spark schema. + */ + private[this] def createOrcValue(dataType: DataType) = +OrcStruct.createValue(getTypeDescription(dataType)) + + /** + * Convert Apache Spark InternalRow to Apache ORC OrcStruct. 
+ */ + private[this] def convertInternalRowToOrcStruct( + row: InternalRow, + schema: StructType, + valueWrappers: Option[Seq[Any => Any]] = None, + struct: Option[OrcStruct] = None): OrcStruct = { +val wrappers = + valueWrappers.getOrElse(schema.fields.map(_.dataType).map(getWritableWrapper).toSeq) +val orcStruct = struct.getOrElse(createOrcValue(schema).asInstanceOf[OrcStruct]) + +for (schemaIndex <- 0 until schema.length) { + val fieldType = schema(schemaIndex).dataType + if (row.isNullAt(schemaIndex)) { +orcStruct.setFieldValue(schemaIndex, null) + } else { +val field = row.get(schemaIndex, fieldType) +val fieldValue = wrappers(schemaIndex)(field).asInstanceOf[WritableComparable[_]] +orcStruct.setFieldValue(schemaIndex, fieldValue) + } +} +orcStruct + } + + private[this] def withNullSafe(f: Any => Any): Any => Any = { +input => if (input == null) null else f(input) + } + + /** + * Builds a WritableComparable-return function ahead of time according to DataType + * to avoid pattern matching and branching costs per row. + */ + private[this] def getWritableWrapper(dataType: DataType): Any => Any = dataType match { +case NullType => _ => null + +case BooleanType => withNullSafe(o => new BooleanWritable(o.asInstanceOf[Boolean])) + +case ByteType => withNullSafe(o => new ByteWritable(o.asInstanceOf[Byte])) --- End diff -- we can apply the same technology: pass `SpecializedGetters` as a parameter to avoid boxing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19439 Thanks for the updates! My only remaining comments are about: * Default arguments for readImages in Scala not being Java-friendly (I'd still recommend taking the easy route by having 1 simple method just taking a path + 1 complex method taking all arguments.) * Determinism for sampling (commented above)
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Merged build finished. Test FAILed.
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83712/testReport)** for PR 19719 at commit [`28fab02`](https://github.com/apache/spark/commit/28fab02358120f396f3272ca66ae3ac8b0534e4b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150367926 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import java.io.IOException + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.orc.{OrcFile, TypeDescription} + +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.internal.Logging +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.analysis.Resolver +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.types._ + +object OrcUtils extends Logging { + + // The extensions for ORC compression codecs + val extensionsForCompressionCodecNames = Map( +"NONE" -> "", +"SNAPPY" -> ".snappy", +"ZLIB" -> ".zlib", +"LZO" -> ".lzo") + + def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = { +val origPath = new Path(pathStr) +val fs = origPath.getFileSystem(conf) +val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath) + .filterNot(_.isDirectory) + .map(_.getPath) + .filterNot(_.getName.startsWith("_")) + .filterNot(_.getName.startsWith(".")) +paths + } + + private[orc] def readSchema(file: Path, conf: Configuration): Option[TypeDescription] = { +try { + val fs = file.getFileSystem(conf) + val readerOptions = OrcFile.readerOptions(conf).filesystem(fs) + val reader = OrcFile.createReader(file, readerOptions) + val schema = reader.getSchema + if (schema.getFieldNames.size == 0) { +None + } else { +Some(schema) + } +} catch { + case _: IOException => None +} + } + + private[orc] def readSchema(sparkSession: SparkSession, files: Seq[FileStatus]) + : Option[StructType] = { +val conf = sparkSession.sessionState.newHadoopConf() +files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { schema => + logDebug(s"Reading schema from file $files, got Hive schema string: $schema") + CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType] +} + } + + private[orc] def getSchemaString(schema: 
StructType): String = { +schema.fields.map(f => s"${f.name}:${f.dataType.catalogString}").mkString("struct<", ",", ">") + } + + private[orc] def getTypeDescription(dataType: DataType) = dataType match { +case st: StructType => TypeDescription.fromString(getSchemaString(st)) +case _ => TypeDescription.fromString(dataType.catalogString) + } + + /** + * Return a missing schema in a give ORC file. + */ + private[orc] def getMissingSchema( --- End diff -- will we need the schema in the future? If not, I'd like to change it to return `Seq[String]`
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83712/ Test FAILed.
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150367867 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala --- @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.common.`type`.HiveDecimal +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.datasources.orc.OrcUtils.getTypeDescription +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcSerializer(dataSchema: StructType) { + + private[this] lazy val orcStruct: OrcStruct = +createOrcValue(dataSchema).asInstanceOf[OrcStruct] + + private[this] val writableWrappers = +dataSchema.fields.map(f => getWritableWrapper(f.dataType)) + + def serialize(row: InternalRow): OrcStruct = { +convertInternalRowToOrcStruct(row, dataSchema, Some(writableWrappers), Some(orcStruct)) + } + + /** + * Return a Orc value object for the given Spark schema. + */ + private[this] def createOrcValue(dataType: DataType) = +OrcStruct.createValue(getTypeDescription(dataType)) + + /** + * Convert Apache Spark InternalRow to Apache ORC OrcStruct. + */ + private[this] def convertInternalRowToOrcStruct( --- End diff -- The old `OrcSerializer` doesn't have such a method, and caches the `OrcStruct` for better performance. Can we try to follow it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
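The caching the old `OrcSerializer` does — one mutable output object allocated up front and overwritten for every row — is a general allocation-avoidance pattern. A hypothetical sketch of it (not Spark code), including its main caveat: the returned buffer is only valid until the next call.

```python
# Reuse a single output buffer across serialize() calls instead of
# allocating a fresh record per row. This trades per-row allocation
# for an aliasing hazard the caller must respect.

class ReusingSerializer:
    def __init__(self, num_fields):
        # One output buffer, created once and shared by every call.
        self._out = [None] * num_fields

    def serialize(self, row):
        # Overwrite fields in place; no new list is allocated here.
        for i, value in enumerate(row):
            self._out[i] = value
        return self._out  # caller must consume this before the next call

s = ReusingSerializer(2)
first = s.serialize([1, "a"])
```

This is safe in a writer's hot loop where each row is consumed before the next `serialize`, which is exactly the situation the review comment is pointing at.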
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150367580 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.orc + +import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory} +import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder +import org.apache.orc.storage.serde2.io.HiveDecimalWritable + +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types._ + +/** + * Utility functions to convert Spark data source filters to ORC filters. + */ +private[orc] object OrcFilters { --- End diff -- This class looks very similar to the one in the hive module, what's the difference?
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150367382 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,206 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +maybeMissingSchema: Option[StructType]) { + + private[this] val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val unwrappers = requiredSchema.fields.map(f => unwrapperFor(f.dataType)) + + def deserialize(writable: OrcStruct): InternalRow = { +convertOrcStructToInternalRow(writable, dataSchema, requiredSchema, + maybeMissingSchema, Some(unwrappers), Some(mutableRow)) + } + + /** + * Convert Apache ORC OrcStruct to Apache Spark InternalRow. + * If internalRow is not None, fill into it. Otherwise, create a SpecificInternalRow and use it. 
+ */ + private[this] def convertOrcStructToInternalRow( + orcStruct: OrcStruct, + dataSchema: StructType, + requiredSchema: StructType, + missingSchema: Option[StructType] = None, + valueUnwrappers: Option[Seq[(Any, InternalRow, Int) => Unit]] = None, + internalRow: Option[InternalRow] = None): InternalRow = { +val mutableRow = internalRow.getOrElse(new SpecificInternalRow(requiredSchema.map(_.dataType))) +val unwrappers = + valueUnwrappers.getOrElse(requiredSchema.fields.map(_.dataType).map(unwrapperFor).toSeq) +var i = 0 +val len = requiredSchema.length +val names = orcStruct.getSchema.getFieldNames +while (i < len) { + val name = requiredSchema(i).name + val writable = if (missingSchema.isEmpty || missingSchema.get.getFieldIndex(name).isEmpty) { --- End diff -- Can we strictly follow the style in `OrcFileFormat.unwrapOrcStructs`? i.e. no method like `convertOrcStructToInternalRow`, top-level columns and struct fields are handled with different while loops. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150367190 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + /** + * Sets the spark recursive flag and then restores it. 
+ * + * @param value Value to set + * @param spark Existing spark session + * @param f The function to evaluate after setting the flag + * @return Returns the evaluation result T of the function + */ + def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) +hadoopConf.set(flagName, value.toString) +try f finally { + old match { +case Some(v) => hadoopConf.set(flagName, v) +case None => hadoopConf.unset(flagName) + } +} + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { --- End diff -- I would prefer determinism since that's a pretty important standard in Spark. I could imagine either (a) using a file hash with a *global* random number or (b) using random numbers if we are certain about how PathFilters work. For (a): * Why is there a worry about duplicate filenames? Is the full path not available? * If you do hash filenames, then I wouldn't generate a random number for each row. (If you're generating a random number per row, then why not just use that for sampling and skip the hash?) You could generate a single random number on the driver and use that in the comparison with each hash. For (b): * If we knew how PathFilters were consumed, then we could presumably figure out a way to make this deterministic just by setting a random seed here. E.g., if a new PathFilter instance were instantiated to read each partition, then that would work. But if PathFilters are shared across reads of multiple partitions, then partition ordering could cause problems with determinism. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
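Option (a) above — hash the full path and compare it against a single random number drawn once on the driver — could be sketched as below. This is an illustrative stand-in, not the PR's `SamplePathFilter`; the MD5 choice and the bucket arithmetic are assumptions:

```python
# Deterministic path sampling: each file's keep/drop decision depends only
# on (full path, seed), so the same inputs always select the same files,
# regardless of partition ordering or how many times paths are re-scanned.
import hashlib
import random

def sample_paths(paths, fraction, seed):
    rng = random.Random(seed)
    offset = rng.random()  # one global draw on the driver, not one per file

    def keep(path):
        # Hash the full path (not just the basename, avoiding collisions
        # between identically named files in different directories).
        h = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        bucket = (h / float(1 << 128) + offset) % 1.0  # uniform-ish in [0, 1)
        return bucket < fraction

    return [p for p in paths if keep(p)]
```

Because the decision is a pure function of path and seed, it is stable across partitions and re-reads, which is the determinism property being asked for; the `offset` term keeps different seeds from always selecting the same "low hash" files.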
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r150367100 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,206 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +maybeMissingSchema: Option[StructType]) { + + private[this] val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val unwrappers = requiredSchema.fields.map(f => unwrapperFor(f.dataType)) + + def deserialize(writable: OrcStruct): InternalRow = { +convertOrcStructToInternalRow(writable, dataSchema, requiredSchema, + maybeMissingSchema, Some(unwrappers), Some(mutableRow)) + } + + /** + * Convert Apache ORC OrcStruct to Apache Spark InternalRow. + * If internalRow is not None, fill into it. Otherwise, create a SpecificInternalRow and use it. 
+ */ + private[this] def convertOrcStructToInternalRow( + orcStruct: OrcStruct, + dataSchema: StructType, + requiredSchema: StructType, + missingSchema: Option[StructType] = None, + valueUnwrappers: Option[Seq[(Any, InternalRow, Int) => Unit]] = None, + internalRow: Option[InternalRow] = None): InternalRow = { +val mutableRow = internalRow.getOrElse(new SpecificInternalRow(requiredSchema.map(_.dataType))) +val unwrappers = + valueUnwrappers.getOrElse(requiredSchema.fields.map(_.dataType).map(unwrapperFor).toSeq) +var i = 0 +val len = requiredSchema.length +val names = orcStruct.getSchema.getFieldNames +while (i < len) { + val name = requiredSchema(i).name + val writable = if (missingSchema.isEmpty || missingSchema.get.getFieldIndex(name).isEmpty) { --- End diff -- I think this check is only needed for top-level columns. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19640: [SPARK-16986][CORE][WEB-UI] Support configure history se...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19640 I think it makes sense, but the implementation is too hacky. We should extract the epoch time and convert it to a timestamp string in the local timezone. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
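The conversion cloud-fan describes (epoch value in, local-timezone timestamp string out) can be sketched in plain Java with `java.time`; this is a minimal illustration of the idea, not the actual patch, and the class and method names are hypothetical:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class EpochToLocal {
    // Render an epoch in milliseconds as a timestamp string in the given timezone.
    static String format(long epochMillis, ZoneId zone) {
        ZonedDateTime zdt = Instant.ofEpochMilli(epochMillis).atZone(zone);
        return zdt.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    }

    public static void main(String[] args) {
        // 2017-11-10T00:00:00Z rendered in UTC; pass ZoneId.systemDefault()
        // to get the "local timezone" behavior discussed in the comment.
        System.out.println(format(1510272000000L, ZoneId.of("UTC"))); // 2017-11-10 00:00:00
    }
}
```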
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150366571 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. + val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version") --- End diff -- This config is kind of a fake one; I think comments are enough for it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150366512 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. --- End diff -- It's ODBC. Actually, it's out of Spark's control, so we must be careful to avoid behavior changes; for example, `conf.get("spark.sql.hive.version")` should still return `1.2.1` instead of null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83712/testReport)** for PR 19719 at commit [`28fab02`](https://github.com/apache/spark/commit/28fab02358120f396f3272ca66ae3ac8b0534e4b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150366200 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { conf += resultSet.getString(1) -> resultSet.getString(2) } - assert(conf.get("spark.sql.hive.metastore.version") === Some("1.2.1")) + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) +} + } + + test("Checks Hive version via SET") { +withJdbcStatement() { statement => + val resultSet = statement.executeQuery("SET") + + val conf = mutable.Map.empty[String, String] + while (resultSet.next()) { +conf += resultSet.getString(1) -> resultSet.getString(2) + } + + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) --- End diff -- In Hive, "SET" returns all changed properties while "SET -v" returns all properties. In Spark, `SET` queries all key-value pairs that are **set** in the SQLConf of the sparkSession. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
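The `SET` vs. `SET -v` distinction yaooqinn describes can be sketched with two plain maps: one of built-in defaults and one of explicitly changed entries. This is an illustrative model only (the class and key names are made up for the example), not Spark's actual `SQLConf` implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {
    final Map<String, String> defaults = new HashMap<>(); // built-in defaults
    final Map<String, String> changed = new HashMap<>();  // explicitly set entries

    // Like Hive's SET: only properties that were explicitly changed.
    Map<String, String> set() {
        return new HashMap<>(changed);
    }

    // Like Hive's SET -v: all properties, with overrides applied on top of defaults.
    Map<String, String> setV() {
        Map<String, String> all = new HashMap<>(defaults);
        all.putAll(changed);
        return all;
    }

    public static void main(String[] args) {
        ConfDemo conf = new ConfDemo();
        conf.defaults.put("spark.sql.hive.version", "1.2.1");
        // A default-only config is invisible to SET but visible to SET -v,
        // which is why the test above hinges on whether the key was set.
        System.out.println(conf.set().containsKey("spark.sql.hive.version")); // false
        System.out.println(conf.setV().get("spark.sql.hive.version"));        // 1.2.1
    }
}
```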
[GitHub] spark issue #19640: [SPARK-16986][CORE][WEB-UI] Support configure history se...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19640 @cloud-fan For the UI part, how about this PR: https://github.com/apache/spark/pull/14577 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19719 Making it an AlternativeConfig may be better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150364666 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. --- End diff -- I guess the JDBC driver does not depend on it. It used to only return `1.2.1`, but I think this server may work fine with other Hive versions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150364544 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -66,6 +66,12 @@ private[spark] object HiveUtils extends Logging { .stringConf .createWithDefault(builtinHiveVersion) + // A deprecated config which is only used to provide a default value, in case some existing + // applications depend on this config, e.g. Spark SQL ODBC driver. + val HIVE_EXECUTION_VERSION = buildConf("spark.sql.hive.version") --- End diff -- A doc entry may be needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18134 Yeah, you always can implement such a UDF. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
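The day-of-week logic such a UDF would need can be sketched in plain Java. This is a hypothetical helper for illustration (class and method names are invented here); `DAYOFWEEK` follows the Hive convention 1 = Sunday through 7 = Saturday, while `java.time` numbers Monday as 1 through Sunday as 7, so the value has to be shifted:

```java
import java.time.LocalDate;

public class DayOfWeekUdf {
    // Hive/Spark DAYOFWEEK convention: 1 = Sunday, 2 = Monday, ..., 7 = Saturday.
    // java.time.DayOfWeek uses 1 = Monday ... 7 = Sunday, so remap with a modulo.
    static int dayofweek(String date) {
        int isoDay = LocalDate.parse(date).getDayOfWeek().getValue();
        return isoDay % 7 + 1;
    }

    public static void main(String[] args) {
        System.out.println(dayofweek("2007-02-03")); // a Saturday -> 7
        System.out.println(dayofweek("2009-07-30")); // a Thursday -> 5
    }
}
```

Registering this logic as a temporary UDF in the user's Spark session would avoid rebuilding Spark with the patch; the exact registration call depends on the Spark version in use.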
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user sergiobilello-eb commented on the issue: https://github.com/apache/spark/pull/18134 Thanks @gatorsmile! Do you suggest any workaround until then? I mean, without rebuilding Spark with that patch. Can I register my own UDF that contains that logic? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18134 https://issues.apache.org/jira/browse/SPARK-20909 This is not part of 2.2. Based on the JIRA, it will be included in 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19689 **[Test build #83711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83711/testReport)** for PR 19689 at commit [`20dc217`](https://github.com/apache/spark/commit/20dc21741a5219dd19bb44806e640b7ea4eed012). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user sergiobilello-eb commented on the issue: https://github.com/apache/spark/pull/18134

```
spark-sql> select dayofweek('2007-02-03'), dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15');
17/11/10 16:08:23 INFO SparkSqlParser: Parsing command: select dayofweek('2007-02-03'), dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15')
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_database: default
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_database: default
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_database: default
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_database: default
17/11/10 16:08:23 INFO HiveMetaStore: 0: get_function: default.dayofweek
17/11/10 16:08:23 INFO audit: ugi=sergio.bilello ip=unknown-ip-addr cmd=get_function: default.dayofweek
Error in query: Undefined function: 'dayofweek'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user sergiobilello-eb commented on the issue: https://github.com/apache/spark/pull/18134 thanks @gatorsmile :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150363309 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { conf += resultSet.getString(1) -> resultSet.getString(2) } - assert(conf.get("spark.sql.hive.metastore.version") === Some("1.2.1")) + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) +} + } + + test("Checks Hive version via SET") { +withJdbcStatement() { statement => + val resultSet = statement.executeQuery("SET") + + val conf = mutable.Map.empty[String, String] + while (resultSet.next()) { +conf += resultSet.getString(1) -> resultSet.getString(2) + } + + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) --- End diff -- Nope, I think. Try it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18134 @sergiobilello-eb This is just a SQL function. You can call it in SQL interface or using the `df.select(expr("dayofweek('2009-07-30')"))`. It is not part of the DataFrame functions. You can submit a PR or report it as an issue to add such an API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #83710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83710/testReport)** for PR 19651 at commit [`4b09faa`](https://github.com/apache/spark/commit/4b09faa5538a0949494693d02b8b95065a616a8c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18134: [SPARK-20909][SQL] Add build-int SQL function - DAYOFWEE...
Github user sergiobilello-eb commented on the issue: https://github.com/apache/spark/pull/18134 This function is not reported in the API documentation: https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/functions.html or https://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/sql/functions.html. How is that possible? I am trying to use it from spark-sql on a Spark 2.1.0 cluster. Thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150362312 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { conf += resultSet.getString(1) -> resultSet.getString(2) } - assert(conf.get("spark.sql.hive.metastore.version") === Some("1.2.1")) + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) +} + } + + test("Checks Hive version via SET") { +withJdbcStatement() { statement => + val resultSet = statement.executeQuery("SET") + + val conf = mutable.Map.empty[String, String] + while (resultSet.next()) { +conf += resultSet.getString(1) -> resultSet.getString(2) + } + + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) --- End diff -- a default value will be returned, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r150361985 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import scala.collection.JavaConverters._ + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes: Map[String, Int] = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 + ) + + /** + * Used for conversion to python + */ + val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image 
+ */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * Default values for the invalid image + * + * @param origin Origin of the invalid image + * @return Row with the default values + */ + private def invalidImageRow(origin: String): Row = +Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0))) + + /** + * Convert the compressed image (jpeg, png, etc.) into OpenCV + * representation and store it in DataFrame Row + * + * @param origin Arbitrary string that identifies the image + * @param bytes Image bytes (for example, jpeg) + * @return DataFrame Row or None (if the decompression fails) + */ + private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = { + +
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19601 @cloud-fan could you please review this again? Now, this PR does not apply any change to `ColumnVector` and `WritableColumnVector`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19077 According to [this material](http://s3-eu-west-1.amazonaws.com/presentations2013/10_presentation.pdf#pages=42), `Unsafe.allocateMemory` allocates on an 8-byte boundary. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
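The 8-byte alignment kiszk mentions amounts to rounding a requested size up to the next multiple of 8. A minimal sketch of that arithmetic (the helper name is made up for illustration; the allocator performs this internally):

```java
public class Align {
    // Round a requested size up to the next 8-byte boundary.
    // (x + 7) clears any remainder when the low three bits are masked off.
    static long roundUp(long size) {
        return (size + 7) & ~7L;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(13)); // 16
        System.out.println(roundUp(16)); // 16: already aligned sizes are unchanged
    }
}
```

This is why reuse of a pooled allocation can be safe even when the requested sizes differ slightly: two requests that round up to the same aligned size occupy identically sized blocks.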
[GitHub] spark pull request #19407: [SPARK-21667][Streaming] ConsoleSink should not f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19407 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19719: [SPARK-22487][SQL][followup] still keep spark.sql...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19719#discussion_r150358375 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala --- @@ -521,7 +521,20 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { conf += resultSet.getString(1) -> resultSet.getString(2) } - assert(conf.get("spark.sql.hive.metastore.version") === Some("1.2.1")) + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) +} + } + + test("Checks Hive version via SET") { +withJdbcStatement() { statement => + val resultSet = statement.executeQuery("SET") + + val conf = mutable.Map.empty[String, String] + while (resultSet.next()) { +conf += resultSet.getString(1) -> resultSet.getString(2) + } + + assert(conf.get("spark.sql.hive.version") === Some("1.2.1")) --- End diff -- If we do not set it explicitly, this will not be returned. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19407 Thanks! Merging to master and 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19407 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83704/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19407: [SPARK-21667][Streaming] ConsoleSink should not fail str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19407 **[Test build #83704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83704/testReport)** for PR 19407 at commit [`788fbf3`](https://github.com/apache/spark/commit/788fbf309261f1b003d5047ad4c86039de2fe16e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83703/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #83703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83703/testReport)** for PR 19459 at commit [`6c72e37`](https://github.com/apache/spark/commit/6c72e37b0ca520d2756722ce2f18fae3ea32c39e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83705/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83705/testReport)** for PR 19719 at commit [`7b767bf`](https://github.com/apache/spark/commit/7b767bfb010f7daeecbc8475c92bc1137e7e019a). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83708/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19719 **[Test build #83708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83708/testReport)** for PR 19719 at commit [`0ce241d`](https://github.com/apache/spark/commit/0ce241da78c9a59fe255a76a6271313b5bc7d039). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19719: [SPARK-22487][SQL][followup] still keep spark.sql.hive.v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19719 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19720 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19720 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83707/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19720: [SPARK-22494][SQL] Fix 64KB limit exception with Coalesc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19720 **[Test build #83707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83707/testReport)** for PR 19720 at commit [`f0edd7e`](https://github.com/apache/spark/commit/f0edd7e077c84b6a890f7ed9cff2eefadf5eee33). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r150349566 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging { } /** + * Find the best split for a node. + * + * @param binAggregates Bin statistics. + * @return tuple for best split: (Split, information gain, prediction at node) + */ + private[tree] def binsToBestSplit( + binAggregates: DTStatsAggregator, + splits: Array[Array[Split]], + featuresForNode: Option[Array[Int]], + node: LearningNode): (Split, ImpurityStats) = { +val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode) +// For each (feature, split), calculate the gain, and select the best (feature, split). +val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator) --- End diff -- Note to check: Will node.stats == null for the top level for sure? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r150159113 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala --- @@ -0,0 +1,215 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.tree.impl + +import org.apache.spark.ml.tree.{CategoricalSplit, Split} +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator +import org.apache.spark.mllib.tree.model.ImpurityStats + +/** Utility methods for choosing splits during local & distributed tree training. */ +private[impl] object SplitUtils { + + /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */ + private def sortByCentroid( + binAggregates: DTStatsAggregator, + featureIndex: Int, + featureIndexIdx: Int): List[Int] = { +/* Each bin is one category (feature value). + * The bins are ordered based on centroidForCategories, and this ordering determines which + * splits are considered. (With K categories, we consider K - 1 possible splits.) 
+ * + * centroidForCategories is a list: (category, centroid) + */ +val numCategories = binAggregates.metadata.numBins(featureIndex) +val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx) + +val centroidForCategories = Range(0, numCategories).map { featureValue => + val categoryStats = +binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue) + val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats) + (featureValue, centroid) +} +// TODO(smurching): How to handle logging statements like these? +// logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(",")) +// bins sorted by centroids +val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1) +// logDebug("Sorted centroids for categorical variable = " + +// categoriesSortedByCentroid.mkString(",")) +categoriesSortedByCentroid + } + + /** + * Find the best split for an unordered categorical feature at a single node. + * + * Algorithm: + * - Considers all possible subsets (exponentially many) + * + * @param featureIndex Global index of feature being split. + * @param featureIndexIdx Index of feature being split within subset of features for current node. + * @param featureSplits Array of splits for the current feature + * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node + * @return (best split, statistics for split) If no valid split was found, the returned + * ImpurityStats instance will be invalid (have member valid = false). 
+ */ + private[impl] def chooseUnorderedCategoricalSplit( + binAggregates: DTStatsAggregator, + featureIndex: Int, + featureIndexIdx: Int, + featureSplits: Array[Split], + parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = { +// Unordered categorical feature +val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx) +val numSplits = binAggregates.metadata.numSplits(featureIndex) +var parentCalc = parentCalculator +val (bestFeatureSplitIndex, bestFeatureGainStats) = + Range(0, numSplits).map { splitIndex => +val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex) +val rightChildStats = binAggregates.getParentImpurityCalculator() + .subtract(leftChildStats) +val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc, + leftChildStats, rightChildStats, binAggregates.metadata) +// Compute parent stats once, when considering first split for current feature +if
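The `sortByCentroid` helper quoted above uses the classic trick of ordering the K categories by their label centroid so that only the K - 1 prefix splits of that ordering need to be scanned, rather than all exponentially many subsets. A hedged Python sketch of just the ordering step; the data layout (a mapping from category to observed labels) and function names are assumptions for illustration, not Spark's representation:

```python
# Hypothetical sketch of centroid-based category ordering. Each category
# maps to the list of labels seen with it; the centroid is the label mean
# (e.g. the positive-class rate for binary classification).

def sort_categories_by_centroid(labels_by_category):
    """Return categories ordered by label centroid. Candidate splits are
    then the K - 1 prefixes of this ordering."""
    def centroid(labels):
        return sum(labels) / len(labels) if labels else 0.0
    return sorted(labels_by_category,
                  key=lambda cat: centroid(labels_by_category[cat]))

def prefix_splits(sorted_categories):
    """The K - 1 candidate left-child category subsets implied by the
    centroid ordering (with K categories, K - 1 possible splits)."""
    return [set(sorted_categories[:i])
            for i in range(1, len(sorted_categories))]
```

With three categories whose centroids are 1.0, 0.0, and 0.5, the ordering is by ascending centroid and the two candidate left subsets are the first one and the first two categories of that ordering.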
[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/19433#discussion_r150158027 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging { } /** - * Calculate the impurity statistics for a given (feature, split) based upon left/right - * aggregates. - * - * @param stats the recycle impurity statistics for this feature's all splits, - * only 'impurity' and 'impurityCalculator' are valid between each iteration - * @param leftImpurityCalculator left node aggregates for this (feature, split) - * @param rightImpurityCalculator right node aggregate for this (feature, split) - * @param metadata learning and dataset metadata for DecisionTree - * @return Impurity statistics for this (feature, split) + * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global + * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the + * list of features for a given node. Filters out constant features (features with 0 splits) */ - private def calculateImpurityStats( - stats: ImpurityStats, - leftImpurityCalculator: ImpurityCalculator, - rightImpurityCalculator: ImpurityCalculator, - metadata: DecisionTreeMetadata): ImpurityStats = { - -val parentImpurityCalculator: ImpurityCalculator = if (stats == null) { - leftImpurityCalculator.copy.add(rightImpurityCalculator) -} else { - stats.impurityCalculator -} - -val impurity: Double = if (stats == null) { - parentImpurityCalculator.calculate() -} else { - stats.impurity -} - -val leftCount = leftImpurityCalculator.count -val rightCount = rightImpurityCalculator.count - -val totalCount = leftCount + rightCount - -// If left child or right child doesn't satisfy minimum instances per node, -// then this split is invalid, return invalid information gain stats. 
-if ((leftCount < metadata.minInstancesPerNode) || - (rightCount < metadata.minInstancesPerNode)) { - return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator) -} - -val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0 -val rightImpurity = rightImpurityCalculator.calculate() - -val leftWeight = leftCount / totalCount.toDouble -val rightWeight = rightCount / totalCount.toDouble - -val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity - -// if information gain doesn't satisfy minimum information gain, -// then this split is invalid, return invalid information gain stats. -if (gain < metadata.minInfoGain) { - return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator) + private[impl] def getNonConstantFeatures( + metadata: DecisionTreeMetadata, + featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = { +Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx => --- End diff -- Was there a reason to remove the use of view and withFilter here? With the output of this method going through further Seq operations, I would expect the previous implementation to be more efficient.
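The `calculateImpurityStats` code being removed in this diff rejects a split on two conditions: either child holding fewer than `minInstancesPerNode` rows, or the gain falling below `minInfoGain`. A small Python sketch of those two guards; the standalone function is hypothetical (in Spark this logic is inlined and returns invalid `ImpurityStats` rather than a boolean), though the parameter names mirror the metadata fields:

```python
# Hypothetical mirror of the two validity guards in calculateImpurityStats:
# a split is kept only if both children are large enough and the
# information gain clears the configured floor.

def split_is_valid(left_count, right_count, gain,
                   min_instances_per_node, min_info_gain):
    # Guard 1: each child must satisfy the minimum instance count.
    if (left_count < min_instances_per_node
            or right_count < min_instances_per_node):
        return False
    # Guard 2: the gain must meet the minimum information gain.
    return gain >= min_info_gain
```

Either guard failing corresponds to returning `ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)` in the quoted Scala.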