[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19554 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19554 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82967/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19554 **[Test build #82967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82967/testReport)** for PR 19554 at commit [`def29a1`](https://github.com/apache/spark/commit/def29a10ac71522218913d1af041b885ef8b9312). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146169405 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a> + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- Created a jira to address that issue: https://issues.apache.org/jira/browse/SPARK-22331 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146168734 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) --- End diff -- Personally I never have, but I cannot guarantee it for all locales. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
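To illustrate the locale concern discussed here: the classic pitfall is the Turkish locale, where the dotted capital "I" lowercases to a dotless "ı". A minimal standalone sketch (not from the PR) of why the validator lowercases with `Locale.ROOT` rather than the JVM default locale:

```
import java.util.Locale

object LocaleLowerCaseDemo {
  def main(args: Array[String]): Unit = {
    val turkish = new Locale("tr", "TR")
    // Under the Turkish locale, "I" maps to the dotless "ı", so a user-supplied
    // "HINGE" would no longer match the expected value "hinge".
    println("HINGE".toLowerCase(turkish))      // prints: hınge
    println("HINGE".toLowerCase(Locale.ROOT))  // prints: hinge
  }
}
```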
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146168660 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a> + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- I tend to support case-insensitive params in `LinearRegression`, or change the default behavior of ParamValidators.inArray. And we should improve the consistency in supporting case-insensitive String params anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
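A minimal sketch of what a case-insensitive variant of `ParamValidators.inArray` could look like; the helper name and usage are hypothetical and not part of the PR or of `ParamValidators`:

```
import java.util.Locale

// Hypothetical helper: an inArray-style validator that ignores case by
// normalizing both the allowed values and the user input with Locale.ROOT.
def inArrayIgnoreCase(allowed: Array[String]): String => Boolean = {
  val normalized = allowed.map(_.toLowerCase(Locale.ROOT)).toSet
  (value: String) => normalized.contains(value.toLowerCase(Locale.ROOT))
}

// Usage, mirroring the loss param above (names are illustrative only):
// new Param[String](this, "loss", "...", inArrayIgnoreCase(Array("hinge", "squared_hinge")))
```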
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19439 @thunterdb Thanks for the reply. > It does, indirectly: this is what the field types CV_32FXX do. You need to do some low-level casting to convert the byte array to array of numbers, but that should not be a limitation for most applications. The solution may work, but I don't think it's an ideal one; at least there's some performance concern. Another option is to support byte[], short[], int[], float[] and double[] as data storage type candidates, and switch among them according to CvType (certainly, with more complexity in the code). But as an initial PR, I'm OK with supporting just byte[] and seeing if there's further demand. > Yes, this feature has been debated. Some developers have had a compelling need for directly accessing some information about the origin of the image directly inside the image. > One of the main reasons is MLlib pipelines: transformers/estimators work on a single dataframe column; so it is much easier when "origin" is a part of this column too. Sure, some extra metadata could be useful. Yet I don't see why origin is superior to other metadata like Option[Label] or Option[Roi]. I'm mostly concerned that path info is not really image data in essence, and it may become redundant (if there's a path column), stale (if the environment changes), or not useful (after converting to an OpenCV Mat). @dakirsa, @thunterdb would you please kindly provide an example where "origin" info is indispensable in image transformation? Maybe I missed an important scenario. > Indeed, this feature may not seem that useful at first glance. For some hadoop file systems though, in which images are accessed in a batched manner, it is useful to traverse these batches. This is important for performance reasons. This is why it is marked as experimental for the time being Does the wildcard matching still work if recursive == false? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
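The "low-level casting" mentioned above would look roughly like the following hypothetical sketch (not from the PR), which reinterprets the image's byte[] payload as floats for a CV_32FC* mode; the little-endian byte order is an assumption and would have to match how the bytes were produced:

```
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret a byte[] image payload as an Array[Float] (e.g. for mode CV_32FC3).
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out)
  out
}
```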
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146167919 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) --- End diff -- ditto. IMO these characters are all ASCII, so I think they won't hit any locale issues. (But did you encounter such an issue in some environment?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146167706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a> + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- Yeah, I thought about this, but the `solver` param in `LinearRegression` also ignores this. I tend to keep them consistent; what do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17862 Thanks @WeichenXu123 for the comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146166006 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) + + /* Set of loss function that LinearSVC supports */ + private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE) --- End diff -- Sure. I can update it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146165706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) --- End diff -- To ensure consistency with param validation in all Locales. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r146165449 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a> + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- Correct me if I'm wrong, IMO we need toLowerCase here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r146163447 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, "CV_8UC4" -> 24, +"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, "CV_8SC4" -> 25, +"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, "CV_16UC4" -> 26, +"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, "CV_16SC4" -> 27, +"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, "CV_32SC4" -> 28, +"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, "CV_32FC4" -> 29, +"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, "CV_64FC4" -> 30 + ) + + /** + * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte]) + */ + val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * @return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image + */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the 
image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * :: Experimental :: + * Check if the DataFrame column contains images (i.e. has ImageSchema) + * + * @param df DataFrame + * @param column Column name + * @return True if the given column matches the image schema + */ + def isImageColumn(df: DataFrame, column: String): Boolean = +df.schema(column).dataType == columnSchema + +
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r146163650 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} --- End diff -- The hardcoded `ocvTypes` map exists on both the Scala and Python sides. I think it will cause trouble for future code maintenance, because we always need to make sure they stay the same. Maybe a better way is to load `ocvTypes` from a file, OR to write an extra script that generates the `ocvTypes` code block for Scala/Python? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
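One way to avoid maintaining the table twice is to derive it from OpenCV's CV_MAKETYPE rule, type = depth + ((channels - 1) << 3), and emit the Scala/Python code blocks from a single source. A hypothetical generator sketch (not part of the PR):

```
// Derive the ocvTypes table from the OpenCV depth codes instead of hardcoding it.
val depths = Seq("CV_8U" -> 0, "CV_8S" -> 1, "CV_16U" -> 2, "CV_16S" -> 3,
  "CV_32S" -> 4, "CV_32F" -> 5, "CV_64F" -> 6)
val ocvTypes: Map[String, Int] =
  Map("Undefined" -> -1) ++ depths.flatMap { case (name, depth) =>
    (name -> depth) +: (1 to 4).map(c => s"${name}C$c" -> (depth + ((c - 1) << 3)))
  }
// ocvTypes("CV_8UC3") == 16 and ocvTypes("CV_32FC2") == 13, matching the table above.
```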
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r146164215 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = StructType(StructField("image", StructType([ +StructField(imageFields[0], StringType(), True), +StructField(imageFields[1], IntegerType(), False), +StructField(imageFields[2], IntegerType(), False), +StructField(imageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields[5], BinaryType(), False)]), True)) + + +def toNDArray(image): +""" +Converts an image to a one-dimensional array. + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): +""" +Converts a one-dimensional array to a two-dimensional image. + +:param array (array): The array to convert to image +:param origin (str): Path to the image +:param mode (str): OpenCV compatible type + +:rtype object: Two dimensional image + +.. versionadded:: 2.3.0 +""" --- End diff -- We need to document the channel order requirement for the array passed in, e.g, if use mode "CV_8UC3", then order must be "RGB", is it right ? Or add another param for specify the channel order of the passed in array ? 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r146163724 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = StructType(StructField("image", StructType([ +StructField(imageFields[0], StringType(), True), +StructField(imageFields[1], IntegerType(), False), +StructField(imageFields[2], IntegerType(), False), +StructField(imageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields[4], StringType(), False), --- End diff -- +1. The mode column type is not consistent with scala side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 Documentation removed as per @srowen 's request in the associated JIRA issue [SPARK-22308] --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19529 **[Test build #82969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82969/testReport)** for PR 19529 at commit [`4218b86`](https://github.com/apache/spark/commit/4218b86d5a8ff2321232ff38ed3e1b217ff7db2a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82966/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19506 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19506 **[Test build #82966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/19551#discussion_r146160321 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -499,6 +499,12 @@ test_that("create DataFrame with different data types", { expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE)) }) +test_that("SPARK-17902: collect() with stringsAsFactors enabled", { --- End diff -- Would you please verify that the factor orders are identical? I wonder if `expect_equal` really verifies the order of values in a factor. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/19551#discussion_r146160421 --- Diff: R/pkg/R/DataFrame.R --- @@ -1191,6 +1191,9 @@ setMethod("collect", vec <- do.call(c, col) stopifnot(class(vec) != "list") class(vec) <- PRIMITIVE_TYPES[[colType]] +if (stringsAsFactors && is.character(vec)) { --- End diff -- For performance maybe it is better to reverse the order of checks: `is.character(vec) && stringsAsFactors` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user taroplus commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146160084 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala --- @@ -815,6 +815,12 @@ class JDBCSuite extends SparkFunSuite Some(DecimalType(DecimalType.MAX_PRECISION, 10))) assert(oracleDialect.getCatalystType(java.sql.Types.NUMERIC, "numeric", 0, null) == Some(DecimalType(DecimalType.MAX_PRECISION, 10))) +assert(oracleDialect.getCatalystType(100, "BINARY_FLOAT", 0, null) == --- End diff -- Made them constants defined in OracleDialect --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
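For reference, the "constants defined in OracleDialect" would presumably just give names to the magic numbers from oracle.jdbc.OracleTypes that the dialect and the test both use; the values below are taken from the dialect change quoted later in this thread, and the member names are a hypothetical sketch:

```
// Hypothetical constants mirroring oracle.jdbc.OracleTypes, so tests don't hardcode
// the raw numbers.
private[jdbc] val BINARY_FLOAT = 100   // OracleTypes.BINARY_FLOAT
private[jdbc] val BINARY_DOUBLE = 101  // OracleTypes.BINARY_DOUBLE
private[jdbc] val TIMESTAMPTZ = -101   // Oracle's TIMESTAMP WITH TIME ZONE
```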
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user taroplus commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146160039 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -108,6 +112,10 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo conn.prepareStatement( "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate(); conn.commit(); + + +conn.prepareStatement("CREATE TABLE oracle_types (d BINARY_DOUBLE, f BINARY_FLOAT)").executeUpdate(); --- End diff -- removed trailing semi-colons throughout this file --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user taroplus commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146160051 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -307,4 +315,32 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo assert(values.getInt(1).equals(1)) assert(values.getBoolean(2).equals(false)) } + + test("SPARK-22303: handle BINARY_DOUBLE and BINARY_FLOAT as DoubleType and FloatType") { +val tableName = "oracle_types" +val schema = StructType(Seq( + StructField("d", DoubleType, true), + StructField("f", FloatType, true))) +val props = new Properties() + +// write it back to the table (append mode) +val data = spark.sparkContext.parallelize(Seq(Row(1.1D, 2.2F))) --- End diff -- updated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user taroplus commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146160013 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -50,9 +52,11 @@ import org.apache.spark.tags.DockerTest @DockerTest class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLContext { import testImplicits._ + // To make === between double tolerate inexact values + implicit val doubleEquality = TolerantNumerics.tolerantDoubleEquality(0.01) --- End diff -- removed this line and the test still passes --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19548 **[Test build #82968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82968/testReport)** for PR 19548 at commit [`28c7ce8`](https://github.com/apache/spark/commit/28c7ce8484d2e0182b213ebc63b857f047c06ca6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user taroplus commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146159980 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala --- @@ -28,25 +28,28 @@ private case object OracleDialect extends JdbcDialect { override def getCatalystType( sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = { -if (sqlType == Types.NUMERIC) { - val scale = if (null != md) md.build().getLong("scale") else 0L - size match { -// Handle NUMBER fields that have no precision/scale in special way -// because JDBC ResultSetMetaData converts this to 0 precision and -127 scale -// For more details, please see -// https://github.com/apache/spark/pull/8780#issuecomment-145598968 -// and -// https://github.com/apache/spark/pull/8780#issuecomment-144541760 -case 0 => Option(DecimalType(DecimalType.MAX_PRECISION, 10)) -// Handle FLOAT fields in a special way because JDBC ResultSetMetaData converts -// this to NUMERIC with -127 scale -// Not sure if there is a more robust way to identify the field as a float (or other -// numeric types that do not specify a scale. -case _ if scale == -127L => Option(DecimalType(DecimalType.MAX_PRECISION, 10)) -case _ => None - } -} else { - None +sqlType match { + case Types.NUMERIC => +val scale = if (null != md) md.build().getLong("scale") else 0L +size match { + // Handle NUMBER fields that have no precision/scale in special way + // because JDBC ResultSetMetaData converts this to 0 precision and -127 scale + // For more details, please see + // https://github.com/apache/spark/pull/8780#issuecomment-145598968 + // and + // https://github.com/apache/spark/pull/8780#issuecomment-144541760 + case 0 => Option(DecimalType(DecimalType.MAX_PRECISION, 10)) + // Handle FLOAT fields in a special way because JDBC ResultSetMetaData converts + // this to NUMERIC with -127 scale + // Not sure if there is a more robust way to identify the field as a float (or other + // numeric types that do not specify a scale. + case _ if scale == -127L => Option(DecimalType(DecimalType.MAX_PRECISION, 10)) + case _ => None +} + case -101 => Some(TimestampType) // Value for Timestamp with Time Zone in Oracle + case 100 => Some(FloatType) // Value for OracleTypes.BINARY_FLOAT + case 101 => Some(DoubleType) // Value for OracleTypes.BINARY_DOUBLE --- End diff -- This should match java's double / float definition --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19554 **[Test build #82967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82967/testReport)** for PR 19554 at commit [`def29a1`](https://github.com/apache/spark/commit/def29a10ac71522218913d1af041b885ef8b9312). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 ok to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 Can you please add a tag in PR title `[BACKPORT-2.2]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19554 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19554 ok to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19554: [SPARK-22319][Core] call loginUserFromKeytab befo...
GitHub user sjrand opened a pull request: https://github.com/apache/spark/pull/19554 [SPARK-22319][Core] call loginUserFromKeytab before accessing hdfs ## What changes were proposed in this pull request? In SparkSubmit, call loginUserFromKeytab before attempting to make RPC calls to the NameNode. Same as https://github.com/apache/spark/pull/19540, but for branch-2.2. ## How was this patch tested? Manually tested for master as described in https://github.com/apache/spark/pull/19540. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sjrand/spark SPARK-22319-branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19554.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19554 commit def29a10ac71522218913d1af041b885ef8b9312 Author: Steven Rand Date: 2017-10-23T02:37:19Z call loginUserFromKeytab before accessing hdfs --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user sjrand commented on the issue: https://github.com/apache/spark/pull/19540 Thanks @jerryshao and @jiangxb1987. Created https://github.com/apache/spark/pull/19554 for branch-2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r146157000 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- Is there any reason we use `OnHeapColumnVector` instead of `OffHeapColumnVector`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r146156401 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) +val columnarBatch = new ColumnarBatch( + columnarBatchSchema, columnVectors.asInstanceOf[Array[ColumnVector]], rowCount) +columnarBatch.setNumRows(rowCount) + +for (i <- 0 until attributes.length) { + ColumnAccessor.decompress( +cachedColumnarBatch.buffers(columnIndices(i)), +columnarBatch.column(i).asInstanceOf[WritableColumnVector], +columnarBatchSchema.fields(i).dataType, rowCount) +} +columnarBatch + } + + override def inputRDDs(): Seq[RDD[InternalRow]] = { +if (supportCodegen) { --- End diff -- If `supportCodegen` is false, I think we never call `inputRDDs`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r146155764 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) --- End diff -- Isn't this just `relation.schema`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
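If `relation.schema` is indeed the same as `StructType.fromAttributes(relation.output)`, the check could also read a little more directly with `forall`. A sketch of that simplification, under the assumption that the two schemas are equivalent:

```
override val supportCodegen: Boolean = {
  // Whole-stage codegen only for flat primitive schemas that are not too wide.
  relation.schema.fields.forall(_.dataType match {
    case BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType => true
    case _ => false
  }) && !WholeStageCodegenExec.isTooManyFields(conf, relation.schema)
}
```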
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia I'm OK with either. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19519#discussion_r146154530 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala --- @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy + +import java.lang.reflect.Modifier + +import org.apache.spark.SparkConf + +/** + * Entry point for a Spark application. Implementations must provide a no-argument constructor. + */ +private[spark] trait SparkApplication { + + def start(args: Array[String], conf: SparkConf): Unit + +} + +/** + * Implementation of SparkApplication that wraps a standard Java class with a "main" method. + * + * Configuration is propagated to the application via system properties, so running multiple + * of these in the same JVM may lead to undefined behavior due to configuration leaks. + */ +private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication { + + override def start(args: Array[String], conf: SparkConf): Unit = { +val mainMethod = klass.getMethod("main", new Array[String](0).getClass) +if (!Modifier.isStatic(mainMethod.getModifiers)) { + throw new IllegalStateException("The main method in the given main class must be static") +} + +val sysProps = conf.getAll.toMap +sysProps.foreach { case (k, v) => + sys.props(k) = v +} + +mainMethod.invoke(null, args) + } --- End diff -- @vanzin , do we need to remove all the system properties after `mainMethod` is finished? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
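One possible answer to the question above is to snapshot and restore the system properties around the invocation. This is sketched purely as an illustration; it is not what the PR currently does:

```
// Snapshot sys.props, apply the conf, and restore the snapshot after main() returns,
// so the propagated configuration does not leak to later applications in the same JVM.
val snapshot = sys.props.toMap
try {
  conf.getAll.foreach { case (k, v) => sys.props(k) = v }
  mainMethod.invoke(null, args)
} finally {
  conf.getAll.foreach { case (k, _) =>
    snapshot.get(k) match {
      case Some(old) => sys.props(k) = old
      case None => sys.props.remove(k)
    }
  }
}
```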
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153863 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { + val commonNames = attrsWithSameName.map(_.name).mkString(",") + s"""\n|Attribute(s) `$commonNames` seem to appear in two + |different datasets, with the same name.""" --- End diff -- `\n|Please check attribute(s) `$commonNames`, they seem to appear in two...` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153187 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) --- End diff -- nit: ``` val attrsWithSameName = o.missingInput.filter { missing => o.inputSet.exists(input => resolver(missing.name, input.name)) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153590 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { --- End diff -- `attrsWithSameName.size > 0` -> `attrsWithSameName.nonEmpty` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153288 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { --- End diff -- `repeatedNameHint` doesn't need `missingAttributes` or `availableAttributes`; could you move its definition up, right after `attrsWithSameName`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153739 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { + val commonNames = attrsWithSameName.map(_.name).mkString(",") + s"""\n|Attribute(s) `$commonNames` seem to appear in two + |different datasets, with the same name.""" --- End diff -- The `datasets` concept is a little strange here, would `inputs` or `input operators` be better? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
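Taken together, wzhfy's suggestions would leave the hint block looking roughly like the sketch below (illustrative only, reusing `o` and `plan` from the quoted diff; the final wording is up to the PR author):

```
// nonEmpty instead of size > 0, braces-style filter, repeatedNameHint defined
// right after attrsWithSameName, and "input operators" instead of "datasets".
val resolver = plan.conf.resolver
val attrsWithSameName = o.missingInput.filter { missing =>
  o.inputSet.exists(input => resolver(missing.name, input.name))
}
val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
  val commonNames = attrsWithSameName.map(_.name).mkString(",")
  s"\nPlease check attribute(s) `$commonNames`; they seem to appear in two " +
    "different input operators with the same name."
} else {
  ""
}
val missingAttributes = o.missingInput.mkString(",")
val availableAttributes = o.inputSet.mkString(",")
```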
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146152572 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.internal.SQLConf --- End diff -- unused import --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146153805 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { + val commonNames = attrsWithSameName.map(_.name).mkString(",") + s"""\n|Attribute(s) `$commonNames` seem to appear in two + |different datasets, with the same name.""" --- End diff -- I'm +0 for this additional error msg. I'm not sure if it is actually more clear to end users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19548 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 @sjrand , can you please create another PR against branch-2.2, it is not auto-mergeable, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19548 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82965/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19540: [SPARK-22319][Core] call loginUserFromKeytab befo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19540 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19548 **[Test build #82965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82965/testReport)** for PR 19548 at commit [`91e911d`](https://github.com/apache/spark/commit/91e911d92f96c251de10a86a73f854b4993ede39). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 LGTM, merging to master and branch 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 @BryanCutler @MLnick @WeichenXu123 Thanks for reviewing. Your comments should be all addressed now. Please take a look again when you have more time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r146153067 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,464 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): StructType = { +val
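For context, typical usage of the estimator being reviewed would look roughly like the sketch below, based on the API shown in the diff (the class is still under review, so names and defaults may change before it is merged):

```
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("one-hot-example").getOrCreate()
import spark.implicits._

// A single column of category indices, e.g. produced by StringIndexer.
val df = Seq(0.0, 1.0, 2.0, 2.0).toDF("categoryIndex")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .setDropLast(true) // default; drops the last category to avoid linear dependence

// fit() learns the number of categories per input column,
// transform() emits sparse vectors such as (2,[0],[1.0]).
val model = encoder.fit(df)
model.transform(df).show(false)
```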
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146152918 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = plan.conf.resolver +val attrsWithSameName = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (attrsWithSameName.size > 0) { + val commonNames = attrsWithSameName.map(_.name).mkString(",") + s"""\n|Attribute(s) `$commonNames` seem to appear in two + |different datasets, with the same name.""" +} else { + "" +} failAnalysis( - s"resolved attribute(s) $missingAttributes missing from $input " + -s"in operator ${operator.simpleString}") + s"""Some resolved attribute(s) are not present among the available attributes +|for a query. +|$missingAttributes is not in $availableAttributes.$repeatedNameHint --- End diff -- Move `$repeatedNameHint` in next line as: ```scala |$missingAttributes is not in $availableAttributes. |$repeatedNameHint ``` So you can remove the new line above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146152895 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala --- @@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest { // CheckAnalysis should throw AnalysisException when Aggregate contains missing attribute(s) // Since we manually construct the logical plan at here and Sum only accept // LongType, DoubleType, and DecimalType. We use LongType as the type of a. -val plan = - Aggregate( +val attrA = AttributeReference("a", LongType)(exprId = ExprId(1)) +val aId = Array[String](attrA.name, attrA.exprId.id.toString) +val otherA = AttributeReference("a", LongType)(exprId = ExprId(2)) +val otherAId = Array[String](otherA.name, otherA.exprId.id.toString) +val bAlias = Alias(sum(attrA), "b")() :: Nil +val plan = Aggregate( Nil, -Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), "b")() :: Nil, -LocalRelation( - AttributeReference("a", LongType)(exprId = ExprId(2 +bAlias, +LocalRelation(otherA)) assert(plan.resolved) -assertAnalysisError(plan, "resolved attribute(s) a#1L missing from a#2L" :: Nil) +val errorMsg = s"""Some resolved attribute(s) are not present among the available attributes + |for a query. + |${aId.mkString("#")}L is not in ${otherAId.mkString("#")}L. --- End diff -- Seems this is still unchanged? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146152740 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.internal.SQLConf --- End diff -- No need to import SQLConf now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19506#discussion_r146151411 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala --- @@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals( override def dataType: DataType = ArrayType(LongType) override def prettyName: String = "approx_count_distinct_for_intervals" + + override def serialize(obj: Array[Long]): Array[Byte] = { +val byteArray = new Array[Byte](obj.length * 8) +obj.indices.foreach { i => --- End diff -- Fixed. Thanks for the reminder! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
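For readers following the serialization change discussed here: the underlying idea is just a fixed-width round-trip between `Array[Long]` and `Array[Byte]`, eight bytes per element. A compact way to express it is via `java.nio.ByteBuffer`, as in the sketch below; this illustrates the idea only and is not necessarily the exact code in the PR.

```
import java.nio.ByteBuffer

// Round-trip Array[Long] <-> Array[Byte] using 8 bytes per element.
def serialize(obj: Array[Long]): Array[Byte] = {
  val buffer = ByteBuffer.allocate(obj.length * java.lang.Long.BYTES)
  obj.foreach(v => buffer.putLong(v))
  buffer.array()
}

def deserialize(bytes: Array[Byte]): Array[Long] = {
  val buffer = ByteBuffer.wrap(bytes)
  Array.fill(bytes.length / java.lang.Long.BYTES)(buffer.getLong())
}
```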
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19506 **[Test build #82966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82964/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19451 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19451 **[Test build #82964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82964/testReport)** for PR 19451 at commit [`7030ab6`](https://github.com/apache/spark/commit/7030ab6a1ee4f798a39bc076595eb5c15c99a8ba). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19553#discussion_r146147996 --- Diff: core/src/test/scala/org/apache/spark/util/JavaUtils.scala --- @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import org.mockito.Mockito._ + +import org.apache.spark.SparkFunSuite +import org.apache.spark.api.java.JavaUtils + +class JavaUtils extends SparkFunSuite { + + test("containsKey implementation without iteratively entrySet call") { --- End diff -- You'd be welcome to add a test that the resulting map is serializable as that's the primary job of this method, but that's of course not directly related. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
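A serializability check of the kind suggested above could look roughly like the following sketch. It assumes the existing `JavaUtils.mapAsSerializableJavaMap` helper and an ad-hoc Java-serialization round-trip; the actual test added to the PR may differ.

```
package org.apache.spark.api.java

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

import org.apache.spark.SparkFunSuite

class JavaUtilsSuite extends SparkFunSuite {

  test("wrapped map is serializable and supports containsKey/get") {
    val javaMap = JavaUtils.mapAsSerializableJavaMap(Map("a" -> 1, "b" -> 2))

    // Round-trip through Java serialization.
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(javaMap)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    val restored = in.readObject().asInstanceOf[java.util.Map[String, Int]]

    assert(restored.containsKey("a"))
    assert(!restored.containsKey("missing"))
    assert(restored.get("b") === 2)
  }
}
```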
[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19553#discussion_r146147889 --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala --- @@ -43,6 +43,13 @@ private[spark] object JavaUtils { override def size: Int = underlying.size +// Delegate to implementation because AbstractMap implementation iterates over whole key set +override def containsKey(key: AnyRef): Boolean = try { --- End diff -- This reminds me that this whole class could go away once only Scala 2.12 is supported, but that's a while away. Simpler as this? ``` override def containsKey(key: AnyRef): Boolean = key match { case a: A => underlying.contains(a) case _ => false } ``` if so then get() could be: ``` override def get(key: AnyRef): B = key match { case a: A => underlying.getOrElse(a, null.asInstanceOf[B]) case _ => null.asInstanceOf[B] } ``` What about other methods -- anything worth plumbing directly to the underlying implementation? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19553#discussion_r146147952 --- Diff: core/src/test/scala/org/apache/spark/util/JavaUtils.scala --- @@ -0,0 +1,37 @@ +/* --- End diff -- This class needs to be in the same package as what it tests, and be called `JavaUtilsSuite` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146147276 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -108,6 +112,10 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo conn.prepareStatement( "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate(); conn.commit(); + + +conn.prepareStatement("CREATE TABLE oracle_types (d BINARY_DOUBLE, f BINARY_FLOAT)").executeUpdate(); --- End diff -- No semicolon at the end of lines --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146147286 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -50,9 +52,11 @@ import org.apache.spark.tags.DockerTest @DockerTest class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLContext { import testImplicits._ + // To make === between double tolerate inexact values + implicit val doubleEquality = TolerantNumerics.tolerantDoubleEquality(0.01) --- End diff -- Curious why it would be that inequal somewhere in these tests? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146147293 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala --- @@ -815,6 +815,12 @@ class JDBCSuite extends SparkFunSuite Some(DecimalType(DecimalType.MAX_PRECISION, 10))) assert(oracleDialect.getCatalystType(java.sql.Types.NUMERIC, "numeric", 0, null) == Some(DecimalType(DecimalType.MAX_PRECISION, 10))) +assert(oracleDialect.getCatalystType(100, "BINARY_FLOAT", 0, null) == --- End diff -- Let's define constants somewhere suitable for 100/100/-101 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19548#discussion_r146147267 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -307,4 +315,32 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo assert(values.getInt(1).equals(1)) assert(values.getBoolean(2).equals(false)) } + + test("SPARK-22303: handle BINARY_DOUBLE and BINARY_FLOAT as DoubleType and FloatType") { +val tableName = "oracle_types" +val schema = StructType(Seq( + StructField("d", DoubleType, true), + StructField("f", FloatType, true))) +val props = new Properties() + +// write it back to the table (append mode) +val data = spark.sparkContext.parallelize(Seq(Row(1.1D, 2.2F))) --- End diff -- Nit: I think it's more conventional to write "1.1" and "2.2f" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user taroplus commented on the issue: https://github.com/apache/spark/pull/19548 Okay, a new integration test has been added. According to their [document](https://docs.oracle.com/database/121/TTSQL/types.htm#TTSQL148): > BINARY_DOUBLE is a 64-bit, double-precision, floating-point number. > > BINARY_FLOAT is a 32-bit, single-precision, floating-point number. > > Both BINARY_FLOAT and BINARY_DOUBLE support the special values Inf, -Inf, and NaN (not a number) and conform to the IEEE standard. This should match Java's double / float definitions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
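Given those semantics, the dialect-side mapping discussed in this PR boils down to translating Oracle's vendor-specific JDBC type codes into Catalyst's FloatType and DoubleType. The sketch below shows the shape of such a mapping with named constants as requested in the review; the constant names and object layout are illustrative rather than the PR's actual code, and the numeric values should be verified against oracle.jdbc.OracleTypes.

```
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

// Named constants for Oracle's vendor-specific java.sql type codes
// (values as commonly documented for oracle.jdbc.OracleTypes; verify
// against the driver before relying on them).
object OracleTypeCodes {
  val BINARY_FLOAT: Int = 100
  val BINARY_DOUBLE: Int = 101
  val TIMESTAMPTZ: Int = -101
}

// Shape of the dialect hook; the real OracleDialect in the PR may differ.
case object OracleDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    sqlType match {
      case OracleTypeCodes.BINARY_FLOAT => Some(FloatType)
      case OracleTypeCodes.BINARY_DOUBLE => Some(DoubleType)
      case _ => None
    }
  }
}
```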
[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19548 **[Test build #82965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82965/testReport)** for PR 19548 at commit [`91e911d`](https://github.com/apache/spark/commit/91e911d92f96c251de10a86a73f854b4993ede39). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19451 **[Test build #82964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82964/testReport)** for PR 19451 at commit [`7030ab6`](https://github.com/apache/spark/commit/7030ab6a1ee4f798a39bc076595eb5c15c99a8ba). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19553: [SPARK-22330][CORE] Linear containsKey operation for ser...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19553 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...
GitHub user Whoosh opened a pull request: https://github.com/apache/spark/pull/19553 [SPARK-22330][CORE] Linear containsKey operation for serialized maps …alization. ## What changes were proposed in this pull request? Use non-linear containsKey operation for serialized maps, lookup into underlying map. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/Whoosh/spark SPARK-22330 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19553.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19553 commit 518b301d32a77a44812235d42f07302edcc4fda2 Author: Alexander Istomin Date: 2017-10-22T21:46:50Z SPARK-22330 use underlying contains instead of default AbstractMap realization. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19451 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82962/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19451 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19451 **[Test build #82962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82962/testReport)** for PR 19451 at commit [`f6c95d3`](https://github.com/apache/spark/commit/f6c95d36c7a91937f0ec27485c8a439d518fdedd). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19552 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82961/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19552 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19552 **[Test build #82961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)** for PR 19552 at commit [`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19543 **[Test build #82963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82963/testReport)** for PR 19543 at commit [`c587946`](https://github.com/apache/spark/commit/c587946a03ade7c4f56864e0c0d2c3552d0e6b80). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19543 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82963/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19543 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19543 **[Test build #82963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82963/testReport)** for PR 19543 at commit [`c587946`](https://github.com/apache/spark/commit/c587946a03ade7c4f56864e0c0d2c3552d0e6b80). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19543 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19451 **[Test build #82962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82962/testReport)** for PR 19451 at commit [`f6c95d3`](https://github.com/apache/spark/commit/f6c95d36c7a91937f0ec27485c8a439d518fdedd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19552 Ping, @budde . You can override this PR whenever you're ready. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19552 **[Test build #82961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)** for PR 19552 at commit [`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19552 [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default ## What changes were proposed in this pull request? In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical issue by default. - [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) uses `INFER_AND_SAVE` at 2.2.0 since Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files. > This situation will occur for any Hive table that wasn't created by Spark or that was created prior to Spark 2.1.0. If a user attempts to run a query over such a table containing a case-sensitive field name in the query projection or in the query filter, the query will return 0 results in every case. - However, [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306) reports this also corrupts Hive Metastore schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing owner. These are undesirable side effects. Hive Metastore is a shared resource and Spark should not corrupt it by default. - Since Spark 2.3.0 supports Bucketing, BUCKETING_COLS and SORT_COLS look okay at least. However, we need to figure out the issue of changing owners. Also, we cannot backport the bucketing patch into `branch-2.2`. We need to verify this option with more tests before releasing 2.3.0. This PR proposes to recover that option back to `NEVER_INFER` like Spark 2.2.0 by default. Users can take a risk by enabling `INFER_AND_SAVE` by themselves. ## How was this patch tested? Pass the existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-22329 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19552.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19552 commit a256627dbc2772e69cd0f9f2aa43b384165e3657 Author: Dongjoon Hyun Date: 2017-10-22T17:59:15Z [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
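Whichever default the discussion settles on, the behavior remains user-configurable. A minimal sketch of opting in explicitly is shown below; the valid values are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER, and the application name is only a placeholder.

```
import org.apache.spark.sql.SparkSession

// Choose the inference mode explicitly instead of relying on the default.
val spark = SparkSession.builder()
  .appName("case-sensitive-inference-example")
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()

// Or switch it at runtime before querying the affected tables:
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
```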
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17100 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82960/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 Applied all changes @viirya & @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17100 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17100 **[Test build #82960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82960/testReport)** for PR 17100 at commit [`75be390`](https://github.com/apache/spark/commit/75be3900f65e64bd083a91e5c63978489a61ccfe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17100 **[Test build #82960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82960/testReport)** for PR 17100 at commit [`75be390`](https://github.com/apache/spark/commit/75be3900f65e64bd083a91e5c63978489a61ccfe). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19370 @jsnowacki, per review comment - https://github.com/apache/spark/pull/19370#pullrequestreview-65758626, let's describe the background and the proposal in more detail in the PR description. It is true that we lack reviewers for Windows-specific issues, so we should describe it as thoroughly as we can. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19370#discussion_r146131458 --- Diff: bin/find-spark-home.cmd --- @@ -0,0 +1,44 @@ +@echo off + +rem +rem Licensed to the Apache Software Foundation (ASF) under one or more +rem contributor license agreements. See the NOTICE file distributed with +rem this work for additional information regarding copyright ownership. +rem The ASF licenses this file to You under the Apache License, Version 2.0 +rem (the "License"); you may not use this file except in compliance with +rem the License. You may obtain a copy of the License at +rem +remhttp://www.apache.org/licenses/LICENSE-2.0 +rem +rem Unless required by applicable law or agreed to in writing, software +rem distributed under the License is distributed on an "AS IS" BASIS, +rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +rem See the License for the specific language governing permissions and +rem limitations under the License. +rem + +rem Path to Python script finding SPARK_HOME +set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py + +rem Default to standard python interpreter unless told otherwise +set PYTHON_RUNNER=python +if not "x%PYSPARK_DRIVER_PYTHON%" =="x" ( + set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON% --- End diff -- `PYSPARK_PYTHON` is a valid env. Could we consider this env too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19370#discussion_r146132096 --- Diff: bin/find-spark-home.cmd --- @@ -0,0 +1,44 @@ +@echo off + +rem +rem Licensed to the Apache Software Foundation (ASF) under one or more +rem contributor license agreements. See the NOTICE file distributed with +rem this work for additional information regarding copyright ownership. +rem The ASF licenses this file to You under the Apache License, Version 2.0 +rem (the "License"); you may not use this file except in compliance with +rem the License. You may obtain a copy of the License at +rem +remhttp://www.apache.org/licenses/LICENSE-2.0 +rem +rem Unless required by applicable law or agreed to in writing, software +rem distributed under the License is distributed on an "AS IS" BASIS, +rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +rem See the License for the specific language governing permissions and +rem limitations under the License. +rem + +rem Path to Python script finding SPARK_HOME +set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py + +rem Default to standard python interpreter unless told otherwise +set PYTHON_RUNNER=python +if not "x%PYSPARK_DRIVER_PYTHON%" =="x" ( + set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON% +) + +rem Only attempt to find SPARK_HOME if it is not set. +if "x%SPARK_HOME%"=="x" ( + rem We are pip installed, use the Python script to resolve a reasonable SPARK_HOME + if exist "%FIND_SPARK_HOME_SCRIPT%" ( +rem If there is no python installed it will fail with message: +rem 'python' is not recognized as an internal or external command, +for /f %%i in ('%PYTHON_RUNNER% %FIND_SPARK_HOME_SCRIPT%') do set SPARK_HOME=%%i --- End diff -- Should we maybe add `"delims="` (`for /f "delims=" %%i`) to overwrite the default space and tab delimiters? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org