[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19554
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19554
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82967/
Test PASSed.


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19554
  
**[Test build #82967 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82967/testReport)**
 for PR 19554 at commit 
[`def29a1`](https://github.com/apache/spark/commit/def29a10ac71522218913d1af041b885ef8b9312).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146169405
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit}
 /** Params for linear SVM Classifier. */
 private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
   with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
-  with HasAggregationDepth with HasThreshold {
+  with HasAggregationDepth with HasThreshold with HasSolver {
+
+  /**
+   * Specifies the loss function. Currently "hinge" and "squared_hinge" 
are supported.
+   * "hinge" is the standard SVM loss (a.k.a. L1 loss) while 
"squared_hinge" is the square of
+   * the hinge loss (a.k.a. L2 loss).
+   *
+   * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a>
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val loss: Param[String] = new Param(this, "loss", "Specifies the 
loss " +
+"function. hinge is the standard SVM loss while squared_hinge is the 
square of the hinge loss.",
+(s: String) => 
LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT)))
--- End diff --

Created a jira to address that issue: 
https://issues.apache.org/jira/browse/SPARK-22331


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146168734
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
 @Since("2.2.0")
 object LinearSVC extends DefaultParamsReadable[LinearSVC] {
 
+  /** String name for Limited-memory BFGS. */
+  private[classification] val LBFGS: String = 
"l-bfgs".toLowerCase(Locale.ROOT)
+
+  /** String name for Orthant-Wise Limited-memory Quasi-Newton. */
+  private[classification] val OWLQN: String = 
"owlqn".toLowerCase(Locale.ROOT)
+
+  /* Set of optimizers that LinearSVC supports */
+  private[classification] val supportedSolvers = Array(LBFGS, OWLQN)
+
+  /** String name for Hinge Loss. */
+  private[classification] val HINGE: String = 
"hinge".toLowerCase(Locale.ROOT)
+
+  /** String name for Squared Hinge Loss. */
+  private[classification] val SQUARED_HINGE: String = 
"squared_hinge".toLowerCase(Locale.ROOT)
--- End diff --

Personally I never have, but I cannot guarantee it for all locales.
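For reference, a minimal sketch of the pitfall the `Locale.ROOT` calls guard against (illustrative only, assuming a JVM whose default locale is Turkish):

```scala
import java.util.Locale

// In the Turkish locale an upper-case 'I' lower-cases to the dotless 'ı',
// so a locale-sensitive toLowerCase breaks the comparison against "hinge".
val turkish = new Locale("tr")
println("HINGE".toLowerCase(turkish))                 // hınge
println("HINGE".toLowerCase(turkish) == "hinge")      // false
println("HINGE".toLowerCase(Locale.ROOT) == "hinge")  // true
```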


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146168660
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit}
 /** Params for linear SVM Classifier. */
 private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
   with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
-  with HasAggregationDepth with HasThreshold {
+  with HasAggregationDepth with HasThreshold with HasSolver {
+
+  /**
+   * Specifies the loss function. Currently "hinge" and "squared_hinge" 
are supported.
+   * "hinge" is the standard SVM loss (a.k.a. L1 loss) while 
"squared_hinge" is the square of
+   * the hinge loss (a.k.a. L2 loss).
+   *
+   * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a>
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val loss: Param[String] = new Param(this, "loss", "Specifies the 
loss " +
+"function. hinge is the standard SVM loss while squared_hinge is the 
square of the hinge loss.",
+(s: String) => 
LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT)))
--- End diff --

I tend to support case-insensitive params in `LinearRegression` as well, or to change the default behavior of `ParamValidators.inArray`. Either way, we should improve the consistency of how case-insensitive String params are supported.



---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19439
  
@thunterdb Thanks for the reply. 

> It does, indirectly: this is what the field types CV_32FXX do. You need 
to do some low-level casting to convert the byte array to array of numbers, but 
that should not be a limitation for most applications.

The solution may work, but I don't think it's an ideal one; at the least there is a performance concern. Another option is to support byte[], short[], int[], float[] and double[] as data storage type candidates and switch among them according to CvType (certainly with more complexity in the code).

But as an initial PR, I'm OK with supporting just byte[] and seeing whether further requirements come up.
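To make the "low-level casting" mentioned in the quote above concrete, here is a rough sketch (illustrative only, assuming the producer wrote raw little-endian IEEE-754 floats into the byte array) of reinterpreting the `data` payload of a CV_32F image as floats:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret an image's byte[] payload as Array[Float] for a CV_32F image.
// Assumes the bytes are raw little-endian IEEE-754 floats.
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out)
  out
}
```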

> Yes, this feature has been debated. Some developers have had a compelling 
need for directly accessing some information about the origin of the image 
directly inside the image.

> One of the main reasons is MLlib pipelines: transformers/estimators work 
on a single dataframe column; so it is much easier when "origin" is a part of 
this column too.

Sure, some extra metadata can be useful. Yet I don't see why origin deserves priority over other metadata like Option[Label] or Option[Roi]. I'm mostly concerned that path info is not really image data in essence, and the info may become redundant (if there's a path column), stale (if the environment changes), or not useful (after converting to an OpenCV Mat). @dakirsa, @thunterdb, would you please kindly provide an example where the "origin" info is indispensable in an image transformation? Maybe I missed an important scenario.

> Indeed, this feature may not seem that useful at first glance. For some 
hadoop file systems though, in which images are accessed in a batched manner, 
it is useful to traverse these batches. This is important for performance 
reasons. This is why it is marked as experimental for the time being

Does the wildcard matching still work if `recursive == false`?




---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146167919
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
 @Since("2.2.0")
 object LinearSVC extends DefaultParamsReadable[LinearSVC] {
 
+  /** String name for Limited-memory BFGS. */
+  private[classification] val LBFGS: String = 
"l-bfgs".toLowerCase(Locale.ROOT)
+
+  /** String name for Orthant-Wise Limited-memory Quasi-Newton. */
+  private[classification] val OWLQN: String = 
"owlqn".toLowerCase(Locale.ROOT)
+
+  /* Set of optimizers that LinearSVC supports */
+  private[classification] val supportedSolvers = Array(LBFGS, OWLQN)
+
+  /** String name for Hinge Loss. */
+  private[classification] val HINGE: String = 
"hinge".toLowerCase(Locale.ROOT)
+
+  /** String name for Squared Hinge Loss. */
+  private[classification] val SQUARED_HINGE: String = 
"squared_hinge".toLowerCase(Locale.ROOT)
--- End diff --

Ditto. IMO these characters are all ASCII, so I think they won't run into locale issues. (But have you encountered such an issue in some environment?)


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146167706
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit}
 /** Params for linear SVM Classifier. */
 private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
   with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
-  with HasAggregationDepth with HasThreshold {
+  with HasAggregationDepth with HasThreshold with HasSolver {
+
+  /**
+   * Specifies the loss function. Currently "hinge" and "squared_hinge" 
are supported.
+   * "hinge" is the standard SVM loss (a.k.a. L1 loss) while 
"squared_hinge" is the square of
+   * the hinge loss (a.k.a. L2 loss).
+   *
+   * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a>
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val loss: Param[String] = new Param(this, "loss", "Specifies the 
loss " +
+"function. hinge is the standard SVM loss while squared_hinge is the 
square of the hinge loss.",
+(s: String) => 
LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT)))
--- End diff --

Yeah, I thought about this, but the `solver` param in `LinearRegression` also ignores it. I tend to keep them consistent; what do you think?


---




[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17862
  
Thanks @WeichenXu123 for the comments. 


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146166006
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
 @Since("2.2.0")
 object LinearSVC extends DefaultParamsReadable[LinearSVC] {
 
+  /** String name for Limited-memory BFGS. */
+  private[classification] val LBFGS: String = 
"l-bfgs".toLowerCase(Locale.ROOT)
+
+  /** String name for Orthant-Wise Limited-memory Quasi-Newton. */
+  private[classification] val OWLQN: String = 
"owlqn".toLowerCase(Locale.ROOT)
+
+  /* Set of optimizers that LinearSVC supports */
+  private[classification] val supportedSolvers = Array(LBFGS, OWLQN)
+
+  /** String name for Hinge Loss. */
+  private[classification] val HINGE: String = 
"hinge".toLowerCase(Locale.ROOT)
+
+  /** String name for Squared Hinge Loss. */
+  private[classification] val SQUARED_HINGE: String = 
"squared_hinge".toLowerCase(Locale.ROOT)
+
+  /* Set of loss function that LinearSVC supports */
+  private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE)
--- End diff --

Sure. I can update it.


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146165706
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
 @Since("2.2.0")
 object LinearSVC extends DefaultParamsReadable[LinearSVC] {
 
+  /** String name for Limited-memory BFGS. */
+  private[classification] val LBFGS: String = 
"l-bfgs".toLowerCase(Locale.ROOT)
+
+  /** String name for Orthant-Wise Limited-memory Quasi-Newton. */
+  private[classification] val OWLQN: String = 
"owlqn".toLowerCase(Locale.ROOT)
+
+  /* Set of optimizers that LinearSVC supports */
+  private[classification] val supportedSolvers = Array(LBFGS, OWLQN)
+
+  /** String name for Hinge Loss. */
+  private[classification] val HINGE: String = 
"hinge".toLowerCase(Locale.ROOT)
+
+  /** String name for Squared Hinge Loss. */
+  private[classification] val SQUARED_HINGE: String = 
"squared_hinge".toLowerCase(Locale.ROOT)
--- End diff --

To ensure consistency with the param validation in all locales.


---




[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...

2017-10-22 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17862#discussion_r146165449
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit}
 /** Params for linear SVM Classifier. */
 private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
   with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
-  with HasAggregationDepth with HasThreshold {
+  with HasAggregationDepth with HasThreshold with HasSolver {
+
+  /**
+   * Specifies the loss function. Currently "hinge" and "squared_hinge" 
are supported.
+   * "hinge" is the standard SVM loss (a.k.a. L1 loss) while 
"squared_hinge" is the square of
+   * the hinge loss (a.k.a. L2 loss).
+   *
+   * @see <a href="https://en.wikipedia.org/wiki/Hinge_loss">Hinge loss (Wikipedia)</a>
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val loss: Param[String] = new Param(this, "loss", "Specifies the 
loss " +
+"function. hinge is the standard SVM loss while squared_hinge is the 
square of the hinge loss.",
+(s: String) => 
LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT)))
--- End diff --

Correct me if I'm wrong, but IMO we need `toLowerCase` here.


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r146163447
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields = Array("origin", "height", "width", "nChannels", 
"mode", "data")
+
+  val ocvTypes = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, 
"CV_8UC4" -> 24,
+"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, 
"CV_8SC4" -> 25,
+"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, 
"CV_16UC4" -> 26,
+"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, 
"CV_16SC4" -> 27,
+"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, 
"CV_32SC4" -> 28,
+"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, 
"CV_32FC4" -> 29,
+"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, 
"CV_64FC4" -> 30
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * :: Experimental ::
+   * Check if the DataFrame column contains images (i.e. has ImageSchema)
+   *
+   * @param df DataFrame
+   * @param column Column name
+   * @return True if the given column matches the image schema
+   */
+  def isImageColumn(df: DataFrame, column: String): Boolean =
+df.schema(column).dataType == columnSchema
+
+  

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r146163650
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
--- End diff --

The hardcoded `ocvTypes` map exists on both the Scala and Python sides. I think it will cause trouble in future code maintenance, because we always have to make sure the two stay identical. Maybe a better way is to load `ocvTypes` from a file, or to write an extra script that generates the `ocvTypes` code block for Scala/Python?
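A rough sketch of the generator idea (illustrative only; the object name and the abbreviated map are made up for the example):

```scala
// Emit the Python `ocvTypes` dict from the Scala map, so both sides come
// from a single source of truth instead of being kept in sync by hand.
// The map is abbreviated here for brevity.
object GenOcvTypes {
  val ocvTypes: Map[String, Int] = Map(
    "Undefined" -> -1,
    "CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24)

  def pythonLiteral: String =
    ocvTypes.toSeq.sortBy(_._2)
      .map { case (k, v) => "\"" + k + "\": " + v }
      .mkString("ocvTypes = {\n    ", ",\n    ", "\n}")

  def main(args: Array[String]): Unit = println(pythonLiteral)
}
```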


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r146164215
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType(StructField("image", StructType([
+StructField(imageFields[0], StringType(),  True),
+StructField(imageFields[1], IntegerType(), False),
+StructField(imageFields[2], IntegerType(), False),
+StructField(imageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields[4], StringType(), False),
+# bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields[5], BinaryType(), False)]), True))
+
+
+def toNDArray(image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image (object): The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+
+def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]):
+"""
+Converts a one-dimensional array to a two-dimensional image.
+
+:param array (array): The array to convert to image
+:param origin (str): Path to the image
+:param mode (str): OpenCV compatible type
+
+:rtype object: Two dimensional image
+
+.. versionadded:: 2.3.0
+"""
--- End diff --

We need to document the channel order requirement for the array passed in, e.g. if mode "CV_8UC3" is used, then the order must be "RGB", is that right? Or add another param to specify the channel order of the passed-in array?
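For illustration, a sketch of the kind of conversion callers would otherwise need if their array is RGB-ordered but the stored `data` is expected in OpenCV's BGR order (purely hypothetical helper, not part of this PR):

```scala
// Swap R and B in an interleaved 3-channel byte array: RGBRGB... -> BGRBGR...
def rgbToBgr(data: Array[Byte]): Array[Byte] = {
  require(data.length % 3 == 0, "expected 3 interleaved channels")
  val out = new Array[Byte](data.length)
  var i = 0
  while (i < data.length) {
    out(i) = data(i + 2)      // B comes from the input's third channel
    out(i + 1) = data(i + 1)  // G is unchanged
    out(i + 2) = data(i)      // R moves to the third position
    i += 3
  }
  out
}
```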


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r146163724
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType(StructField("image", StructType([
+StructField(imageFields[0], StringType(),  True),
+StructField(imageFields[1], IntegerType(), False),
+StructField(imageFields[2], IntegerType(), False),
+StructField(imageFields[3], IntegerType(), False),
+# OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields[4], StringType(), False),
--- End diff --

+1. The mode column type is not consistent with the Scala side (StringType here vs. IntegerType there).


---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-22 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
Documentation removed as per @srowen 's request in the associated JIRA 
issue [SPARK-22308]



---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19529
  
**[Test build #82969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82969/testReport)**
 for PR 19529 at commit 
[`4218b86`](https://github.com/apache/spark/commit/4218b86d5a8ff2321232ff38ed3e1b217ff7db2a).


---




[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19506
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82966/
Test PASSed.


---




[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19506
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19506
  
**[Test build #82966 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)**
 for PR 19506 at commit 
[`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

2017-10-22 Thread falaki
Github user falaki commented on a diff in the pull request:

https://github.com/apache/spark/pull/19551#discussion_r146160321
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -499,6 +499,12 @@ test_that("create DataFrame with different data 
types", {
   expect_equal(collect(df), data.frame(l, stringsAsFactors = FALSE))
 })
 
+test_that("SPARK-17902: collect() with stringsAsFactors enabled", {
--- End diff --

Would you please verify that the factor orders are identical? I wonder whether `expect_equal` really verifies the order of values in a factor.


---




[GitHub] spark pull request #19551: [WIP][SPARK-17902][R] Revive stringsAsFactors opt...

2017-10-22 Thread falaki
Github user falaki commented on a diff in the pull request:

https://github.com/apache/spark/pull/19551#discussion_r146160421
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1191,6 +1191,9 @@ setMethod("collect",
 vec <- do.call(c, col)
 stopifnot(class(vec) != "list")
 class(vec) <- PRIMITIVE_TYPES[[colType]]
+if (stringsAsFactors && is.character(vec)) {
--- End diff --

For performance maybe it is better to reverse the order of checks: 
`is.character(vec) && stringsAsFactors`


---




[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread taroplus
Github user taroplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146160084
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -815,6 +815,12 @@ class JDBCSuite extends SparkFunSuite
   Some(DecimalType(DecimalType.MAX_PRECISION, 10)))
 assert(oracleDialect.getCatalystType(java.sql.Types.NUMERIC, 
"numeric", 0, null) ==
   Some(DecimalType(DecimalType.MAX_PRECISION, 10)))
+assert(oracleDialect.getCatalystType(100, "BINARY_FLOAT", 0, null) ==
--- End diff --

Made them constants defined in OracleDialect
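For readers following along, a sketch of what such constants could look like (the object and constant names here are illustrative; the numeric values are the ones already used in the diff above):

```scala
// Named constants for the Oracle-specific JDBC type codes, so the dialect's
// match arms avoid magic numbers. Values as referenced in the diff above.
object OracleTypeCodes {
  val TIMESTAMPTZ: Int = -101   // TIMESTAMP WITH TIME ZONE
  val BINARY_FLOAT: Int = 100
  val BINARY_DOUBLE: Int = 101
}
```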


---




[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread taroplus
Github user taroplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146160039
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -108,6 +112,10 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
 conn.prepareStatement(
   "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate();
 conn.commit();
+
+
+conn.prepareStatement("CREATE TABLE oracle_types (d BINARY_DOUBLE, f 
BINARY_FLOAT)").executeUpdate();
--- End diff --

removed trailing semi-colons throughout this file


---




[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread taroplus
Github user taroplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146160051
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -307,4 +315,32 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
 assert(values.getInt(1).equals(1))
 assert(values.getBoolean(2).equals(false))
   }
+
+  test("SPARK-22303: handle BINARY_DOUBLE and BINARY_FLOAT as DoubleType 
and FloatType") {
+val tableName = "oracle_types"
+val schema = StructType(Seq(
+  StructField("d", DoubleType, true),
+  StructField("f", FloatType, true)))
+val props = new Properties()
+
+// write it back to the table (append mode)
+val data = spark.sparkContext.parallelize(Seq(Row(1.1D, 2.2F)))
--- End diff --

updated


---




[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread taroplus
Github user taroplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146160013
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -50,9 +52,11 @@ import org.apache.spark.tags.DockerTest
 @DockerTest
 class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with 
SharedSQLContext {
   import testImplicits._
+  // To make === between double tolerate inexact values
+  implicit val doubleEquality = 
TolerantNumerics.tolerantDoubleEquality(0.01)
--- End diff --

removed this line and the test still passes


---




[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19548
  
**[Test build #82968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82968/testReport)**
 for PR 19548 at commit 
[`28c7ce8`](https://github.com/apache/spark/commit/28c7ce8484d2e0182b213ebc63b857f047c06ca6).


---




[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread taroplus
Github user taroplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146159980
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala ---
@@ -28,25 +28,28 @@ private case object OracleDialect extends JdbcDialect {
 
   override def getCatalystType(
   sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): 
Option[DataType] = {
-if (sqlType == Types.NUMERIC) {
-  val scale = if (null != md) md.build().getLong("scale") else 0L
-  size match {
-// Handle NUMBER fields that have no precision/scale in special way
-// because JDBC ResultSetMetaData converts this to 0 precision and 
-127 scale
-// For more details, please see
-// https://github.com/apache/spark/pull/8780#issuecomment-145598968
-// and
-// https://github.com/apache/spark/pull/8780#issuecomment-144541760
-case 0 => Option(DecimalType(DecimalType.MAX_PRECISION, 10))
-// Handle FLOAT fields in a special way because JDBC 
ResultSetMetaData converts
-// this to NUMERIC with -127 scale
-// Not sure if there is a more robust way to identify the field as 
a float (or other
-// numeric types that do not specify a scale.
-case _ if scale == -127L => 
Option(DecimalType(DecimalType.MAX_PRECISION, 10))
-case _ => None
-  }
-} else {
-  None
+sqlType match {
+  case Types.NUMERIC =>
+val scale = if (null != md) md.build().getLong("scale") else 0L
+size match {
+  // Handle NUMBER fields that have no precision/scale in special 
way
+  // because JDBC ResultSetMetaData converts this to 0 precision 
and -127 scale
+  // For more details, please see
+  // 
https://github.com/apache/spark/pull/8780#issuecomment-145598968
+  // and
+  // 
https://github.com/apache/spark/pull/8780#issuecomment-144541760
+  case 0 => Option(DecimalType(DecimalType.MAX_PRECISION, 10))
+  // Handle FLOAT fields in a special way because JDBC 
ResultSetMetaData converts
+  // this to NUMERIC with -127 scale
+  // Not sure if there is a more robust way to identify the field 
as a float (or other
+  // numeric types that do not specify a scale.
+  case _ if scale == -127L => 
Option(DecimalType(DecimalType.MAX_PRECISION, 10))
+  case _ => None
+}
+  case -101 => Some(TimestampType) // Value for Timestamp with Time 
Zone in Oracle
+  case 100 => Some(FloatType) // Value for OracleTypes.BINARY_FLOAT
+  case 101 => Some(DoubleType) // Value for OracleTypes.BINARY_DOUBLE
--- End diff --

This should match Java's double/float definitions.


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19554
  
**[Test build #82967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82967/testReport)**
 for PR 19554 at commit 
[`def29a1`](https://github.com/apache/spark/commit/def29a10ac71522218913d1af041b885ef8b9312).


---




[GitHub] spark issue #19554: [SPARK-22319][Core][BACKPORT-2.2] call loginUserFromKeyt...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
ok to test.


---




[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
Can you please add the `[BACKPORT-2.2]` tag to the PR title?


---




[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19554
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19554: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19554
  
ok to test.


---




[GitHub] spark pull request #19554: [SPARK-22319][Core] call loginUserFromKeytab befo...

2017-10-22 Thread sjrand
GitHub user sjrand opened a pull request:

https://github.com/apache/spark/pull/19554

[SPARK-22319][Core] call loginUserFromKeytab before accessing hdfs

## What changes were proposed in this pull request?

In SparkSubmit, call loginUserFromKeytab before attempting to make RPC 
calls to the NameNode.

Same as https://github.com/apache/spark/pull/19540, but for branch-2.2.

## How was this patch tested?

Manually tested for master as described in 
https://github.com/apache/spark/pull/19540.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sjrand/spark SPARK-22319-branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19554


commit def29a10ac71522218913d1af041b885ef8b9312
Author: Steven Rand 
Date:   2017-10-23T02:37:19Z

call loginUserFromKeytab before accessing hdfs




---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread sjrand
Github user sjrand commented on the issue:

https://github.com/apache/spark/pull/19540
  
Thanks @jerryshao and @jiangxb1987. Created 
https://github.com/apache/spark/pull/19554 for branch-2.2.


---




[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18747#discussion_r146157000
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, 
Partitioning}
-import org.apache.spark.sql.execution.LeafExecNode
-import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.UserDefinedType
+import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
 
 
 case class InMemoryTableScanExec(
 attributes: Seq[Attribute],
 predicates: Seq[Expression],
 @transient relation: InMemoryRelation)
-  extends LeafExecNode {
+  extends LeafExecNode with ColumnarBatchScan {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) 
++ super.innerChildren
 
-  override lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of 
output rows"))
+  override def vectorTypes: Option[Seq[String]] =
+
Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName))
+
+  /**
+   * If true, get data from ColumnVector in ColumnarBatch, which are 
generally faster.
+   * If false, get data from UnsafeRow build from ColumnVector
+   */
+  override val supportCodegen: Boolean = {
+// In the initial implementation, for ease of review
+// support only primitive data types and # of fields is less than 
wholeStageMaxNumFields
+val schema = StructType.fromAttributes(relation.output)
+schema.fields.find(f => f.dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType |
+   FloatType | DoubleType => false
+  case _ => true
+}).isEmpty && !WholeStageCodegenExec.isTooManyFields(conf, 
relation.schema)
+  }
+
+  private val columnIndices =
+attributes.map(a => relation.output.map(o => 
o.exprId).indexOf(a.exprId)).toArray
+
+  private val relationSchema = relation.schema.toArray
+
+  private lazy val columnarBatchSchema = new 
StructType(columnIndices.map(i => relationSchema(i)))
+
+  private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): 
ColumnarBatch = {
+val rowCount = cachedColumnarBatch.numRows
+val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, 
columnarBatchSchema)
--- End diff --

Is there any reason we use `OnHeapColumnVector` instead of 
`OffHeapColumnVector`?


---




[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18747#discussion_r146156401
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, 
Partitioning}
-import org.apache.spark.sql.execution.LeafExecNode
-import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.UserDefinedType
+import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
 
 
 case class InMemoryTableScanExec(
 attributes: Seq[Attribute],
 predicates: Seq[Expression],
 @transient relation: InMemoryRelation)
-  extends LeafExecNode {
+  extends LeafExecNode with ColumnarBatchScan {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) 
++ super.innerChildren
 
-  override lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of 
output rows"))
+  override def vectorTypes: Option[Seq[String]] =
+
Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName))
+
+  /**
+   * If true, get data from ColumnVector in ColumnarBatch, which are 
generally faster.
+   * If false, get data from UnsafeRow build from ColumnVector
+   */
+  override val supportCodegen: Boolean = {
+// In the initial implementation, for ease of review
+// support only primitive data types and # of fields is less than 
wholeStageMaxNumFields
+val schema = StructType.fromAttributes(relation.output)
+schema.fields.find(f => f.dataType match {
+  case BooleanType | ByteType | ShortType | IntegerType | LongType |
+   FloatType | DoubleType => false
+  case _ => true
+}).isEmpty && !WholeStageCodegenExec.isTooManyFields(conf, 
relation.schema)
+  }
+
+  private val columnIndices =
+attributes.map(a => relation.output.map(o => 
o.exprId).indexOf(a.exprId)).toArray
+
+  private val relationSchema = relation.schema.toArray
+
+  private lazy val columnarBatchSchema = new 
StructType(columnIndices.map(i => relationSchema(i)))
+
+  private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): 
ColumnarBatch = {
+val rowCount = cachedColumnarBatch.numRows
+val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, 
columnarBatchSchema)
+val columnarBatch = new ColumnarBatch(
+  columnarBatchSchema, 
columnVectors.asInstanceOf[Array[ColumnVector]], rowCount)
+columnarBatch.setNumRows(rowCount)
+
+for (i <- 0 until attributes.length) {
+  ColumnAccessor.decompress(
+cachedColumnarBatch.buffers(columnIndices(i)),
+columnarBatch.column(i).asInstanceOf[WritableColumnVector],
+columnarBatchSchema.fields(i).dataType, rowCount)
+}
+columnarBatch
+  }
+
+  override def inputRDDs(): Seq[RDD[InternalRow]] = {
+if (supportCodegen) {
--- End diff --

If `supportCodegen` is false, I think we never call `inputRDDs`.


---




[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18747#discussion_r146155764
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -23,21 +23,70 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, 
Partitioning}
-import org.apache.spark.sql.execution.LeafExecNode
-import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.UserDefinedType
+import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution.vectorized._
+import org.apache.spark.sql.types._
 
 
 case class InMemoryTableScanExec(
 attributes: Seq[Attribute],
 predicates: Seq[Expression],
 @transient relation: InMemoryRelation)
-  extends LeafExecNode {
+  extends LeafExecNode with ColumnarBatchScan {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) 
++ super.innerChildren
 
-  override lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of 
output rows"))
+  override def vectorTypes: Option[Seq[String]] =
+
Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName))
+
+  /**
+   * If true, get data from ColumnVector in ColumnarBatch, which are 
generally faster.
+   * If false, get data from UnsafeRow build from ColumnVector
+   */
+  override val supportCodegen: Boolean = {
+// In the initial implementation, for ease of review
+// support only primitive data types and # of fields is less than 
wholeStageMaxNumFields
+val schema = StructType.fromAttributes(relation.output)
--- End diff --

Isn't this just `relation.schema`?


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia I'm OK with either.


---




[GitHub] spark pull request #19519: [SPARK-21840][core] Add trait that allows conf to...

2017-10-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19519#discussion_r146154530
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkApplication.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import java.lang.reflect.Modifier
+
+import org.apache.spark.SparkConf
+
+/**
+ * Entry point for a Spark application. Implementations must provide a 
no-argument constructor.
+ */
+private[spark] trait SparkApplication {
+
+  def start(args: Array[String], conf: SparkConf): Unit
+
+}
+
+/**
+ * Implementation of SparkApplication that wraps a standard Java class 
with a "main" method.
+ *
+ * Configuration is propagated to the application via system properties, 
so running multiple
+ * of these in the same JVM may lead to undefined behavior due to 
configuration leaks.
+ */
+private[deploy] class JavaMainApplication(klass: Class[_]) extends 
SparkApplication {
+
+  override def start(args: Array[String], conf: SparkConf): Unit = {
+val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
+if (!Modifier.isStatic(mainMethod.getModifiers)) {
+  throw new IllegalStateException("The main method in the given main 
class must be static")
+}
+
+val sysProps = conf.getAll.toMap
+sysProps.foreach { case (k, v) =>
+  sys.props(k) = v
+}
+
+mainMethod.invoke(null, args)
+  }
--- End diff --

@vanzin , do we need to remove all the system properties after `mainMethod` 
is finished?
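If it turns out to be needed, one option (a sketch only, not what this PR currently does) is to snapshot the properties before the call and restore them afterwards:

```scala
// Snapshot system properties before invoking main() and restore them after,
// so config set for one application does not leak into the next one.
val before = sys.props.toMap  // immutable snapshot
try {
  conf.getAll.foreach { case (k, v) => sys.props(k) = v }
  mainMethod.invoke(null, args)
} finally {
  // drop keys that were added, then restore any values that were overwritten
  (sys.props.keySet.toSet -- before.keySet).foreach(sys.props.remove)
  before.foreach { case (k, v) => sys.props(k) = v }
}
```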


---




[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153863
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
--- End diff --

`\n|Please check attribute(s) `$commonNames`, they seem to appear in two...`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153187
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
--- End diff --

nit:
```
val attrsWithSameName = o.missingInput.filter { missing =>
  o.inputSet.exists(input => resolver(missing.name, input.name))
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153590
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
--- End diff --

`attrsWithSameName.size > 0` -> `attrsWithSameName.nonEmpty`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153288
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
--- End diff --

`repeatedNameHint` doesn't need `missingAttributes` or `availableAttributes`; could you 
move its definition up, right after `attrsWithSameName`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153739
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
--- End diff --

The `datasets` concept is a little strange here, would `inputs` or `input 
operators` be better?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146152572
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -24,6 +24,7 @@ import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

unused import


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146153805
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
--- End diff --

I'm +0 for this additional error msg. I'm not sure if it is actually more 
clear to end users.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19548
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
@sjrand, can you please create another PR against branch-2.2? This one is not 
auto-mergeable. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19548
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82965/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19540: [SPARK-22319][Core] call loginUserFromKeytab befo...

2017-10-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19540


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19548
  
**[Test build #82965 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82965/testReport)**
 for PR 19548 at commit 
[`91e911d`](https://github.com/apache/spark/commit/91e911d92f96c251de10a86a73f854b4993ede39).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-22 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
LGTM, merging to master and branch 2.2.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19527
  
@BryanCutler @MLnick @WeichenXu123 Thanks for reviewing. Your comments 
should all be addressed now. Please take a look again when you have more time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r146153067
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params for OneHotEncoderEstimator and 
OneHotEncoderModel */
+private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'skip' (filter out rows with invalid data) or 'error' 
(throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'skip' (filter out rows with invalid data) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via 
`dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly 
dependent.
+ * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
+ *
+ * @note This is different from scikit-learn's OneHotEncoder, which keeps 
all categories.
+ * The output vectors are sparse.
+ *
+ * @see `StringIndexer` for converting categorical values into category 
indices
+ */
+@Since("2.3.0")
+class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("oneHotEncoder"))
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setInputCols(values: Array[String]): this.type = set(inputCols, 
values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setOutputCols(values: Array[String]): this.type = set(outputCols, 
values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setDropLast(value: Boolean): this.type = set(dropLast, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, 
value)
+
+  @Since("2.3.0")
+  override def transformSchema(schema: StructType): StructType = {
+val 

[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146152918
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
+} else {
+  ""
+}
 
 failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+  s"""Some resolved attribute(s) are not present among the 
available attributes
+|for a query.
+|$missingAttributes is not in 
$availableAttributes.$repeatedNameHint
--- End diff --

Move `$repeatedNameHint` in next line as:
```scala
|$missingAttributes is not in $availableAttributes.
|$repeatedNameHint
```
So you can remove the new line above?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146152895
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest {
 // CheckAnalysis should throw AnalysisException when Aggregate 
contains missing attribute(s)
 // Since we manually construct the logical plan at here and Sum only 
accept
 // LongType, DoubleType, and DecimalType. We use LongType as the type 
of a.
-val plan =
-  Aggregate(
+val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+val aId = Array[String](attrA.name, attrA.exprId.id.toString)
+val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+val otherAId = Array[String](otherA.name, otherA.exprId.id.toString)
+val bAlias = Alias(sum(attrA), "b")() :: Nil
+val plan = Aggregate(
 Nil,
-Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), 
"b")() :: Nil,
-LocalRelation(
-  AttributeReference("a", LongType)(exprId = ExprId(2
+bAlias,
+LocalRelation(otherA))
 
 assert(plan.resolved)
 
-assertAnalysisError(plan, "resolved attribute(s) a#1L missing from 
a#2L" :: Nil)
+val errorMsg = s"""Some resolved attribute(s) are not present among 
the available attributes
+ |for a query.
+ |${aId.mkString("#")}L is not in 
${otherAId.mkString("#")}L.
--- End diff --

Seems this is still unchanged?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146152740
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -24,6 +24,7 @@ import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

No need to import SQLConf now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

2017-10-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19506#discussion_r146151411
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala
 ---
@@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
   override def dataType: DataType = ArrayType(LongType)
 
   override def prettyName: String = "approx_count_distinct_for_intervals"
+
+  override def serialize(obj: Array[Long]): Array[Byte] = {
+val byteArray = new Array[Byte](obj.length * 8)
+obj.indices.foreach { i =>
--- End diff --

Fixed. Thanks for the reminder!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19506
  
**[Test build #82966 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)**
 for PR 19506 at commit 
[`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82964/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19451
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19451
  
**[Test build #82964 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82964/testReport)**
 for PR 19451 at commit 
[`7030ab6`](https://github.com/apache/spark/commit/7030ab6a1ee4f798a39bc076595eb5c15c99a8ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19553#discussion_r146147996
  
--- Diff: core/src/test/scala/org/apache/spark/util/JavaUtils.scala ---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import org.mockito.Mockito._
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.api.java.JavaUtils
+
+class JavaUtils extends SparkFunSuite {
+
+  test("containsKey implementation without iteratively entrySet call") {
--- End diff --

You'd be welcome to add a test that the resulting map is serializable as 
that's the primary job of this method, but that's of course not directly 
related. 
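
As an illustrative sketch of such a check (not part of the patch; `wrappedMap` is a hypothetical stand-in for whatever the method under test returns), a plain Java-serialization round trip could look like:

```scala
// Serialize the wrapped map with standard Java serialization, read it back,
// then exercise containsKey on the copy.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

def roundTrip[T](obj: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

// Hypothetical assertion, where `wrappedMap` is the java.util.Map under test:
// assert(roundTrip(wrappedMap).containsKey("a"))
```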


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19553#discussion_r146147889
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala ---
@@ -43,6 +43,13 @@ private[spark] object JavaUtils {
 
 override def size: Int = underlying.size
 
+// Delegate to implementation because AbstractMap implementation 
iterates over whole key set
+override def containsKey(key: AnyRef): Boolean = try {
--- End diff --

This reminds me that this whole class could go away once only Scala 2.12 is 
supported, but that's a while away.

Simpler as this?

```
override def containsKey(key: AnyRef): Boolean = key match {
   case a: A => underlying.contains(a)
   case _ => false
}
```

if so then get() could be:

```
override def get(key: AnyRef): B = key match {
  case a: A => underlying.getOrElse(a, null.asInstanceOf[B])
  case _ => null.asInstanceOf[B]
}
```

What about other methods -- anything worth plumbing directly to the 
underlying implementation?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19553#discussion_r146147952
  
--- Diff: core/src/test/scala/org/apache/spark/util/JavaUtils.scala ---
@@ -0,0 +1,37 @@
+/*
--- End diff --

This class needs to be in the same package as what it tests, and be called 
`JavaUtilsSuite`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146147276
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -108,6 +112,10 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
 conn.prepareStatement(
   "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate();
 conn.commit();
+
+
+conn.prepareStatement("CREATE TABLE oracle_types (d BINARY_DOUBLE, f 
BINARY_FLOAT)").executeUpdate();
--- End diff --

No semicolons at the end of lines


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146147286
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -50,9 +52,11 @@ import org.apache.spark.tags.DockerTest
 @DockerTest
 class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with 
SharedSQLContext {
   import testImplicits._
+  // To make === between double tolerate inexact values
+  implicit val doubleEquality = 
TolerantNumerics.tolerantDoubleEquality(0.01)
--- End diff --

Curious why the values would be that unequal somewhere in these tests?
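
For context, a small self-contained sketch of what the implicit changes: scalactic's tolerant equality makes `===` accept a small delta (this assumes scalactic, which ships with ScalaTest, is on the classpath):

```scala
import org.scalactic.{Equality, TolerantNumerics}
import org.scalactic.TripleEquals._

implicit val doubleEquality: Equality[Double] = TolerantNumerics.tolerantDoubleEquality(0.01)

assert(1.10 === 1.105)     // differs by 0.005, inside the 0.01 tolerance
assert(!(1.10 === 1.50))   // differs by 0.4, outside the tolerance
```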


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146147293
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -815,6 +815,12 @@ class JDBCSuite extends SparkFunSuite
   Some(DecimalType(DecimalType.MAX_PRECISION, 10)))
 assert(oracleDialect.getCatalystType(java.sql.Types.NUMERIC, 
"numeric", 0, null) ==
   Some(DecimalType(DecimalType.MAX_PRECISION, 10)))
+assert(oracleDialect.getCatalystType(100, "BINARY_FLOAT", 0, null) ==
--- End diff --

Let's define constants somewhere suitable for 100/100/-101
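
For example, a hedged sketch of such constants (the names are hypothetical and the values mirror Oracle's documented type codes; the assumption that -101 denotes TIMESTAMP WITH TIME ZONE is mine, not from this thread):

```scala
// Hypothetical constants, declared locally so the dialect and its tests do
// not need the Oracle driver jar on the classpath.
object OracleTypeCodes {
  val BINARY_FLOAT: Int = 100
  val BINARY_DOUBLE: Int = 101
  val TIMESTAMPTZ: Int = -101  // assumed meaning of the -101 used in the tests
}
```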


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc ty...

2017-10-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19548#discussion_r146147267
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -307,4 +315,32 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
 assert(values.getInt(1).equals(1))
 assert(values.getBoolean(2).equals(false))
   }
+
+  test("SPARK-22303: handle BINARY_DOUBLE and BINARY_FLOAT as DoubleType 
and FloatType") {
+val tableName = "oracle_types"
+val schema = StructType(Seq(
+  StructField("d", DoubleType, true),
+  StructField("f", FloatType, true)))
+val props = new Properties()
+
+// write it back to the table (append mode)
+val data = spark.sparkContext.parallelize(Seq(Row(1.1D, 2.2F)))
--- End diff --

Nit: I think it's more conventional to write "1.1" and "2.2f"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread taroplus
Github user taroplus commented on the issue:

https://github.com/apache/spark/pull/19548
  
Okay, a new integration test has been added.

According to the Oracle 
[documentation](https://docs.oracle.com/database/121/TTSQL/types.htm#TTSQL148): 

> BINARY_DOUBLE is a 64-bit, double-precision, floating-point number.
> 
> BINARY_FLOAT is a 32-bit, single-precision, floating-point number.
> 
> Both BINARY_FLOAT and BINARY_DOUBLE support the special values Inf, -Inf, 
and NaN (not a number) and conform to the IEEE standard.

This should match Java's double / float definitions.
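
As a hedged illustration of the mapping being discussed (the numeric codes are Oracle's documented values; the actual patch may be shaped differently), a JDBC dialect can translate these vendor-specific type codes into Catalyst types like this:

```scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

// Sketch only: map Oracle's BINARY_DOUBLE (101) and BINARY_FLOAT (100)
// type codes onto Catalyst's DoubleType and FloatType.
object OracleFloatingPointDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case 101 => Some(DoubleType) // 64-bit IEEE double
      case 100 => Some(FloatType)  // 32-bit IEEE float
      case _   => None             // fall back to the default mappings
    }
}
```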


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19548: [SPARK-22303][SQL] Handle Oracle specific jdbc types in ...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19548
  
**[Test build #82965 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82965/testReport)**
 for PR 19548 at commit 
[`91e911d`](https://github.com/apache/spark/commit/91e911d92f96c251de10a86a73f854b4993ede39).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19451
  
**[Test build #82964 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82964/testReport)**
 for PR 19451 at commit 
[`7030ab6`](https://github.com/apache/spark/commit/7030ab6a1ee4f798a39bc076595eb5c15c99a8ba).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19553: [SPARK-22330][CORE] Linear containsKey operation for ser...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19553
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...

2017-10-22 Thread Whoosh
GitHub user Whoosh opened a pull request:

https://github.com/apache/spark/pull/19553

[SPARK-22330][CORE] Linear containsKey operation for serialized maps

…alization.

## What changes were proposed in this pull request?

Use a non-linear containsKey operation for serialized maps by looking up the 
underlying map directly instead of scanning the entry set.

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Whoosh/spark SPARK-22330

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19553.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19553


commit 518b301d32a77a44812235d42f07302edcc4fda2
Author: Alexander Istomin 
Date:   2017-10-22T21:46:50Z

SPARK-22330 use underlying contains instead of default AbstractMap 
realization.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19451
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82962/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19451
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19451
  
**[Test build #82962 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82962/testReport)**
 for PR 19451 at commit 
[`f6c95d3`](https://github.com/apache/spark/commit/f6c95d36c7a91937f0ec27485c8a439d518fdedd).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19552
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82961/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19552
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19552
  
**[Test build #82961 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)**
 for PR 19552 at commit 
[`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19543
  
**[Test build #82963 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82963/testReport)**
 for PR 19543 at commit 
[`c587946`](https://github.com/apache/spark/commit/c587946a03ade7c4f56864e0c0d2c3552d0e6b80).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19543
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82963/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19543
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19543
  
**[Test build #82963 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82963/testReport)**
 for PR 19543 at commit 
[`c587946`](https://github.com/apache/spark/commit/c587946a03ade7c4f56864e0c0d2c3552d0e6b80).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19543: [SPARK-19606][MESOS] Support constraints in spark-dispat...

2017-10-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19543
  
Jenkins, retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19451
  
**[Test build #82962 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82962/testReport)**
 for PR 19451 at commit 
[`f6c95d3`](https://github.com/apache/spark/commit/f6c95d36c7a91937f0ec27485c8a439d518fdedd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

2017-10-22 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19552
  
Ping, @budde .
You can override this PR whenever you're ready.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.c...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19552
  
**[Test build #82961 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82961/testReport)**
 for PR 19552 at commit 
[`a256627`](https://github.com/apache/spark/commit/a256627dbc2772e69cd0f9f2aa43b384165e3657).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

2017-10-22 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19552

[SPARK-22329][SQL] Use NEVER_INFER for 
`spark.sql.hive.caseSensitiveInferenceMode` by default

## What changes were proposed in this pull request?

In Spark 2.2.0, the default value of `spark.sql.hive.caseSensitiveInferenceMode` has a 
critical issue. 

- [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) switched the default to 
`INFER_AND_SAVE` in 2.2.0 because Spark 2.1.0 broke some Hive tables backed by 
case-sensitive data files.

  > This situation will occur for any Hive table that wasn't created by 
Spark or that was created prior to Spark 2.1.0. If a user attempts to run a 
query over such a table containing a case-sensitive field name in the query 
projection or in the query filter, the query will return 0 results in every 
case.

- However, [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306) 
reports that this also corrupts the Hive Metastore schema by removing bucketing 
information (BUCKETING_COLS, SORT_COLS) and changing the table owner. These are undesirable 
side effects: the Hive Metastore is a shared resource, and Spark should not corrupt 
it by default. 

- Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS look 
okay at least. However, we still need to figure out the issue of changing owners, 
and we cannot backport the bucketing patch into `branch-2.2`. We need to verify 
this option with more tests before releasing 2.3.0.

This PR proposes to restore the default of that option to `NEVER_INFER`, the behavior 
before Spark 2.2.0. Users can take the risk of enabling `INFER_AND_SAVE` by 
themselves.
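
For users who do want the 2.2.0 behavior, opting back in stays a one-line setting; a hedged sketch:

```scala
import org.apache.spark.sql.SparkSession

// Re-enable schema inference explicitly while the out-of-the-box default
// stays NEVER_INFER.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()
```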

## How was this patch tested?

Pass the existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19552


commit a256627dbc2772e69cd0f9f2aa43b384165e3657
Author: Dongjoon Hyun 
Date:   2017-10-22T17:59:15Z

[SPARK-22329][SQL] Use NEVER_INFER for 
`spark.sql.hive.caseSensitiveInferenceMode` by default




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17100
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82960/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-22 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
Applied all changes @viirya & @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17100
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17100
  
**[Test build #82960 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82960/testReport)**
 for PR 17100 at commit 
[`75be390`](https://github.com/apache/spark/commit/75be3900f65e64bd083a91e5c63978489a61ccfe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...

2017-10-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17100
  
**[Test build #82960 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82960/testReport)**
 for PR 17100 at commit 
[`75be390`](https://github.com/apache/spark/commit/75be3900f65e64bd083a91e5c63978489a61ccfe).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on Window...

2017-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19370
  
@jsnowacki, per the review comment - 
https://github.com/apache/spark/pull/19370#pullrequestreview-65758626 - let's 
describe the background and the proposal in more detail in the PR description. 
It is true that we lack reviewers for Windows-specific issues, so we should 
describe it as clearly as we can.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on...

2017-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19370#discussion_r146131458
  
--- Diff: bin/find-spark-home.cmd ---
@@ -0,0 +1,44 @@
+@echo off
+
+rem
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements.  See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License.  You may obtain a copy of the License at
+rem
+remhttp://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+rem
+
+rem Path to Python script finding SPARK_HOME
+set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
+
+rem Default to standard python interpreter unless told otherwise
+set PYTHON_RUNNER=python
+if not "x%PYSPARK_DRIVER_PYTHON%" =="x" (
+  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
--- End diff --

`PYSPARK_PYTHON` is also a valid env variable. Could we consider this one too?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19370: [SPARK-18136] Fix setup of SPARK_HOME variable on...

2017-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19370#discussion_r146132096
  
--- Diff: bin/find-spark-home.cmd ---
@@ -0,0 +1,44 @@
+@echo off
+
+rem
+rem Licensed to the Apache Software Foundation (ASF) under one or more
+rem contributor license agreements.  See the NOTICE file distributed with
+rem this work for additional information regarding copyright ownership.
+rem The ASF licenses this file to You under the Apache License, Version 2.0
+rem (the "License"); you may not use this file except in compliance with
+rem the License.  You may obtain a copy of the License at
+rem
+remhttp://www.apache.org/licenses/LICENSE-2.0
+rem
+rem Unless required by applicable law or agreed to in writing, software
+rem distributed under the License is distributed on an "AS IS" BASIS,
+rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+rem See the License for the specific language governing permissions and
+rem limitations under the License.
+rem
+
+rem Path to Python script finding SPARK_HOME
+set FIND_SPARK_HOME_SCRIPT=%~dp0find_spark_home.py
+
+rem Default to standard python interpreter unless told otherwise
+set PYTHON_RUNNER=python
+if not "x%PYSPARK_DRIVER_PYTHON%" =="x" (
+  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
+)
+
+rem Only attempt to find SPARK_HOME if it is not set.
+if "x%SPARK_HOME%"=="x" (
+  rem We are pip installed, use the Python script to resolve a reasonable 
SPARK_HOME
+  if exist "%FIND_SPARK_HOME_SCRIPT%" (
+rem If there is no python installed it will fail with message:
+rem 'python' is not recognized as an internal or external command,
+for /f %%i in ('%PYTHON_RUNNER% %FIND_SPARK_HOME_SCRIPT%') do set 
SPARK_HOME=%%i
--- End diff --

Should we maybe add `"delims="` (`for /f "delims=" %%i`) to override the 
default space and tab delimiters?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


