[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18624
  
But I agree with the issue @MLnick mentioned: the code now looks convoluted. 
Can you try to simplify it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19661: [SPARK-22450][Core][Mllib]safely register class f...

2017-11-09 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19661#discussion_r150171482
  
--- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala ---
@@ -108,6 +108,27 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext {
     check(Array(Array("1", "2"), Array("1", "2", "3", "4")))
   }
 
+  test("safely register class for mllib/ml") {
+    val conf = new SparkConf(false)
+    val ser = new KryoSerializer(conf)
+
+    Seq("org.apache.spark.mllib.linalg.Vector",
+      "org.apache.spark.mllib.linalg.DenseVector",
+      "org.apache.spark.mllib.linalg.SparseVector",
+      "org.apache.spark.mllib.linalg.Matrix",
+      "org.apache.spark.mllib.linalg.DenseMatrix",
+      "org.apache.spark.mllib.linalg.SparseMatrix",
+      "org.apache.spark.ml.linalg.Vector",
+      "org.apache.spark.ml.linalg.DenseVector",
+      "org.apache.spark.ml.linalg.SparseVector",
+      "org.apache.spark.ml.linalg.Matrix",
+      "org.apache.spark.ml.linalg.DenseMatrix",
+      "org.apache.spark.ml.linalg.SparseMatrix",
+      "org.apache.spark.ml.feature.Instance",
+      "org.apache.spark.ml.feature.OffsetInstance"
+    ).foreach(!Utils.classIsLoadable(_))
--- End diff --

This UT doesn't seem to reflect your purpose above; it looks like it will 
always pass. Also, `conf` and `ser` above seem to be unused here.
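
To illustrate the point of the comment above, here is a small Python analogue (not the PR's Scala code). It assumes a hypothetical `is_loadable` helper standing in for `Utils.classIsLoadable`. A loop that computes a boolean per item and throws it away can never fail, while a loop that asserts on it can.

```python
# Hypothetical set of "loadable" names; a stand-in for the real classpath.
LOADABLE = {"pkg.A"}


def is_loadable(name):
    """Hypothetical stand-in for Utils.classIsLoadable."""
    return name in LOADABLE


def run_discarding(names):
    """Mirrors `.foreach(!classIsLoadable(_))`: the boolean is computed and dropped."""
    for name in names:
        _ = not is_loadable(name)  # result is never checked


def run_asserting(names):
    """Same loop, but each result is asserted, so a loadable name fails the check."""
    for name in names:
        assert not is_loadable(name), "%s is unexpectedly loadable" % name
```

With these definitions, `run_discarding(["pkg.A"])` returns quietly even though `pkg.A` is loadable, whereas `run_asserting(["pkg.A"])` raises `AssertionError`.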


---




[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18624#discussion_r150170451
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -286,40 +288,119 @@ object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
       srcFeatures: RDD[(Int, Array[Double])],
       dstFeatures: RDD[(Int, Array[Double])],
       num: Int): RDD[(Int, Array[(Int, Double)])] = {
-    val srcBlocks = blockify(srcFeatures)
-    val dstBlocks = blockify(dstFeatures)
-    val ratings = srcBlocks.cartesian(dstBlocks).flatMap { case (srcIter, dstIter) =>
-      val m = srcIter.size
-      val n = math.min(dstIter.size, num)
-      val output = new Array[(Int, (Int, Double))](m * n)
+    val srcBlocks = blockify(rank, srcFeatures).zipWithIndex()
+    val dstBlocks = blockify(rank, dstFeatures)
+    val ratings = srcBlocks.cartesian(dstBlocks).map {
+      case (((srcIds, srcFactors), index), (dstIds, dstFactors)) =>
+        val m = srcIds.length
+        val n = dstIds.length
+        val dstIdMatrix = new Array[Int](m * num)
+        val scoreMatrix = Array.fill[Double](m * num)(Double.NegativeInfinity)
+        val pq = new BoundedPriorityQueue[(Int, Double)](num)(Ordering.by(_._2))
+
+        val ratings = srcFactors.transpose.multiply(dstFactors)
+        var i = 0
+        var j = 0
+        while (i < m) {
+          var k = 0
+          while (k < n) {
+            pq += dstIds(k) -> ratings(i, k)
+            k += 1
+          }
+          k = 0
+          pq.toArray.sortBy(-_._2).foreach { case (id, score) =>
+            dstIdMatrix(j + k) = id
+            scoreMatrix(j + k) = score
+            k += 1
+          }
+          // pq.size maybe less than num, corner case
+          j += num
+          i += 1
+          pq.clear()
+        }
+        (index, (srcIds, dstIdMatrix, new DenseMatrix(m, num, scoreMatrix, true)))
+    }
+    ratings.aggregateByKey(null: Array[Int], null: Array[Int], null: DenseMatrix)(
+      (rateSum, rate) => mergeFunc(rateSum, rate, num),
+      (rateSum1, rateSum2) => mergeFunc(rateSum1, rateSum2, num)
+    ).flatMap { case (index, (srcIds, dstIdMatrix, scoreMatrix)) =>
+      // to avoid corner case that the number of items is less than recommendation num
+      var col: Int = 0
+      while (col < num && scoreMatrix(0, col) > Double.NegativeInfinity) {
+        col += 1
+      }
+      val row = scoreMatrix.numRows
+      val output = new Array[(Int, Array[(Int, Double)])](row)
       var i = 0
-      val pq = new BoundedPriorityQueue[(Int, Double)](n)(Ordering.by(_._2))
-      srcIter.foreach { case (srcId, srcFactor) =>
-        dstIter.foreach { case (dstId, dstFactor) =>
-          // We use F2jBLAS which is faster than a call to native BLAS for vector dot product
-          val score = BLAS.f2jBLAS.ddot(rank, srcFactor, 1, dstFactor, 1)
-          pq += dstId -> score
+      while (i < row) {
+        val factors = new Array[(Int, Double)](col)
+        var j = 0
+        while (j < col) {
+          factors(j) = (dstIdMatrix(i * num + j), scoreMatrix(i, j))
+          j += 1
         }
-        pq.foreach { case (dstId, score) =>
-          output(i) = (srcId, (dstId, score))
-          i += 1
+        output(i) = (srcIds(i), factors)
+        i += 1
+      }
+     output.toSeq}
+  }
+
+  private def mergeFunc(rateSum: (Array[Int], Array[Int], DenseMatrix),
+                        rate: (Array[Int], Array[Int], DenseMatrix),
+                        num: Int): (Array[Int], Array[Int], DenseMatrix) = {
+    if (rateSum._1 == null) {
+      rate
+    } else {
+      val row = rateSum._3.numRows
+      var i = 0
+      val tempIdMatrix = new Array[Int](row * num)
+      val tempScoreMatrix = Array.fill[Double](row * num)(Double.NegativeInfinity)
+      while (i < row) {
+        var j = 0
+        var sum_index = 0
+        var rate_index = 0
+        val matrixIndex = i * num
+        while (j < num) {
+          if (rate._3(i, rate_index) > rateSum._3(i, sum_index)) {
+            tempIdMatrix(matrixIndex + j) = rate._2(matrixIndex + rate_index)
+            tempScoreMatrix(matrixIndex + j) = rate._3(i, rate_index)
+            rate_index += 1
+          } else {
+            tempIdMatrix(matrixIndex + j) = rateSum._2(matrixIndex + sum_index)
+            tempScoreMatrix(matrixIndex + j) = rateSum._3(i, sum_index)
+            sum_index += 1
+          }
+          j += 1
         }
-        pq.clear()
+        i += 1
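
As a reading aid for the `mergeFunc` loop quoted above, here is a minimal Python sketch of the same idea: merging two score-descending top-k lists into one top-k list. It is an illustration under assumptions, not the PR's implementation. The function name `merge_top_k` is hypothetical, and it uses variable-length lists of `(id, score)` pairs instead of fixed-width matrices padded with negative infinity.

```python
def merge_top_k(left, right, k):
    """Merge two lists of (id, score) pairs, each sorted by score descending,
    and keep at most the k highest-scoring pairs."""
    merged = []
    i = 0  # read position in `left`
    j = 0  # read position in `right`
    while len(merged) < k and (i < len(left) or j < len(right)):
        take_left = j >= len(right) or (i < len(left) and left[i][1] >= right[j][1])
        if take_left:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged
```

Because both inputs are already sorted, each output slot costs one comparison, so merging two partial results is linear in `k` rather than requiring a re-sort.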
  

[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...

2017-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19257
  
Hi, all.
The master branch still has this problem. Can we proceed with this?


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #83672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83672/testReport)** for PR 19439 at commit [`04db0fd`](https://github.com/apache/spark/commit/04db0fd02ee1abacc65d20c8d12eab8b6539e09f).


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #83671 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83671/testReport)** for PR 19439 at commit [`a6c82ce`](https://github.com/apache/spark/commit/a6c82ceb1752345a2379e8e26f66bbf91b579991).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83671/
Test FAILed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

2017-11-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19702#discussion_r150168485
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1143,6 +1159,18 @@ class SQLConf extends Serializable with Logging {
 
   def isParquetINT64AsTimestampMillis: Boolean = getConf(PARQUET_INT64_AS_TIMESTAMP_MILLIS)
 
+  def parquetOutputTimestampType: ParquetOutputTimestampType.Value = {
+    val isOutputTimestampTypeSet = settings.containsKey(PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
+    if (!isOutputTimestampTypeSet && isParquetINT64AsTimestampMillis) {
+      // If PARQUET_OUTPUT_TIMESTAMP_TYPE is not set and PARQUET_INT64_AS_TIMESTAMP_MILLIS is set,
+      // respect PARQUET_INT64_AS_TIMESTAMP_MILLIS and use TIMESTAMP_MILLIS. Otherwise,
+      // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority.
--- End diff --

If `isParquetINT64AsTimestampMillis` is false, we go to the else branch and 
pick `PARQUET_OUTPUT_TIMESTAMP_TYPE`, which defaults to INT96 (the current 
behavior). Let me add a test.
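
The precedence described in this comment can be sketched in a few lines of Python. This is a simplified model, not Spark's `SQLConf`. The `settings` dict, the use of the constant names as dictionary keys, and the assumption that the legacy flag defaults to false are illustrative choices; the INT96 default and the `TIMESTAMP_MILLIS` value come from the comment and diff above.

```python
def parquet_output_timestamp_type(settings):
    """Resolve the output timestamp type from a dict of explicitly set options.

    Keys are the constant names from the diff, used here as plain strings
    (an assumption made for illustration).
    """
    is_type_set = "PARQUET_OUTPUT_TIMESTAMP_TYPE" in settings
    # Assumed default for the legacy flag: False.
    int64_as_millis = settings.get("PARQUET_INT64_AS_TIMESTAMP_MILLIS", False)
    if not is_type_set and int64_as_millis:
        # Legacy flag is honored only when the new option was not set explicitly.
        return "TIMESTAMP_MILLIS"
    # Otherwise the new option wins, falling back to INT96 when unset.
    return settings.get("PARQUET_OUTPUT_TIMESTAMP_TYPE", "INT96")
```

The four combinations worth testing are: nothing set, only the legacy flag set to true, only the legacy flag set to false, and both options set.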


---




[GitHub] spark issue #19661: [SPARK-22450][Core][Mllib]safely register class for mlli...

2017-11-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19661
  
LGTM


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83670/
Test FAILed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #83670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83670/testReport)** for PR 19439 at commit [`c2a4e19`](https://github.com/apache/spark/commit/c2a4e197eec7749eb660b09a1fd6a7a27df32c39).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #83671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83671/testReport)** for PR 19439 at commit [`a6c82ce`](https://github.com/apache/spark/commit/a6c82ceb1752345a2379e8e26f66bbf91b579991).


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #83670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83670/testReport)** for PR 19439 at commit [`c2a4e19`](https://github.com/apache/spark/commit/c2a4e197eec7749eb660b09a1fd6a7a27df32c39).


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150166261
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+    A singleton-like attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+import numpy as np
+
+
+class _ImageSchema(object):
+    """
+    Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and
+    not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the
+    APIs of this class.
+    """
+
+    def __init__(self):
+        self._imageSchema = None
+        self._ocvTypes = None
+        self._imageFields = None
+        self._undefinedImageType = None
+
+    @property
+    def imageSchema(self):
+        """
+        Returns the image schema.
+
+        :rtype StructType: a DataFrame with a single column of images
+               named "image" (nullable)
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageSchema is None:
+            ctx = SparkContext._active_spark_context
+            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+            self._imageSchema = _parse_datatype_json_string(jschema.json())
+        return self._imageSchema
+
+    @property
+    def ocvTypes(self):
+        """
+        Returns the OpenCV type mapping supported
+
+        :rtype dict: The OpenCV type mapping supported
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._ocvTypes is None:
+            ctx = SparkContext._active_spark_context
+            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema._ocvTypes())
+        return self._ocvTypes
+
+    @property
+    def imageFields(self):
+        """
+        Returns field names of image columns.
+
+        :rtype list: a list of field names.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageFields is None:
+            ctx = SparkContext._active_spark_context
+            self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+        return self._imageFields
+
+    @property
+    def undefinedImageType(self):
+        """
+        Returns the name of undefined image type for the invalid image.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._undefinedImageType is None:
+            ctx = SparkContext._active_spark_context
+            self._undefinedImageType = \
+                ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+        return self._undefinedImageType
+
+    def toNDArray(self, image):
+        """
+        Converts an image to a one-dimensional array.
+
+        :param image: The image to be converted
+        :rtype array: The image as a one-dimensional array
+
+        .. versionadded:: 2.3.0
+        """
+
+        height = image.height
+        width = image.width
+        nChannels = image.nChannels
+        return np.ndarray(
+            shape=(height, width, nChannels),
+            dtype=np.uint8,
+            buffer=image.data,
+            strides=(width * nChannels, nChannels, 1))
+
+    def toImage(self, array, origin=""):
+        """
+        Converts a one-dimensional array to a two-dimensional image.
+
+        :param array array: The array to convert to image
+        :param str origin: Path to the image
+        :rtype object: Two dimensional image
   

[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19715
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150166092
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+    A singleton-like attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+import numpy as np
+
+
+class _ImageSchema(object):
+    """
+    Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and
+    not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the
+    APIs of this class.
+    """
+
+    def __init__(self):
+        self._imageSchema = None
+        self._ocvTypes = None
+        self._imageFields = None
+        self._undefinedImageType = None
+
+    @property
+    def imageSchema(self):
+        """
+        Returns the image schema.
+
+        :rtype StructType: a DataFrame with a single column of images
+               named "image" (nullable)
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageSchema is None:
+            ctx = SparkContext._active_spark_context
+            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+            self._imageSchema = _parse_datatype_json_string(jschema.json())
+        return self._imageSchema
+
+    @property
+    def ocvTypes(self):
+        """
+        Returns the OpenCV type mapping supported
+
+        :rtype dict: The OpenCV type mapping supported
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._ocvTypes is None:
+            ctx = SparkContext._active_spark_context
+            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema._ocvTypes())
+        return self._ocvTypes
+
+    @property
+    def imageFields(self):
+        """
+        Returns field names of image columns.
+
+        :rtype list: a list of field names.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageFields is None:
+            ctx = SparkContext._active_spark_context
+            self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+        return self._imageFields
+
+    @property
+    def undefinedImageType(self):
+        """
+        Returns the name of undefined image type for the invalid image.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._undefinedImageType is None:
+            ctx = SparkContext._active_spark_context
+            self._undefinedImageType = \
+                ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+        return self._undefinedImageType
+
+    def toNDArray(self, image):
+        """
+        Converts an image to a one-dimensional array.
+
+        :param image: The image to be converted
+        :rtype array: The image as a one-dimensional array
+
+        .. versionadded:: 2.3.0
+        """
+
+        height = image.height
+        width = image.width
+        nChannels = image.nChannels
+        return np.ndarray(
+            shape=(height, width, nChannels),
+            dtype=np.uint8,
+            buffer=image.data,
+            strides=(width * nChannels, nChannels, 1))
+
+    def toImage(self, array, origin=""):
+        """
+        Converts a one-dimensional array to a two-dimensional image.
+
+        :param array array: The array to convert to image
+        :param str origin: Path to the image
--- End diff --

yes, do I need to 

[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...

2017-11-09 Thread huaxingao
Github user huaxingao commented on the issue:

https://github.com/apache/spark/pull/19715
  
@MLnick @viirya Could you please review? Thanks!


---




[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...

2017-11-09 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/19715

[SPARK-22397][ML]add multiple columns support to QuantileDiscretizer

## What changes were proposed in this pull request?

Add multi-column support to QuantileDiscretizer.

## How was this patch tested?

Added a UT in QuantileDiscretizerSuite to test multi-column support.





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark_22397

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19715


commit 07bd868956e8d63294b2acb0b5d01a7ca2b35866
Author: Huaxin Gao 
Date:   2017-11-10T06:57:04Z

[SPARK-22397][ML]add multiple columns support to QuantileDiscretizer




---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150165810
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+    A singleton-like attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+import numpy as np
+
+
+class _ImageSchema(object):
+    """
+    Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and
+    not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the
+    APIs of this class.
+    """
+
+    def __init__(self):
+        self._imageSchema = None
+        self._ocvTypes = None
+        self._imageFields = None
+        self._undefinedImageType = None
+
+    @property
+    def imageSchema(self):
+        """
+        Returns the image schema.
+
+        :rtype StructType: a DataFrame with a single column of images
+               named "image" (nullable)
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageSchema is None:
+            ctx = SparkContext._active_spark_context
+            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+            self._imageSchema = _parse_datatype_json_string(jschema.json())
+        return self._imageSchema
+
+    @property
+    def ocvTypes(self):
+        """
+        Returns the OpenCV type mapping supported
+
+        :rtype dict: The OpenCV type mapping supported
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._ocvTypes is None:
+            ctx = SparkContext._active_spark_context
+            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema._ocvTypes())
+        return self._ocvTypes
+
+    @property
+    def imageFields(self):
+        """
+        Returns field names of image columns.
+
+        :rtype list: a list of field names.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageFields is None:
+            ctx = SparkContext._active_spark_context
+            self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+        return self._imageFields
+
+    @property
+    def undefinedImageType(self):
+        """
+        Returns the name of undefined image type for the invalid image.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._undefinedImageType is None:
+            ctx = SparkContext._active_spark_context
+            self._undefinedImageType = \
+                ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+        return self._undefinedImageType
+
+    def toNDArray(self, image):
+        """
+        Converts an image to a one-dimensional array.
+
+        :param image: The image to be converted
+        :rtype array: The image as a one-dimensional array
+
+        .. versionadded:: 2.3.0
+        """
+
+        height = image.height
+        width = image.width
+        nChannels = image.nChannels
+        return np.ndarray(
+            shape=(height, width, nChannels),
+            dtype=np.uint8,
+            buffer=image.data,
+            strides=(width * nChannels, nChannels, 1))
+
+    def toImage(self, array, origin=""):
+        """
+        Converts a one-dimensional array to a two-dimensional image.
--- End diff --

@holdenk done, good catch, changed wording to "Converts an array with 
metadata to a two-dimensional image."
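
For context on the array wording discussed here, the `strides=(width * nChannels, nChannels, 1)` argument in the quoted `toNDArray` code means the flat byte buffer is read as a (height, width, channel) layout. A dependency-free sketch of that index arithmetic follows, assuming one byte per element as with `np.uint8`; the helper name `pixel_value` is hypothetical and not part of the PR.

```python
def pixel_value(data, width, n_channels, row, col, channel):
    """Look up one channel value in a flat, row-major image buffer.

    Mirrors strides of (width * n_channels, n_channels, 1) for 1-byte
    elements: moving one row skips width * n_channels bytes, moving one
    column skips n_channels bytes, moving one channel skips 1 byte.
    """
    offset = row * width * n_channels + col * n_channels + channel
    return data[offset]


# A 2 x 3 image with 4 channels whose bytes are simply 0..23.
sample = bytes(range(2 * 3 * 4))
last_value = pixel_value(sample, width=3, n_channels=4, row=1, col=2, channel=3)
```

Here `last_value` is the final byte of the buffer, since row 1, column 2, channel 3 is the last element of a 2 x 3 x 4 layout.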


---


[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150165229
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+    A singleton-like attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+import numpy as np
+
+
+class _ImageSchema(object):
+    """
+    Internal class for `pyspark.ml.image.ImageSchema` attribute. Meant to be private and
+    not to be instantized. Use `pyspark.ml.image.ImageSchema` attribute to access the
+    APIs of this class.
+    """
+
+    def __init__(self):
+        self._imageSchema = None
+        self._ocvTypes = None
+        self._imageFields = None
+        self._undefinedImageType = None
+
+    @property
+    def imageSchema(self):
+        """
+        Returns the image schema.
+
+        :rtype StructType: a DataFrame with a single column of images
+               named "image" (nullable)
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageSchema is None:
+            ctx = SparkContext._active_spark_context
+            jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
+            self._imageSchema = _parse_datatype_json_string(jschema.json())
+        return self._imageSchema
+
+    @property
+    def ocvTypes(self):
+        """
+        Returns the OpenCV type mapping supported
+
+        :rtype dict: The OpenCV type mapping supported
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._ocvTypes is None:
+            ctx = SparkContext._active_spark_context
+            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema._ocvTypes())
+        return self._ocvTypes
+
+    @property
+    def imageFields(self):
+        """
+        Returns field names of image columns.
+
+        :rtype list: a list of field names.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._imageFields is None:
+            ctx = SparkContext._active_spark_context
+            self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
+        return self._imageFields
+
+    @property
+    def undefinedImageType(self):
+        """
+        Returns the name of undefined image type for the invalid image.
+
+        .. versionadded:: 2.3.0
+        """
+
+        if self._undefinedImageType is None:
+            ctx = SparkContext._active_spark_context
+            self._undefinedImageType = \
+                ctx._jvm.org.apache.spark.ml.image.ImageSchema.undefinedImageType()
+        return self._undefinedImageType
+
+    def toNDArray(self, image):
+        """
+        Converts an image to a one-dimensional array.
+
+        :param image: The image to be converted
+        :rtype array: The image as a one-dimensional array
+
+        .. versionadded:: 2.3.0
+        """
+
+        height = image.height
+        width = image.width
+        nChannels = image.nChannels
+        return np.ndarray(
+            shape=(height, width, nChannels),
+            dtype=np.uint8,
+            buffer=image.data,
+            strides=(width * nChannels, nChannels, 1))
+
+    def toImage(self, array, origin=""):
+        """
+        Converts a one-dimensional array to a two-dimensional image.
+
+        :param array array: The array to convert to image
+        :param str origin: Path to the image
+        :rtype object: Two dimensional image
   

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150164867
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+    A singleton-like attribute of :class:`_ImageSchema` in this module.
--- End diff --

removed the "singleton-like" wording in the doc - please let me know if any 
other changes are needed here



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #83669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83669/testReport)** for PR 19651 at commit [`f644c6a`](https://github.com/apache/spark/commit/f644c6a88b4f24376c67028d0e927a2ee49fedbe).





[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-11-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Retest this please.





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150164406
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1818,6 +1819,24 @@ def tearDown(self):
 del self.data
 
 
+class ImageReaderTest(SparkSessionTestCase):
+
+def test_read_images(self):
+data_path = 'python/test_support/image/kittens'
--- End diff --

done





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150164008
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+.. attribute:: ImageSchema
+
+A singleton-like attribute of :class:`_ImageSchema` in this module.
+
+.. autoclass:: _ImageSchema
+   :members:
+"""
+
+from pyspark import SparkContext
+from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
+from pyspark.sql import DataFrame, SparkSession
+import numpy as np
--- End diff --

done





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150163944
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala ---
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.nio.file.Paths
+import java.util.Arrays
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.image.ImageSchema._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types._
+
+class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext {
+  // Single column of images named "image"
+  private lazy val imagePath = "../data/mllib/images"
+
+  test("Smoke test: create basic ImageSchema dataframe") {
+val origin = "path"
+val width = 1
+val height = 1
+val nChannels = 3
+val data = Array[Byte](0, 0, 0)
+val mode = ocvTypes("CV_8UC3")
+
+// Internal Row corresponds to image StructType
+val rows = Seq(Row(Row(origin, height, width, nChannels, mode, data)),
+  Row(Row(null, height, width, nChannels, mode, data)))
+val rdd = sc.makeRDD(rows)
+val df = spark.createDataFrame(rdd, ImageSchema.imageSchema)
+
+assert(df.count === 2, "incorrect image count")
+assert(df.schema("image").dataType == columnSchema, "data do not fit ImageSchema")
+  }
+
+  test("readImages count test") {
+var df = readImages(imagePath, recursive = false)
+assert(df.count === 1)
+
+df = readImages(imagePath, recursive = true, dropImageFailures = false)
+assert(df.count === 9)
+
+df = readImages(imagePath, recursive = true, dropImageFailures = true)
+val countTotal = df.count
+assert(countTotal === 7)
+
+df = readImages(imagePath, recursive = true, sampleRatio = 0.5, dropImageFailures = true)
--- End diff --

agreed +1





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150163710
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = {
+
   

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150162532
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): Option[Row] = {
+
   

[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19712
  
cc @liancheng @cloud-fan 






[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19712
  
**[Test build #83668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83668/testReport)** for PR 19712 at commit [`e760f52`](https://github.com/apache/spark/commit/e760f52d1c207b63c7ca6ce9de4bd91363e8f28b).





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150161698
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, Array[Byte])
+   */
+  val columnSchema = StructType(
--- End diff --

good idea, done





[GitHub] spark pull request #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EX...

2017-11-09 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19712#discussion_r150161651
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala ---
@@ -521,20 +521,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
-}
-  }
-
-  test("Checks Hive version via SET") {
-withJdbcStatement() { statement =>
-  val resultSet = statement.executeQuery("SET")
-
-  val conf = mutable.Map.empty[String, String]
-  while (resultSet.next()) {
-conf += resultSet.getString(1) -> resultSet.getString(2)
-  }
-
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

The first commit failed this unit test when checking
`spark.sql.hive.metastore.version`. The `SET` command only shows variables
that have been changed. If more unit tests are needed, I can add some.
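
To illustrate the point about `SET` listing only changed variables, here is a toy model (not Spark's configuration code; the class `ToyConf` and the default value used are assumptions made up for this sketch). A key that only has a built-in default does not appear in the listing until it is set explicitly.

```python
class ToyConf:
    """Toy config: defaults are built in, overrides are set explicitly."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def set(self, key, value):
        self._overrides[key] = value

    def get(self, key):
        # A direct lookup falls back to the built-in default.
        return self._overrides.get(key, self._defaults.get(key))

    def list_set(self):
        # Models a bare SET in this sketch: only explicitly changed entries.
        return dict(self._overrides)


key = "spark.sql.hive.metastore.version"
conf = ToyConf({key: "1.2.1"})  # assumed default, for illustration only
assert conf.get(key) == "1.2.1"
assert conf.list_set().get(key) is None  # absent from the listing
conf.set(key, "1.2.1")
assert conf.list_set().get(key) == "1.2.1"  # present once set explicitly
```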





[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19713
  
Merged build finished. Test FAILed.





[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19713
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83664/
Test FAILed.





[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19712
  
add to whitelist





[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19713
  
**[Test build #83664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83664/testReport)** for PR 19713 at commit [`d87f333`](https://github.com/apache/spark/commit/d87f33327b351cea493a065d144044cf2c1a069f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EX...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19712#discussion_r150161287
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala ---
@@ -521,20 +521,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
-}
-  }
-
-  test("Checks Hive version via SET") {
-withJdbcStatement() { statement =>
-  val resultSet = statement.executeQuery("SET")
-
-  val conf = mutable.Map.empty[String, String]
-  while (resultSet.next()) {
-conf += resultSet.getString(1) -> resultSet.getString(2)
-  }
-
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

Could you just give it a try?





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150161295
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", "nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
--- End diff --

done, renamed as javaOcvTypes
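
For context on the numbers in `ocvTypes`: they line up with OpenCV's type encoding, where the type code is the depth constant plus `(channels - 1) << 3`, and `CV_8U` has depth 0. A small sketch of that arithmetic (the function name `cv_type` is made up for illustration and is not part of the PR):

```python
CV_8U = 0        # OpenCV depth constant for unsigned 8-bit data
CV_CN_SHIFT = 3  # number of low bits reserved for the depth


def cv_type(depth, channels):
    """Combine depth and channel count the way OpenCV's CV_MAKETYPE does."""
    depth_mask = (1 << CV_CN_SHIFT) - 1
    return (depth & depth_mask) + ((channels - 1) << CV_CN_SHIFT)


print(cv_type(CV_8U, 1))  # CV_8UC1
print(cv_type(CV_8U, 3))  # CV_8UC3
print(cv_type(CV_8U, 4))  # CV_8UC4
```

This prints 0, 16 and 24, the same values as the map in the diff.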





[GitHub] spark pull request #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EX...

2017-11-09 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/19712#discussion_r150161137
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala ---
@@ -521,20 +521,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
-}
-  }
-
-  test("Checks Hive version via SET") {
-withJdbcStatement() { statement =>
-  val resultSet = statement.executeQuery("SET")
-
-  val conf = mutable.Map.empty[String, String]
-  while (resultSet.next()) {
-conf += resultSet.getString(1) -> resultSet.getString(2)
-  }
-
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

This might require setting `spark.sql.hive.metastore.version` explicitly.





[GitHub] spark pull request #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EX...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19712#discussion_r150160911
  
--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala ---
@@ -521,20 +521,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest {
 conf += resultSet.getString(1) -> resultSet.getString(2)
   }
 
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
-}
-  }
-
-  test("Checks Hive version via SET") {
-withJdbcStatement() { statement =>
-  val resultSet = statement.executeQuery("SET")
-
-  val conf = mutable.Map.empty[String, String]
-  while (resultSet.next()) {
-conf += resultSet.getString(1) -> resultSet.getString(2)
-  }
-
-  assert(conf.get("spark.sql.hive.version") === Some("1.2.1"))
--- End diff --

change it to `spark.sql.hive.metastore.version`?





[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150160540
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the spark recursive flag and then restores it.
+   *
+   * @param value Value to set
+   * @param spark Existing spark session
+   * @param f The function to evaluate after setting the flag
+   * @return Returns the evaluation result T of the function
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = {
+val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+val hadoopConf = spark.sparkContext.hadoopConfiguration
+val old = Option(hadoopConf.get(flagName))
+hadoopConf.set(flagName, value.toString)
+try f finally {
+  old match {
+case Some(v) => hadoopConf.set(flagName, v)
+case None => hadoopConf.unset(flagName)
+  }
+}
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff --

Yes. I'm not sure whether it will be deterministic even if we set a seed,
but I can try that for now. As @thunterdb suggested, we could use some sort
of hash of the filename, but I'm not sure how I would make that
implementation work with a specified ratio. Could you give me more info on
the design? For reference, the earlier suggestion was:

"I would prefer that we do not use a seed and that the result is 
deterministic, based for example on some hash of the file name, to make it more 
robust to future code changes. That being said, there is no fundamental issues 
with the current implementation and other developers may have differing 
opinions, so the current implementation is fine as far as I am concerned."
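
One way the hash-based idea quoted above could work with a ratio: map each file name to a stable integer and keep the file when that integer falls below the ratio's share of the hash range. The following is a rough sketch only; MD5 as the hash and the helper name `keep_path` are assumptions, and neither comes from the PR.

```python
import hashlib


def keep_path(path, sample_ratio):
    """Deterministically decide whether to keep `path` for a given ratio."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Keep the file if its hash lands in the lowest `sample_ratio` share.
    return value < sample_ratio * 2 ** 64


paths = ["img/%04d.jpg" % i for i in range(1000)]
sampled = [p for p in paths if keep_path(p, 0.5)]
# The same inputs always give the same subset, with no seed involved.
assert sampled == [p for p in paths if keep_path(p, 0.5)]
```

With this scheme the kept fraction is only approximately `sample_ratio`, and any file kept at a lower ratio is also kept at a higher one.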






[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/19712
  
cc @gatorsmile again. Would you mind adding me to the Jenkins whitelist?
Thanks, and sorry to bother you.





[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19712
  
**[Test build #83667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83667/testReport)** for PR 19712 at commit [`e760f52`](https://github.com/apache/spark/commit/e760f52d1c207b63c7ca6ce9de4bd91363e8f28b).





[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19714
  
**[Test build #83666 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83666/testReport)**
 for PR 19714 at commit 
[`68dfc42`](https://github.com/apache/spark/commit/68dfc42d80548c1eeb75275df43d4542146a60d4).


---




[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...

2017-11-09 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/19714

[SPARK-22489][SQL] Shouldn't change broadcast join buildSide if user 
clearly specified

## What changes were proposed in this pull request?

How to reproduce:
```scala
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2")))
  .toDF("key", "value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2")))
  .toDF("key", "value").createTempView("table2")

val bl = sql(
  "SELECT /*+ MAPJOIN(t1) */ * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key"
).queryExecution.executedPlan

println(bl.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide)
```
The result is `BuildRight`, but it should be `BuildLeft`. This PR fixes the 
issue.
## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-22489

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19714.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19714


commit 68dfc42d80548c1eeb75275df43d4542146a60d4
Author: Yuming Wang 
Date:   2017-11-10T05:55:51Z

Shouldn't change broadcast join buildSide if user clearly specified




---




[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19707
  
Thanks! Merged to master/2.2


---




[GitHub] spark pull request #19707: [SPARK-22472][SQL] add null check for top-level p...

2017-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19707


---




[GitHub] spark issue #19707: [SPARK-22472][SQL] add null check for top-level primitiv...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19707
  
LGTM


---




[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150157767
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
   

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-11-09 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r150157663
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Used for conversion to python
+   */
+  val _ocvTypes: java.util.Map[String, Int] = ocvTypes.asJava
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
--- End diff --

done


---




[GitHub] spark pull request #19661: [SPARK-22450][Core][Mllib]safely register class f...

2017-11-09 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19661#discussion_r150157116
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -178,6 +179,28 @@ class KryoSerializer(conf: SparkConf)
 
kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$"))
 kryo.register(classOf[ArrayBuffer[Any]])
 
+// We can't load those class directly in order to avoid unnecessary 
jar dependencies.
+// We load them safely, ignore it if the class not found.
+Seq("org.apache.spark.mllib.linalg.Vector",
+  "org.apache.spark.mllib.linalg.DenseVector",
+  "org.apache.spark.mllib.linalg.SparseVector",
+  "org.apache.spark.mllib.linalg.Matrix",
+  "org.apache.spark.mllib.linalg.DenseMatrix",
+  "org.apache.spark.mllib.linalg.SparseMatrix",
+  "org.apache.spark.ml.linalg.Vector",
+  "org.apache.spark.ml.linalg.DenseVector",
+  "org.apache.spark.ml.linalg.SparseVector",
+  "org.apache.spark.ml.linalg.Matrix",
+  "org.apache.spark.ml.linalg.DenseMatrix",
+  "org.apache.spark.ml.linalg.SparseMatrix",
+  "org.apache.spark.ml.feature.Instance",
+  "org.apache.spark.ml.feature.OffsetInstance"
+).map(name => Try(Utils.classForName(name))).foreach { t =>
--- End diff --

updated. thanks for the advice.
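
For readers following the thread, the "load by name, skip if absent" pattern in 
the diff above can be sketched in isolation. This is a hedged illustration, 
not the PR's code: it uses plain `Class.forName` as a stand-in for 
`Utils.classForName`, and `SafeClassLookup` and `loadable` are hypothetical 
names.

```scala
import scala.util.Try

// Hypothetical helper: resolves optional classes without a compile-time dependency.
object SafeClassLookup {
  /** Returns only the classes that can be loaded; names that are not found are skipped. */
  def loadable(names: Seq[String]): Seq[Class[_]] =
    names.flatMap(name => Try(Class.forName(name)).toOption)
}
```

A caller could then register each returned class with the serializer, and a 
test can assert on the returned sequence directly, so it fails when an 
expected class is missing instead of passing unconditionally.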


---




[GitHub] spark issue #19661: [SPARK-22450][Core][Mllib]safely register class for mlli...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19661
  
**[Test build #83665 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83665/testReport)**
 for PR 19661 at commit 
[`d7090bb`](https://github.com/apache/spark/commit/d7090bbf60ea98e9ade9534b78e249b0f25621e4).


---




[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-11-09 Thread pralabhkumar
Github user pralabhkumar commented on the issue:

https://github.com/apache/spark/pull/18118
  
@MLnick Please find some time to review it and let me know if we can 
proceed with this. Thanks


---




[GitHub] spark issue #19702: [SPARK-10365][SQL] Support Parquet logical type TIMESTAM...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19702
  
Will review it tomorrow.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83661/
Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #83661 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83661/testReport)**
 for PR 19272 at commit 
[`45b46ed`](https://github.com/apache/spark/commit/45b46ed6768ea50ddf23063b2a925c2a4794acc7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13599
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83662/
Test PASSed.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13599
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13599
  
**[Test build #83662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83662/testReport)**
 for PR 13599 at commit 
[`8474fbc`](https://github.com/apache/spark/commit/8474fbc001a8c418b210d014b55f5ee71c683d06).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83660/
Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #83660 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83660/testReport)**
 for PR 19272 at commit 
[`8df7e37`](https://github.com/apache/spark/commit/8df7e37517a21d5fbaa2c0e7abfa248fd3ff9be3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19713
  
**[Test build #83664 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83664/testReport)**
 for PR 19713 at commit 
[`d87f333`](https://github.com/apache/spark/commit/d87f33327b351cea493a065d144044cf2c1a069f).


---




[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19713
  
cc @cloud-fan @jiangxb1987 


---




[GitHub] spark pull request #19713: [SPARK-22488] [SQL] Fix the view resolution issue...

2017-11-09 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/19713

[SPARK-22488] [SQL] Fix the view resolution issue in the SparkSession 
internal table() API 

## What changes were proposed in this pull request?
The current internal `table()` API of `SparkSession` bypasses the Analyzer 
and directly calls the `sessionState.catalog.lookupRelation` API. This skips the 
view resolution logic in our Analyzer rule `ResolveRelations`. This internal 
API is widely used by various DDL commands and other internal APIs.

Users might get a strange error caused by view resolution when the 
default database is different.
```
Table or view not found: t1; line 1 pos 14
org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

This PR fixes it by forcing `table()` to use `ResolveRelations` to resolve 
the table.

## How was this patch tested?
Added a test case and modified the existing test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark viewResolution

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19713


commit d87f33327b351cea493a065d144044cf2c1a069f
Author: gatorsmile 
Date:   2017-11-10T03:47:59Z

fix.




---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19712
  
**[Test build #83663 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83663/testReport)**
 for PR 19712 at commit 
[`6071926`](https://github.com/apache/spark/commit/607192603b88f6ed4543587489188f20b9b236e0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19712
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83663/
Test FAILed.


---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19712
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19708: [SPARK-22479][SQL] Exclude credentials from Savei...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19708#discussion_r150150720
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala
 ---
@@ -46,4 +46,6 @@ case class SaveIntoDataSourceCommand(
 
 Seq.empty[Row]
   }
+
+  override def simpleString: String = s"SaveIntoDataSourceCommand 
${dataSource}, ${mode}"
--- End diff --


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2631-L2638

Reuse `spark.redaction.regex`?
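
To make the suggestion concrete, here is a minimal, self-contained sketch of 
regex-based redaction over an options map. It is not the code in `Utils.scala` 
linked above: `OptionRedaction`, `redact`, and the replacement text are 
hypothetical, and the pattern used in the test is only an illustrative value 
for `spark.redaction.regex`.

```scala
import scala.util.matching.Regex

// Hypothetical helper: hides the values of options whose keys look sensitive.
object OptionRedaction {
  // Illustrative placeholder text, an assumption for this sketch.
  val Replacement = "*********(redacted)"

  /** Replaces the value of every entry whose key matches the pattern. */
  def redact(pattern: Regex, options: Map[String, String]): Map[String, String] =
    options.map { case (key, value) =>
      if (pattern.findFirstIn(key).isDefined) key -> Replacement else key -> value
    }
}
```

A `simpleString` override could print the redacted map instead of leaving the 
options out, which would keep non-sensitive options visible in the plan.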


---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19712
  
**[Test build #83663 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83663/testReport)**
 for PR 19712 at commit 
[`6071926`](https://github.com/apache/spark/commit/607192603b88f6ed4543587489188f20b9b236e0).


---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19712
  
ok to test


---




[GitHub] spark pull request #19705: [SPARK-22308][test-maven] Support alternative uni...

2017-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19705


---




[GitHub] spark issue #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19712
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19712: [SPARK-22487][SQL][Hive]Remove the unused HIVE_EX...

2017-11-09 Thread yaooqinn
GitHub user yaooqinn opened a pull request:

https://github.com/apache/spark/pull/19712

[SPARK-22487][SQL][Hive]Remove the unused HIVE_EXECUTION_VERSION property

## What changes were proposed in this pull request?

There is currently no Hive client for execution in Spark, and 
HIVE_EXECUTION_VERSION has no usages anywhere in the Spark project. 
HIVE_EXECUTION_VERSION is set by `spark.sql.hive.version`, which is still set 
internally in some places or by users. This may lead developers and users to 
confuse it with HIVE_METASTORE_VERSION (`spark.sql.hive.metastore.version`).

It would be better to remove it.

## How was this patch tested?

Modified some existing unit tests.

cc @cloud-fan @gatorsmile 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yaooqinn/spark SPARK-22487

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19712.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19712


commit 607192603b88f6ed4543587489188f20b9b236e0
Author: Kent Yao 
Date:   2017-11-10T03:06:32Z

rm unused hive_execution_version




---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19705
  
Thanks! Merged to master.


---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19705
  
To check the syntax, you can run the following command
> dev/lint-scala


---




[GitHub] spark issue #19681: [SPARK-20652][sql] Store SQL UI data in the new app stat...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19681
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83658/
Test PASSed.


---




[GitHub] spark issue #19681: [SPARK-20652][sql] Store SQL UI data in the new app stat...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19681
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15770
  
LGTM. ping @yanboliang 


---




[GitHub] spark issue #19681: [SPARK-20652][sql] Store SQL UI data in the new app stat...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19681
  
**[Test build #83658 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83658/testReport)**
 for PR 19681 at commit 
[`1a31665`](https://github.com/apache/spark/commit/1a31665ab6d3352dee3e15c87a697a7e655eb34c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19459
  
Looks pretty solid. I will take another look today (KST) and merge this one 
in a few days if there are no more comments and other committers are too busy 
to take a look and merge.


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-11-09 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19459
  
@ueshin @HyukjinKwon does this look ready to merge?  cc @cloud-fan 


---




[GitHub] spark issue #19698: [SPARK-20648][core] Port JobsTab and StageTab to the new...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19698
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83659/
Test FAILed.


---




[GitHub] spark issue #19698: [SPARK-20648][core] Port JobsTab and StageTab to the new...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19698
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19698: [SPARK-20648][core] Port JobsTab and StageTab to the new...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19698
  
**[Test build #83659 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83659/testReport)** for PR 19698 at commit [`1d7242b`](https://github.com/apache/spark/commit/1d7242b340b9525feab941c7d61a6dccb8ccc14c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19705
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19705
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83657/
Test PASSed.


---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19705
  
**[Test build #83657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83657/testReport)** for PR 19705 at commit [`12a1d37`](https://github.com/apache/spark/commit/12a1d37ec721a556592cae3c5aff129b6a0663d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-11-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r150134850
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1034,11 +1034,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   schema.fields.map(f => (f.name, f.dataType)).toMap
 stats.colStats.foreach { case (colName, colStat) =>
   colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
-statsProperties += (columnStatKeyPropName(colName, k) -> v)
+val statKey = columnStatKeyPropName(colName, k)
+val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
+if (v.length > threshold) {
+  throw new AnalysisException(s"Cannot persist '$statKey' into hive metastore as " +
--- End diff --

Hive's exception is not friendly to Spark users. A Spark user may not know 
what went wrong with their operation:
```
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) 
org.datanucleus.exceptions.NucleusDataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) 
...
Caused by: java.sql.SQLDataException: A truncation error was encountered trying to shrink VARCHAR 'TFo0QmxvY2smeREAANBdAAALz3IBM0AUAAEAQgPoP/ALAAQUACNAJBAAEy4I&' to length 4000.
...
```
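
For illustration, the early check in the diff can be sketched roughly as below. This is a minimal, self-contained sketch and not the code from the PR: `StatTooLongException`, `StatLengthCheck`, and `checkedStat` are hypothetical names, the exception stands in for Spark's `AnalysisException`, and `threshold` plays the role of `conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)`.

```scala
// Hypothetical stand-in for Spark's AnalysisException, so the sketch compiles on its own.
final class StatTooLongException(message: String) extends Exception(message)

object StatLengthCheck {
  // Returns the key/value pair to persist, or fails early with a readable message
  // instead of letting the metastore raise a low-level truncation error.
  def checkedStat(statKey: String, value: String, threshold: Int): (String, String) = {
    if (value.length > threshold) {
      throw new StatTooLongException(
        s"Cannot persist '$statKey' into the metastore: value length ${value.length} " +
          s"exceeds the threshold $threshold.")
    }
    statKey -> value
  }
}
```

With a check like this, the user sees which statistic was too long and what the limit was, rather than a `SQLDataException` buried in a Hive stack trace.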


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19689
  
Also, I believe anyone can leave a sign-off if it looks good :). 


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13599
  
**[Test build #83662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83662/testReport)** for PR 13599 at commit [`8474fbc`](https://github.com/apache/spark/commit/8474fbc001a8c418b210d014b55f5ee71c683d06).


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #83661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83661/testReport)** for PR 19272 at commit [`45b46ed`](https://github.com/apache/spark/commit/45b46ed6768ea50ddf23063b2a925c2a4794acc7).


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@weichenXu123 Any other comments? Thanks!


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19689
  
cc @cloud-fan for review too.


---




[GitHub] spark issue #19689: [SPARK-22462][SQL] Make rdd-based actions in Dataset tra...

2017-11-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19689
  
@juliuszsompolski No problem. Non-committers can still review. :)



---




[GitHub] spark pull request #19702: [SPARK-10365][SQL] Support Parquet logical type T...

2017-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19702#discussion_r150131141
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1143,6 +1159,18 @@ class SQLConf extends Serializable with Logging {
 
   def isParquetINT64AsTimestampMillis: Boolean = getConf(PARQUET_INT64_AS_TIMESTAMP_MILLIS)
 
+  def parquetOutputTimestampType: ParquetOutputTimestampType.Value = {
+val isOutputTimestampTypeSet = settings.containsKey(PARQUET_OUTPUT_TIMESTAMP_TYPE.key)
+if (!isOutputTimestampTypeSet && isParquetINT64AsTimestampMillis) {
+  // If PARQUET_OUTPUT_TIMESTAMP_TYPE is not set and PARQUET_INT64_AS_TIMESTAMP_MILLIS is set,
+  // respect PARQUET_INT64_AS_TIMESTAMP_MILLIS and use TIMESTAMP_MILLIS. Otherwise,
+  // PARQUET_OUTPUT_TIMESTAMP_TYPE has higher priority.
--- End diff --

BTW, do we have a simple test for this priority? It seems 
`isParquetINT64AsTimestampMillis` defaults to `false`.
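
To make the priority concrete, here is a rough standalone sketch of the resolution rule described in the diff comment, in a form that is easy to test. It is not the PR's code: `TimestampTypePriority`, `OutputTimestampType`, and `resolve` are hypothetical names, and the `INT96` and `TIMESTAMP_MICROS` values and the `INT96` default are assumptions, since only `TIMESTAMP_MILLIS` is named in the diff.

```scala
object TimestampTypePriority {
  // Assumed set of output types; only TIMESTAMP_MILLIS appears in the diff.
  object OutputTimestampType extends Enumeration {
    val INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS = Value
  }
  import OutputTimestampType._

  // explicitType models whether PARQUET_OUTPUT_TIMESTAMP_TYPE was set by the user;
  // int64AsMillis models the older PARQUET_INT64_AS_TIMESTAMP_MILLIS flag.
  def resolve(
      explicitType: Option[OutputTimestampType.Value],
      int64AsMillis: Boolean,
      default: OutputTimestampType.Value = INT96): OutputTimestampType.Value =
    explicitType match {
      case Some(t) => t                              // explicit setting has higher priority
      case None if int64AsMillis => TIMESTAMP_MILLIS // fall back to the older flag
      case None => default                           // neither option was set
    }
}
```

A simple test of the priority would then cover three cases: only the older flag set, both set, and neither set.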


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-11-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #83660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83660/testReport)** for PR 19272 at commit [`8df7e37`](https://github.com/apache/spark/commit/8df7e37517a21d5fbaa2c0e7abfa248fd3ff9be3).


---



