[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161679707

--- Diff: python/run-tests-with-coverage ---
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -o pipefail
+set -e
+
+# This variable indicates which coverage executable to run to combine coverages
+# and generate HTMLs, for example, 'coverage3' in Python 3.
+COV_EXEC="${COV_EXEC:-coverage}"
+FWDIR="$(cd "`dirname $0`"; pwd)"
+pushd "$FWDIR" > /dev/null
--- End diff --

I see, no problem at all. I just wanted to confirm. Thanks!
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153

**[Test build #86160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86160/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).
[GitHub] spark pull request #17280: [SPARK-19939] [ML] Add support for association ru...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/17280#discussion_r161679593

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -319,9 +323,11 @@ object FPGrowthModel extends MLReadable[FPGrowthModel] {

     override def load(path: String): FPGrowthModel = {
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+      implicit val format = DefaultFormats
+      val numTrainingRecords = (metadata.metadata \ "numTrainingRecords").extract[Long]
--- End diff --

Does this break backward compatibility for loading?
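For context on the compatibility question: json4s' `extract[Long]` throws when the field is absent, which is exactly what happens with metadata written before `numTrainingRecords` existed. A minimal sketch of the usual guard, with an illustrative JSON literal and an assumed default of `0L` (not necessarily the fix the PR chose):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val format: Formats = DefaultFormats

// Metadata written by an older release, before "numTrainingRecords" existed.
val oldMetadata = parse("""{"class":"org.apache.spark.ml.fpm.FPGrowthModel"}""")

// `extract[Long]` throws a MappingException on the old JSON; `extractOpt`
// returns None instead, so a default keeps loading backward compatible.
val numTrainingRecords: Long =
  (oldMetadata \ "numTrainingRecords").extractOpt[Long].getOrElse(0L)

println(numTrainingRecords)  // 0
```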
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20153

retest this please
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161677975

--- Diff: python/run-tests-with-coverage ---
[...]
+pushd "$FWDIR" > /dev/null
--- End diff --

my 2c: I think it's ok, and I'd prefer it; it might be useful in the future when more `cd`s are added.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86154/ Test PASSed.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232

Merged build finished. Test PASSed.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232

**[Test build #86154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86154/testReport)** for PR 20232 at commit [`20bbf64`](https://github.com/apache/spark/commit/20bbf64e0ce99538f80e5b6f360a69de93f4d9fc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class OneHotEncoderEstimator(JavaEstimator, HasInputCols, HasOutputCols, HasHandleInvalid,`
   * `class OneHotEncoderModel(JavaModel, JavaMLReadable, JavaMLWritable):`
   * `class HasOutputCols(Params):`
   * `sealed trait Distribution`
   * `case class HashClusteredDistribution(expressions: Seq[Expression]) extends Distribution`
   * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution`
   * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning`
   * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning`
   * `public class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector`
   * `public class OrcColumnarBatchReader extends RecordReader`
   * `class StreamingDataSourceV2Relation(`
   * `case class RateStreamPartitionOffset(`
   * `class RateStreamContinuousReader(options: DataSourceV2Options)`
   * `case class RateStreamContinuousReadTask(`
   * `class RateStreamContinuousDataReader(`
   * `class RateSourceProviderV2 extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister`
   * `class RateStreamMicroBatchReader(options: DataSourceV2Options)`
[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r161677327

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
     }

     case StringType => (obj: Any) => nullSafeConvert(obj) {
+      case _: Calendar => null
       case _ => UTF8String.fromString(obj.toString)
--- End diff --

btw, the array case seems a bit weird?
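For readers outside the Spark codebase, a self-contained sketch of the null-safe conversion pattern the diff relies on. `nullSafeConvert` here is a simplified stand-in for Spark's private helper of the same name, and plain `toString` stands in for `UTF8String.fromString`:

```scala
import java.util.Calendar

// Nulls pass through, a partial function handles known types, and anything
// the partial function rejects becomes null rather than a bogus string.
def nullSafeConvert(input: Any)(f: PartialFunction[Any, Any]): Any =
  if (input == null) null else f.applyOrElse(input, (_: Any) => null)

val toStringValue: Any => Any = obj => nullSafeConvert(obj) {
  case _: Calendar => null       // reject Calendar instead of calling obj.toString
  case other => other.toString   // stand-in for UTF8String.fromString
}

println(toStringValue("abc"))                   // abc
println(toStringValue(Calendar.getInstance()))  // null
println(toStringValue(null))                    // null
```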
[GitHub] spark issue #20267: [SPARK-23068][BUILD][RELEASE] doc build error from jekyl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20267

I see. Thanks for the info! So, is it ready to go anyway, @felixcheung?
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20265

I'll update the PR tomorrow.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20275

Merged build finished. Test PASSed.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20275

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86158/ Test PASSed.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20275

**[Test build #86158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86158/testReport)** for PR 20275 at commit [`1a3cd3a`](https://github.com/apache/spark/commit/1a3cd3aab355a00a73993979896624a8684a9aad).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20168

Overall looks good to me. Just some minor comments regarding code comments and naming.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86156/ Test PASSed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168

Merged build finished. Test PASSed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168

**[Test build #86156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86156/testReport)** for PR 20168 at commit [`896ccc2`](https://github.com/apache/spark/commit/896ccc21582f1610e38dc91a67eca90c8914e2e5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161679707

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+// scalastyle:off line.size.limit
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+    .master("local[1]")
+    .appName("FilterPushdownBenchmark")
+    .config(conf)
+    .getOrCreate()
+
+  // Set default configs. Individual cases will change them if necessary.
+  spark.conf.set(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key, "true")
+  spark.conf.set(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key, "true")
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(dir: File, df: DataFrame): Unit = {
+    val dirORC = dir.getCanonicalPath + "/orc"
+    val dirParquet = dir.getCanonicalPath + "/parquet"
+
+    df.write.mode("overwrite").orc(dirORC)
+    df.write.mode("overwrite").parquet(dirParquet)
+
+    spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+    spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, width: Int, expr: String): Unit = {
+    val benchmark = new Benchmark(s"Filter Pushdown ($expr)", values)
+
+    withTempPath { dir =>
+      withTempTable("t1", "orcTable", "patquetTable") {
+        import spark.implicits._
+        val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+        val df = spark.range(values).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+          .withColumn("id", monotonically_increasing_id())
+
+        df.createOrReplaceTempView("t1")
+        prepareTable(dir, spark.sql("SELECT * FROM t1"))
+
+        Seq(false, true).foreach { value =>
+          benchmark.addCase(s"Parquet Vectorized ${if (value) s"(Pushdown)" else ""}") { _ =>
+            withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$value") {
+              spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+            }
+          }
+        }
+
+        Seq(false, true).foreach { value =>
+          benchmark.addCase(s"Native ORC Vectorized ${if (value) s"(Pushdown)" else ""}") { _ =>
+            withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$value") {
+              spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+            }
+          }
+        }
+
+        // Positive cases: Select one or no rows
+        /*
+        Java HotSpot(TM)
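The `withSQLConf` helper in the quoted benchmark implements a save/override/restore pattern. A self-contained sketch of that pattern, with a plain mutable `Map` standing in for `spark.conf` so it runs without a SparkSession:

```scala
import scala.collection.mutable

// A stand-in for spark.conf; the keys and values here are illustrative.
val conf = mutable.Map("spark.sql.orc.filterPushdown" -> "false")

def withConf(pairs: (String, String)*)(f: => Unit): Unit = {
  // Remember each key's current value (None if the key was unset).
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try f finally saved.foreach {
    case (k, Some(v)) => conf(k) = v // restore the previous value
    case (k, None)    => conf -= k   // the key was unset before; unset it again
  }
}

withConf("spark.sql.orc.filterPushdown" -> "true") {
  assert(conf("spark.sql.orc.filterPushdown") == "true") // override in effect
}
assert(conf("spark.sql.orc.filterPushdown") == "false")  // restored afterwards
```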
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Merged build finished. Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86157/ Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164

**[Test build #86157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86157/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161672411

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161672316

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161671868

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
+        Seq(false, true).foreach { value =>
--- End diff --

Done.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161671835

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20223
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20223

merging to master/2.3. Thanks!
[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20266

**[Test build #86159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86159/testReport)** for PR 20266 at commit [`5afaa28`](https://github.com/apache/spark/commit/5afaa2836133cfc18a52de38d666817991d62c5d).
[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/20216

LGTM now
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161668457

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
+    test(s"Writing empty datasets should not fail - $format") {
+      withTempDir { dir =>
+        Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
--- End diff --

Yep. It's fixed by using `withTempPath`.
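`withTempPath`, unlike `withTempDir`, hands the body a path that does not exist yet, so `DataFrameWriter.save` can create it; everything is removed afterwards. A minimal sketch of the idiom under that assumption (this is not Spark's exact implementation):

```scala
import java.io.File
import java.nio.file.Files

// Recursively remove whatever the body wrote.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles.foreach(deleteRecursively)
  f.delete()
}

def withTempPath(f: File => Unit): Unit = {
  val path = Files.createTempDirectory("spark-test").toFile
  path.delete()                       // the body gets a fresh, non-existent path
  try f(path) finally deleteRecursively(path)
}

withTempPath { path =>
  // e.g. df.write.format("orc").save(path.getCanonicalPath)
  println(path.exists())              // false until the body creates it
}
```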
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161668286

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
[...]
+  Seq("orc", "parquet", "csv", "json").foreach { format =>
+    test(s"Write and read back unicode schema - $format") {
+      withTempPath { path =>
+        val dir = path.getCanonicalPath
+
+        // scalastyle:off nonascii
+        val df = Seq("a").toDF("한글")
+        // scalastyle:on nonascii
+
+        df.write.format(format).option("header", "true").save(dir)
+        val answerDf = spark.read.format(format).option("header", "true").load(dir)
+
+        assert(df.schema === answerDf.schema)
+        checkAnswer(df, answerDf)
+      }
+    }
+  }
+
+  // Only New OrcFileFormat supports this
+  Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
--- End diff --

Yep.
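A runnable sketch of the round trip this test exercises, shown for Parquet only; the session settings are illustrative, and "한글" is simply a Korean-script column name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("unicode-schema")
  .getOrCreate()
import spark.implicits._

// A one-column DataFrame whose column name is non-ASCII.
val df = Seq("a").toDF("한글")

val dir = java.nio.file.Files.createTempDirectory("unicode-schema").toFile
dir.delete()  // the writer will create this path

df.write.format("parquet").save(dir.getCanonicalPath)
val back = spark.read.format("parquet").load(dir.getCanonicalPath)

assert(df.schema == back.schema)  // the Unicode column name round-trips
spark.stop()
```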
[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/20216

I agree with your second suggestion. Before, I did not understand what you meant; now that I have run the test, I understand.
1. In order for collapsible tables to persist on reload, each table must be added to the function at the bottom of web.js. When I refresh the page, a hidden table stays hidden and a displayed table stays displayed.
2. To ensure user-interface consistency.
@ajbozarth @srowen
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20275

**[Test build #86158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86158/testReport)** for PR 20275 at commit [`1a3cd3a`](https://github.com/apache/spark/commit/1a3cd3aab355a00a73993979896624a8684a9aad).
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20168

Btw, I think this isn't only about adding non-integer image formats, so the PR title may need to change too. Something like "Add ImageSchema support for all OpenCV image types"?
[GitHub] spark pull request #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vec...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20275

[SPARK-23085][ML] API parity for mllib.linalg.Vectors.sparse

## What changes were proposed in this pull request?
Make `ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])` support zero-length vectors.

## How was this patch tested?
existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark SparseVector_size

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20275.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20275

commit 8b2876e5c5059a1fb258bff53ae6667df80d3205
Author: Zheng RuiFeng
Date: 2018-01-16T01:46:47Z

    nit

commit 1a3cd3aab355a00a73993979896624a8684a9aad
Author: Zheng RuiFeng
Date: 2018-01-16T05:49:40Z

    update pr
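A short sketch of the proposed behavior, assuming the change is applied: `ml.linalg.Vectors.sparse` should accept an empty element list with a size of 0, matching the `mllib.linalg` parity the title claims:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// With the proposed change, a zero-length sparse vector is legal input,
// mirroring mllib.linalg (previously ml.linalg rejected size == 0).
val empty: Vector = Vectors.sparse(0, Seq.empty[(Int, Double)])

println(empty.size)            // 0
println(empty.toArray.length)  // 0
```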
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665015

--- Diff: python/pyspark/ml/image.py ---
@@ -128,11 +183,17 @@ def toNDArray(self, image):
         height = image.height
         width = image.width
         nChannels = image.nChannels
+        ocvType = self.ocvTypeByMode(image.mode)
+        if nChannels != ocvType.nChannels:
+            raise ValueError(
+                "Image has %d channels but OcvType '%s' expects %d channels." %
--- End diff --

`Image has %d channels but its OcvType ...`
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665832

--- Diff: python/pyspark/ml/tests.py ---
@@ -1843,6 +1844,28 @@ def tearDown(self):

 class ImageReaderTest(SparkSessionTestCase):

+    def test_ocv_types(self):
+        ocvList = ImageSchema.ocvTypes
+        self.assertEqual("Undefined", ocvList[0].name)
+        self.assertEqual(-1, ocvList[0].mode)
+        self.assertEqual("N/A", ocvList[0].dataType)
+        for x in ocvList:
+            self.assertEqual(x, ImageSchema.ocvTypeByName(x.name))
+            self.assertEqual(x, ImageSchema.ocvTypeByMode(x.mode))
+
+    def test_conversions(self):
+        s = np.random.RandomState(seed=987)
+        ary_src = s.rand(4, 10, 10)
--- End diff --

ary_src -> array_src?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161666778

--- Diff: python/pyspark/ml/tests.py ---
[...]
+    def test_conversions(self):
+        s = np.random.RandomState(seed=987)
+        ary_src = s.rand(4, 10, 10)
--- End diff --

s.rand(4, 10, 10) -> s.rand(10, 10, 4)?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664859

--- Diff: python/pyspark/ml/image.py ---
@@ -55,25 +72,66 @@ def imageSchema(self):
         """
         if self._imageSchema is None:
-            ctx = SparkContext._active_spark_context
+            ctx = SparkContext.getOrCreate()
             jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
             self._imageSchema = _parse_datatype_json_string(jschema.json())
         return self._imageSchema

     @property
     def ocvTypes(self):
         """
-        Returns the OpenCV type mapping supported.
+        Return the supported OpenCV types.

-        :return: a dictionary containing the OpenCV type mapping supported.
+        :return: a list containing the supported OpenCV types.

         .. versionadded:: 2.3.0
         """
         if self._ocvTypes is None:
-            ctx = SparkContext._active_spark_context
-            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
-        return self._ocvTypes
+            ctx = SparkContext.getOrCreate()
+            ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()
+            self._ocvTypes = [self._OcvType(name=x.name(),
+                                            mode=x.mode(),
+                                            nChannels=x.nChannels(),
+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ocvTypeList]
+        return self._ocvTypes[:]
+
+    def ocvTypeByName(self, name):
+        """
+        Return the supported OpenCvType with matching name or raise error if there is no matching type.
+
+        :param: str name: OpenCv type name; must be equal to name of one of the supported types.
+        :return: OpenCvType with matching name.
+
+        """
+
+        if self._ocvTypesByName is None:
+            self._ocvTypesByName = {x.name: x for x in self.ocvTypes}
+        if name not in self._ocvTypesByName:
+            raise ValueError(
+                "Can not find matching OpenCvFormat for type = '%s'; supported formats are = %s" %
+                (name, str(self._ocvTypesByName.keys())))
+        return self._ocvTypesByName[name]
+
+    def ocvTypeByMode(self, mode):
+        """
+        Return the supported OpenCvType with matching mode or raise error if there is no matching type.
+
+        :param: int mode: OpenCv type mode; must be equal to mode of one of the supported types.
+        :return: OpenCvType with matching mode.
--- End diff --

`OpenCvType` -> `OcvType`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161661795

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -37,20 +37,67 @@ import org.apache.spark.sql.types._
 @Since("2.3.0")
 object ImageSchema {

-  val undefinedImageType = "Undefined"
+  /**
+   * OpenCv type representation
+   *
+   * @param mode ordinal for the type
+   * @param dataType open cv data type
--- End diff --

+1
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664852

--- Diff: python/pyspark/ml/image.py ---
[...]
+    def ocvTypeByMode(self, mode):
+        """
+        Return the supported OpenCvType with matching mode or raise error if there is no matching type.
--- End diff --

`OpenCvType` -> `OcvType`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664806

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
[...]
+  def ocvTypeByMode(mode: Int): OpenCvType = {
--- End diff --

`getOcvTypeByMode` or `findOcvTypeByMode`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664786

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
[...]
+  def ocvTypeByName(name: String): OpenCvType = {
--- End diff --

`getOcvTypeByName` or `findOcvTypeByName`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161663005

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -37,20 +37,67 @@ import org.apache.spark.sql.types._
 @Since("2.3.0")
 object ImageSchema {

-  val undefinedImageType = "Undefined"
+  /**
+   * OpenCv type representation
+   *
+   * @param mode ordinal for the type
+   * @param dataType open cv data type
+   * @param nChannels number of color channels
+   */
+  case class OpenCvType(mode: Int, dataType: String, nChannels: Int) {
+    def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" }
+    override def toString: String = s"OpenCvType(mode = $mode, name = $name)"
+  }

   /**
-   * (Scala-specific) OpenCV type mapping supported
+   * Return the supported OpenCvType with matching name or raise error if there is no matching type.
+   *
+   * @param name: name of existing OpenCvType
+   * @return OpenCvType that matches the given name
    */
-  val ocvTypes: Map[String, Int] = Map(
-    undefinedImageType -> -1,
-    "CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
-  )
+  def ocvTypeByName(name: String): OpenCvType = {
+    ocvTypes.find(x => x.name == name).getOrElse(
+      throw new IllegalArgumentException("Unknown open cv type " + name))
+  }
+
+  /**
+   * Return the supported OpenCvType with matching mode or raise error if there is no matching type.
+   *
+   * @param mode: mode of existing OpenCvType
+   * @return OpenCvType that matches the given mode
+   */
+  def ocvTypeByMode(mode: Int): OpenCvType = {
+    ocvTypes.find(x => x.mode == mode).getOrElse(
+      throw new IllegalArgumentException("Unknown open cv mode " + mode))
+  }
+
+  val undefinedImageType = OpenCvType(-1, "N/A", -1)
+
+  /**
+   * A Mapping of Type to Numbers in OpenCV
+   *
+   *            C1   C2   C3   C4
+   * CV_8U       0    8   16   24
+   * CV_8S       1    9   17   25
+   * CV_16U      2   10   18   26
+   * CV_16S      3   11   19   27
+   * CV_32S      4   12   20   28
+   * CV_32F      5   13   21   29
+   * CV_64F      6   14   22   30
+   */
+  val ocvTypes: IndexedSeq[OpenCvType] = {
+    val types =
+      for (nc <- Array(1, 2, 3, 4);
--- End diff --

Add a brief header for row/column.
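The mode table in the quoted diff follows a simple formula, which a few lines of Scala can regenerate; the `depths` mapping below is read directly off the table's rows:

```scala
// OpenCV encodes a type's mode as depth + (nChannels - 1) * 8,
// where depth is the row index (CV_8U = 0 through CV_64F = 6).
val depths = Seq("8U" -> 0, "8S" -> 1, "16U" -> 2, "16S" -> 3,
                 "32S" -> 4, "32F" -> 5, "64F" -> 6)

for ((dataType, depth) <- depths; nChannels <- 1 to 4) {
  val mode = depth + (nChannels - 1) * 8
  println(s"CV_${dataType}C$nChannels -> mode $mode")  // e.g. CV_8UC3 -> mode 16
}
```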
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664731

--- Diff: python/pyspark/ml/image.py ---
[...]
+    def ocvTypeByMode(self, mode):
--- End diff --

`getOcvTypeByMode` or `findOcvTypeByMode`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665337 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): --- End diff -- `getOcvTypeByName` or `findOcvTypeByName`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664566 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): +""" +Return the supported OpenCvType with matching name or raise error if there is no matching type. --- End diff -- `OpenCvType` -> `OcvType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161662177 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type + * @param nChannels number of color channels + */ + case class OpenCvType(mode: Int, dataType: String, nChannels: Int) { +def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" } +override def toString: String = s"OpenCvType(mode = $mode, name = $name)" + } /** - * (Scala-specific) OpenCV type mapping supported + * Return the supported OpenCvType with matching name or raise error if there is no matching type. + * + * @param name: name of existing OpenCvType + * @return OpenCvType that matches the given name */ - val ocvTypes: Map[String, Int] = Map( -undefinedImageType -> -1, -"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 - ) + def ocvTypeByName(name: String): OpenCvType = { +ocvTypes.find(x => x.name == name).getOrElse( + throw new IllegalArgumentException("Unknown open cv type " + name)) + } + + /** + * Return the supported OpenCvType with matching mode or raise error if there is no matching type. + * + * @param mode: mode of existing OpenCvType + * @return OpenCvType that matches the given mode + */ + def ocvTypeByMode(mode: Int): OpenCvType = { +ocvTypes.find(x => x.mode == mode).getOrElse( + throw new IllegalArgumentException("Unknown open cv mode " + mode)) + } + + val undefinedImageType = OpenCvType(-1, "N/A", -1) + + /** + * A Mapping of Type to Numbers in OpenCV + * + *C1 C2 C3 C4 + * CV_8U 0 8 16 24 + * CV_8S 1 9 17 25 + * CV_16U 2 10 18 26 + * CV_16S 3 11 19 27 + * CV_32S 4 12 20 28 + * CV_32F 5 13 21 29 + * CV_64F 6 14 22 30 + */ + val ocvTypes: IndexedSeq[OpenCvType] = { +val types = + for (nc <- Array(1, 2, 3, 4); --- End diff -- `numChannel` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664060 --- Diff: mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala --- @@ -83,7 +83,8 @@ class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext { val bytes20 = getData(row).slice(0, 20) val (expectedMode, expectedBytes) = firstBytes20(filename) --- End diff -- Since you use `ocvTypeByName` below to look it up, shouldn't this be named `expectedType` or `expectedTypeName` rather than `expectedMode`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161663481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation --- End diff -- Add a reference link for OpenCV data type? Like this one: https://docs.opencv.org/2.4/modules/core/doc/basic_structures.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664597 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): +""" +Return the supported OpenCvType with matching name or raise error if there is no matching type. + +:param: str name: OpenCv type name; must be equal to name of one of the supported types. +:return: OpenCvType with matching name. --- End diff -- `OpenCvType` -> `OcvType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20249 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86152/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20249 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20249 **[Test build #86152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86152/testReport)** for PR 20249 at commit [`90c4980`](https://github.com/apache/spark/commit/90c49809886e2f487dc4c4dc6ba45aa16bae8933). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20150: [SPARK-22956][SS] Bug fix for 2 streams union fai...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20150 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20150 Thanks for your review! Shixiong --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20150 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #86156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86156/testReport)** for PR 20168 at commit [`896ccc2`](https://github.com/apache/spark/commit/896ccc21582f1610e38dc91a67eca90c8914e2e5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164 **[Test build #86157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86157/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20258: [SPARK-23060][Python] New feature - apply method to exte...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20258 Oh, I see! Yea, they look quite the same. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20258: [SPARK-23060][Python] New feature - apply method to exte...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20258 Is this similar to `Dataset.transform()` in the Java/Scala API? But we don't have a similar API for RDDs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
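For comparison, `Dataset.transform()` on the Scala side is just function application that keeps method chaining fluent; a hypothetical Python equivalent of what this PR is after might look like the sketch below (method name and placement are assumptions, not the PR's actual code):

```python
from pyspark.sql import DataFrame

def transform(self, func):
    """Apply a DataFrame -> DataFrame function, keeping calls chainable."""
    return func(self)

# A hypothetical way to try it out (monkey-patching for illustration only):
DataFrame.transform = transform
# df.transform(with_greeting).transform(lambda d: d.filter(d.id > 0))
```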
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86151/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164 **[Test build #86151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86151/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232 **[Test build #86155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86155/testReport)** for PR 20232 at commit [`77af798`](https://github.com/apache/spark/commit/77af798ab356c23cd4a65c6c7c3f93c66c026982). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20138: [SPARK-20664][core] Delete stale application data...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20138#discussion_r161660926 --- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala --- @@ -663,6 +665,95 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc freshUI.get.ui.store.job(0) } + test("clean up stale app information") { +val storeDir = Utils.createTempDir() +val conf = createTestConf().set(LOCAL_STORE_DIR, storeDir.getAbsolutePath()) +val provider = spy(new FsHistoryProvider(conf)) +val appId = "new1" + +// Write logs for two app attempts. +doReturn(1L).when(provider).getNewLastScanTime() +val attempt1 = newLogFile(appId, Some("1"), inProgress = false) +writeFile(attempt1, true, None, + SparkListenerApplicationStart(appId, Some(appId), 1L, "test", Some("1")), + SparkListenerJobStart(0, 1L, Nil, null), + SparkListenerApplicationEnd(5L) + ) +val attempt2 = newLogFile(appId, Some("2"), inProgress = false) +writeFile(attempt2, true, None, + SparkListenerApplicationStart(appId, Some(appId), 1L, "test", Some("2")), + SparkListenerJobStart(0, 1L, Nil, null), + SparkListenerApplicationEnd(5L) + ) +updateAndCheck(provider) { list => + assert(list.size === 1) + assert(list(0).id === appId) + assert(list(0).attempts.size === 2) +} + +// Load the app's UI. +val ui = provider.getAppUI(appId, Some("1")) +assert(ui.isDefined) + +// Delete the underlying log file for attempt 1 and rescan. The UI should go away, but since +// attempt 2 still exists, listing data should be there. +doReturn(2L).when(provider).getNewLastScanTime() +attempt1.delete() +updateAndCheck(provider) { list => + assert(list.size === 1) + assert(list(0).id === appId) + assert(list(0).attempts.size === 1) +} +assert(!ui.get.valid) +assert(provider.getAppUI(appId, None) === None) + +// Delete the second attempt's log file. Now everything should go away. +doReturn(3L).when(provider).getNewLastScanTime() +attempt2.delete() +updateAndCheck(provider) { list => + assert(list.isEmpty) +} + } + + test("SPARK-21571: clean up removes invalid history files") { +val clock = new ManualClock(TimeUnit.DAYS.toMillis(120)) +val conf = createTestConf().set("spark.history.fs.cleaner.maxAge", s"2d") +val provider = new FsHistoryProvider(conf, clock) { + override def getNewLastScanTime(): Long = clock.getTimeMillis() +} + +// Create 0-byte size inprogress and complete files +val logfile1 = newLogFile("emptyInprogressLogFile", None, inProgress = true) +logfile1.createNewFile() +logfile1.setLastModified(clock.getTimeMillis()) + +val logfile2 = newLogFile("emptyFinishedLogFile", None, inProgress = false) +logfile2.createNewFile() +logfile2.setLastModified(clock.getTimeMillis()) + +// Create an incomplete log file, has an end record but no start record. +val logfile3 = newLogFile("nonEmptyCorruptLogFile", None, inProgress = false) +writeFile(logfile3, true, None, SparkListenerApplicationEnd(0)) +logfile3.setLastModified(clock.getTimeMillis()) + +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 3) + +// Move the clock forward 1 day and scan the files again. They should still be there. +clock.advance(TimeUnit.DAYS.toMillis(1)) +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 3) + +// Move the clock forward another 2 days and scan the files again. This time the cleaner should +// pick up the invalid files and get rid of them. 
+clock.advance(TimeUnit.DAYS.toMillis(2)) +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 0) --- End diff -- I think you should add a case where one file starts out empty, say even for one full day, but then becomes valid before the expiration time, and make sure it does *not* get cleaned up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
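To make the suggested extra case concrete, the retention rule under test is roughly the following (a Python paraphrase under assumed names, not the Scala implementation):

```python
def should_clean(log, now_ms, max_age_ms):
    # A log with no application start event is "invalid"; it is cleaned only
    # once it has stayed invalid past maxAge. A log that becomes valid before
    # the cutoff -- the case suggested above -- must survive the cleaner.
    if log.has_app_start_event:
        return False  # valid logs follow the normal attempt-based policy
    return now_ms - log.last_modified_ms > max_age_ms
```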
[GitHub] spark pull request #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20232#discussion_r161660735 --- Diff: R/pkg/tests/fulltests/test_mllib_classification.R --- @@ -382,10 +382,10 @@ test_that("spark.mlp", { trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) traindf <- as.DataFrame(data[trainidxs, ]) testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) - model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 3)) + model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 2)) --- End diff -- Added. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161659513 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+---+ +Notes on grouping column: --- End diff -- It's interesting to see the discussion in `SPARK-16258`. I think it's quite hard for the API to cover all cases... but `foo(key, pdf)` is the best option so far, I think. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161659245 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- I see what you mean. Now I became neutral but slightly on your side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161659200 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+---+ +Notes on grouping column: --- End diff -- Sorry for the late reply. I agree with @HyukjinKwon; I think we can support both `foo(pdf)` and `foo(key, pdf)` through inspection, as sketched below. I will try to put up a PR soon. As to how to represent the key, I think a tuple might be enough, but a Row would also work. What do you guys think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
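A minimal sketch of the inspection-based dispatch being proposed (the helper name is hypothetical):

```python
import inspect

def _wrap_group_map(func):
    # Count positional parameters to decide between foo(pdf) and foo(key, pdf).
    num_args = len(inspect.getfullargspec(func).args)
    if num_args == 1:
        return lambda key, pdf: func(pdf)
    if num_args == 2:
        return func
    raise ValueError("group map UDF must take either (pdf) or (key, pdf)")
```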
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r161658799 --- Diff: python/pyspark/sql/functions.py --- @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None): .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +3. GROUP_AGG + + A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar + The returnType should be a primitive data type, e.g, `DoubleType()`. + The returned scalar can be either a python primitive type, e.g., `int` or `float` + or a numpy data type, e.g., `numpy.int64` or `numpy.float64`. + + StructType and ArrayType are currently not supported. + + Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg` + + >>> from pyspark.sql.functions import pandas_udf, PandasUDFType + >>> df = spark.createDataFrame( + ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], + ... ("id", "v")) + >>> @pandas_udf("double", PandasUDFType.GROUP_AGG) + ... def mean_udf(v): --- End diff -- Sorry @cloud-fan, I don't understand this comment, could you elaborate? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20056 I see that `LiveListenerBus.droppedEventsCounter` and `lastReportTimestamp` are unused, so it certainly makes sense to clean them up one way or the other -- but that might mean we should delete them, not that we necessarily need to do something else with them. I could see an argument that there are already monitoring systems hooked up to the old metric, ["numEventsDropped"](https://github.com/apache/spark/blob/718bbc939037929ef5b8f4b4fe10aadfbab4408e/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L266), so maybe we should bring back the total with that metric. But do you really want even more logging of the total, beyond the logging from each queue? Seems like it would only be more confusing to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161657719 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- An optional value is okay, but I mean it's better to throw an exception; I am not seeing the advantage of supporting this optionally. @ueshin do you think it's better to support this case? I am less sure of the point of supporting `returnType` when `f` is already a UDF and we aren't allowed to change it. It causes confusion: we appear to allow it, but then issue an exception if the type is different. Is it more important to allow this corner case than to keep the API clear, as if we had `def register(name, f)  # for UDF` alone? We can stay explicit about disallowing `returnType` at register time too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
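Both positions sketched side by side (a hedged paraphrase of the diff and the stricter alternative, not the final code):

```python
# Option A (as in the diff): accept returnType only when it matches the UDF's own.
if returnType is not None and returnType != f.returnType:
    raise ValueError(
        "Invalid returnType: %s does not match the UDF's return type %s"
        % (returnType, f.returnType))

# Option B (the stricter reading argued above): reject it outright, keeping
# register(name, f) the only shape when f is already a UDF.
if returnType is not None:
    raise TypeError("returnType must not be specified when f is already a UDF")
```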
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20257 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20257 **[Test build #86153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86153/testReport)** for PR 20257 at commit [`262c046`](https://github.com/apache/spark/commit/262c0461bd5226b2e99ca5b0c35cf2a372a4892c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20257 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86153/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/20168 @MrBago @tomasatdatabricks the changes look good to me. I went through everything one more time; I'll sign off as soon as the Python tests are fixed (it looks like there were some style issues in the last commit) and all other dev comments are resolved. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r161656633 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,52 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + /** + * Scale adjustment implementation is based on Hive's one, which is itself inspired by + * SQLServer's one. In particular, when a result precision is greater than + * {@link #MAX_PRECISION}, the corresponding scale is reduced to prevent the integral part of a + * result from being truncated. + * + * This method is used only when `spark.sql.decimalOperations.allowPrecisionLoss` is set to true. + * + * @param precision + * @param scale + * @return + */ + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { --- End diff -- So the rule in the document is

```
val resultPrecision = 38
if (intDigits < 32) {
  // This means scale > 6, as intDigits = precision - scale and precision > 38
  val maxScale = 38 - intDigits
  val resultScale = min(scale, maxScale)
} else {
  if (scale < 6) {
    // can't round as scale is already small
    val resultScale = scale
  } else {
    val resultScale = 6
  }
}
```

I think this is a little different from the current rule

```
val minScaleValue = Math.min(scale, 6)
val resultScale = max(38 - intDigits, minScaleValue)
```

Think about the case `intDigits < 32`: SQL Server gives `min(scale, 38 - intDigits)`, while we give `38 - intDigits`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
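A quick numeric check of the two formulations discussed above (38 and 6 stand for MAX_PRECISION and the minimum adjusted scale; a comparison sketch, not Spark code):

```python
def sql_server_rule(precision, scale):
    int_digits = precision - scale
    if int_digits < 32:
        return min(scale, 38 - int_digits)
    return scale if scale < 6 else 6

def current_rule(precision, scale):
    int_digits = precision - scale
    return max(38 - int_digits, min(scale, 6))

# e.g. precision=45, scale=20 (intDigits=25): both rules return 13.
# When precision > 38 and intDigits < 32, scale exceeds 38 - intDigits,
# so min(scale, 38 - intDigits) collapses to 38 - intDigits there as well.
```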
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161656541 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type + * @param nChannels number of color channels + */ + case class OpenCvType(mode: Int, dataType: String, nChannels: Int) { +def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" } +override def toString: String = s"OpenCvType(mode = $mode, name = $name)" + } /** - * (Scala-specific) OpenCV type mapping supported + * Return the supported OpenCvType with matching name or raise error if there is no matching type. + * + * @param name: name of existing OpenCvType + * @return OpenCvType that matches the given name */ - val ocvTypes: Map[String, Int] = Map( -undefinedImageType -> -1, -"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 - ) + def ocvTypeByName(name: String): OpenCvType = { +ocvTypes.find(x => x.name == name).getOrElse( + throw new IllegalArgumentException("Unknown open cv type " + name)) --- End diff -- same minor nitpick: "OpenCV" instead of "open cv", and in code below as well --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232 **[Test build #86154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86154/testReport)** for PR 20232 at commit [`20bbf64`](https://github.com/apache/spark/commit/20bbf64e0ce99538f80e5b6f360a69de93f4d9fc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161655584 --- Diff: python/run-tests-with-coverage --- @@ -0,0 +1,69 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +set -o pipefail +set -e + +# This variable indicates which coverage executable to run to combine coverages +# and generate HTMLs, for example, 'coverage3' in Python 3. +COV_EXEC="${COV_EXEC:-coverage}" +FWDIR="$(cd "`dirname $0`"; pwd)" +pushd "$FWDIR" > /dev/null --- End diff -- Do we need to use `pushd` and its corresponding `popd` at the end of this file? I guess we can simply use `cd` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161656419 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type --- End diff -- small nitpick: I think we should always spell it as "OpenCV" to be consistent in the comments (unless you have any good objections) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/20168 @MrBago @tomasatdatabricks I think the breaking changes are fine: the code was marked experimental, and it is expected that the interfaces will change a lot initially based on early feedback. The PR looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20257 **[Test build #86153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86153/testReport)** for PR 20257 at commit [`262c046`](https://github.com/apache/spark/commit/262c0461bd5226b2e99ca5b0c35cf2a372a4892c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20272: [SPARK-23078] [CORE] allow Spark Thrift Server to run in...
Github user ozzieba commented on the issue: https://github.com/apache/spark/pull/20272 I'm getting stuck on https://github.com/apache-spark-on-k8s/spark-integration/blob/master/integration-test/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L106; I'll look again tomorrow --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20257 @MLnick Thanks for the review. I think I've addressed all the comments. Please take a look at the updates. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r161655115 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala --- @@ -243,17 +279,43 @@ object DecimalPrecision extends TypeCoercionRule { // Promote integers inside a binary expression with fixed-precision decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case b @ BinaryOperator(left, right) if left.dataType != right.dataType => - (left.dataType, right.dataType) match { -case (t: IntegralType, DecimalType.Fixed(p, s)) => - b.makeCopy(Array(Cast(left, DecimalType.forType(t)), right)) -case (DecimalType.Fixed(p, s), t: IntegralType) => - b.makeCopy(Array(left, Cast(right, DecimalType.forType(t -case (t, DecimalType.Fixed(p, s)) if isFloat(t) => - b.makeCopy(Array(left, Cast(right, DoubleType))) -case (DecimalType.Fixed(p, s), t) if isFloat(t) => - b.makeCopy(Array(Cast(left, DoubleType), right)) -case _ => - b - } + nondecimalLiteralAndDecimal(b).lift((left, right)).getOrElse( +nondecimalNonliteralAndDecimal(b).applyOrElse((left.dataType, right.dataType), + (_: (DataType, DataType)) => b)) } + + /** + * Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the + * other side is a decimal. + */ + private def nondecimalLiteralAndDecimal( + b: BinaryOperator): PartialFunction[(Expression, Expression), Expression] = { +// Promote literal integers inside a binary expression with fixed-precision decimals to +// decimals. The precision and scale are the ones needed by the integer value. +case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] + && l.dataType.isInstanceOf[IntegralType] => + b.makeCopy(Array(Cast(l, DecimalType.forLiteral(l)), r)) --- End diff -- What if we don't do this? Requiring more precision seems OK now that we allow precision loss. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
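For context, `forLiteral`/`fromBigDecimal` above give an integral literal the tightest DecimalType that holds its value, rather than the widest type for its Java type (e.g. DecimalType(20, 0) for any Long). A sketch of that sizing logic (illustrative, not the Spark implementation):

```python
from decimal import Decimal

def decimal_type_for_literal(v):
    # precision = number of significant digits; scale = 0 for integral values
    t = Decimal(v).as_tuple()
    precision = len(t.digits)
    scale = max(-t.exponent, 0)
    return (max(precision, scale), scale)

print(decimal_type_for_literal(5))     # (1, 0) rather than (20, 0) for a Long
print(decimal_type_for_literal(1000))  # (4, 0)
```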
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86144/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86144/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext ` * `trait AddColumnEvolutionTest extends SchemaEvolutionTest ` * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest ` * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest ` * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest ` * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161654514 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- I might miss something but I think it's okay to take `returnType` parameter optionally if the value is the same as the udf's. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86143/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20265 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20265 **[Test build #86143 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86143/testReport)** for PR 20265 at commit [`440f76b`](https://github.com/apache/spark/commit/440f76bdbf4d720a361e0afde3599027ff6e7be2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161654136 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- Is it common in our current PySpark impl? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20273: [SPARK-23000] Use fully qualified table names in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20273 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20266 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161653628 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.SharedSQLContext + +class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext { + import testImplicits._ + + Seq("orc", "parquet", "csv", "json", "text").foreach { format => +test(s"Writing empty datasets should not fail - $format") { + withTempDir { dir => + Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp") + } +} + } + + Seq("orc", "parquet", "csv", "json").foreach { format => +test(s"Write and read back unicode schema - $format") { + withTempPath { path => +val dir = path.getCanonicalPath + +// scalastyle:off nonascii +val df = Seq("a").toDF("íê¸") +// scalastyle:on nonascii + +df.write.format(format).option("header", "true").save(dir) +val answerDf = spark.read.format(format).option("header", "true").load(dir) + +assert(df.schema === answerDf.schema) +checkAnswer(df, answerDf) + } +} + } + + // Only New OrcFileFormat supports this + Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName, --- End diff -- `spark.sql.orc.impl` is native by default, can we just use "orc" here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org