[GitHub] spark issue #19618: [SPARK-5484][Followup] PeriodicRDDCheckpointer doc clean...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19618
  
**[Test build #83245 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83245/testReport)**
 for PR 19618 at commit 
[`2858cbb`](https://github.com/apache/spark/commit/2858cbb5c8264d7bee592835b56a415961ed1dc4).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83246 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83246/testReport)**
 for PR 19601 at commit 
[`b971506`](https://github.com/apache/spark/commit/b971506f8d5138a2c23e039427d547b736079c13).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83249/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #83249 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83249/testReport)**
 for PR 16677 at commit 
[`e53648e`](https://github.com/apache/spark/commit/e53648e7f58f439bb09a702521c2f84cf2e344bd).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19618: [SPARK-5484][Followup] PeriodicRDDCheckpointer doc clean...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19618
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83245/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19618: [SPARK-5484][Followup] PeriodicRDDCheckpointer doc clean...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19618
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83246/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19601
  
Jenkins, retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16677
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83250 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83250/testReport)**
 for PR 19601 at commit 
[`b971506`](https://github.com/apache/spark/commit/b971506f8d5138a2c23e039427d547b736079c13).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #83251 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83251/testReport)**
 for PR 16677 at commit 
[`e53648e`](https://github.com/apache/spark/commit/e53648e7f58f439bb09a702521c2f84cf2e344bd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19531
  
**[Test build #83252 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83252/testReport)**
 for PR 19531 at commit 
[`a2dbb8e`](https://github.com/apache/spark/commit/a2dbb8eefacf560ff5e7090baa3796b2353daed2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83253/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15178
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83254/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19553: [SPARK-22330][CORE] Linear containsKey operation ...

2017-10-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19553#discussion_r147918030
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala ---
@@ -43,10 +43,17 @@ private[spark] object JavaUtils {
 
 override def size: Int = underlying.size
 
-override def get(key: AnyRef): B = try {
-  underlying.getOrElse(key.asInstanceOf[A], null.asInstanceOf[B])
-} catch {
-  case ex: ClassCastException => null.asInstanceOf[B]
+// Delegate to implementation because AbstractMap implementation 
iterates over whole key set
+override def containsKey(key: AnyRef): Boolean = {
+  underlying.contains(key.asInstanceOf[A])
--- End diff --

No, because Scala Maps do require their key type in contains() and get(). 
https://www.scala-lang.org/api/current/scala/collection/Map.html . This is 
indeed the difference between the APIs that we need to bridge.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15178
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19609: [SPARK-21708][Build] update some sbt plugins

2017-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19609
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15178
  
**[Test build #83255 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83255/testReport)**
 for PR 15178 at commit 
[`a1f2faa`](https://github.com/apache/spark/commit/a1f2faa99f7b6875885820893070c8b039ecddd7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19609: [SPARK-21708][Build] update some sbt plugins

2017-10-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19609


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19614: [SPARK-22399][ML] update the location of reference paper

2017-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19614
  
Merged to master


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19614: [SPARK-22399][ML] update the location of referenc...

2017-10-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19614


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18118: [SPARK-20199][ML] : Provided featureSubsetStrategy to GB...

2017-10-31 Thread pralabhkumar
Github user pralabhkumar commented on the issue:

https://github.com/apache/spark/pull/18118
  
@MLnick  @sethah  please find some time to look into this


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19580
  
jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19580
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19580
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83256/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19605: [SPARK-22394] [SQL] Remove redundant synchronizat...

2017-10-31 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19605#discussion_r147929746
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -89,10 +89,12 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   }
 
   /**
-   * Run some code involving `client` in a [[synchronized]] block and wrap 
certain
-   * exceptions thrown in the process in [[AnalysisException]].
+   * Run some code involving `client` and wrap certain exceptions thrown 
in the process in
+   * [[AnalysisException]]. Thread-safety is guaranteed here because 
methods in the `client`
+   * ([[org.apache.spark.sql.hive.client.HiveClientImpl]]) are already 
synchronized through
+   * `clientLoader` in the `retryLocked` method.
*/
-  private def withClient[T](body: => T): T = synchronized {
+  private def withClient[T](body: => T): T = {
--- End diff --

I went through all methods in `HiveClient` having synchronization (except 
`addJar`):
- `getState`  is used only in test.
- `setOut`, `setInfo` and `setError` are only used in `SparkSQLEnv.init()`.
- all other methods are called through `HiveExternalCatalog`.

So it seems `addJar` is the only exception.

To make `addJar` also go throught `HiveExternalCatalog`, we can pass 
`externalCatalog` instead of `client` at [line46 in 
HiveSessionStateBuilder](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L46).
 But I don't know why we need to call `newSession()` at 
[line45](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45),
 where a new `HiveClientImpl` instance is created, with the same class loader 
and Hive client.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19613: Fixed a typo

2017-10-31 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19613
  
Or in UnsafeArrayData, corrresponding -> corresponding
If you just run a spell check of comments in an IDE, I'm sure a quick pass 
would reveal a lot of little typos. The more you are willing to fix in one 
pass, the better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19615: [SPARK-19611][SQL][followup] set dataSchema correctly in...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19615
  
also cc @dongjoon-hyun @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83250 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83250/testReport)**
 for PR 19601 at commit 
[`b971506`](https://github.com/apache/spark/commit/b971506f8d5138a2c23e039427d547b736079c13).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83250/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19531
  
LGTM, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19531: [SPARK-22310] [SQL] Refactor join estimation to i...

2017-10-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19531


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r147946241
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   schema.fields.map(f => (f.name, f.dataType)).toMap
 stats.colStats.foreach { case (colName, colStat) =>
   colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, 
v) =>
-statsProperties += (columnStatKeyPropName(colName, k) -> v)
+if (k == ColumnStat.KEY_HISTOGRAM) {
+  // In Hive metastore, the length of value in table properties 
cannot be larger than 4000,
+  // so we need to split histogram into multiple key-value 
properties if it's too long.
+  val maxValueLen = 4000
--- End diff --

use `SCHEMA_STRING_LENGTH_THRESHOLD` instead of hardcode, please follow 
`tableMetaToTableProps`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r147946445
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   schema.fields.map(f => (f.name, f.dataType)).toMap
 stats.colStats.foreach { case (colName, colStat) =>
   colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, 
v) =>
-statsProperties += (columnStatKeyPropName(colName, k) -> v)
+if (k == ColumnStat.KEY_HISTOGRAM) {
--- End diff --

don't special-case histogram, we should do it for all stats.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r147947097
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   schema.fields.map(f => (f.name, f.dataType)).toMap
 stats.colStats.foreach { case (colName, colStat) =>
   colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, 
v) =>
-statsProperties += (columnStatKeyPropName(colName, k) -> v)
+if (k == ColumnStat.KEY_HISTOGRAM) {
--- End diff --

just make sure if the limitation is not hit, we still generate a simple 
key-value.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19531
  
**[Test build #83252 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83252/testReport)**
 for PR 19531 at commit 
[`a2dbb8e`](https://github.com/apache/spark/commit/a2dbb8eefacf560ff5e7090baa3796b2353daed2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19531
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83252/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19531
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19615: [SPARK-19611][SQL][followup] set dataSchema correctly in...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19615
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19615: [SPARK-19611][SQL][followup] set dataSchema corre...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19615#discussion_r147951101
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -164,13 +164,12 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
 }
   }
 
-  val (dataSchema, updatedTable) =
-inferIfNeeded(relation, options, fileFormat, Option(fileIndex))
+  val updatedTable = inferIfNeeded(relation, options, fileFormat, 
Option(fileIndex))
--- End diff --

Yeah, we missed this. Actually the variable name is indicating it is data 
schema...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19615: [SPARK-19611][SQL][followup] set dataSchema correctly in...

2017-10-31 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19615
  
thanks, merging to master/2.2!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19615: [SPARK-19611][SQL][followup] set dataSchema corre...

2017-10-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19615


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #83251 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83251/testReport)**
 for PR 16677 at commit 
[`e53648e`](https://github.com/apache/spark/commit/e53648e7f58f439bb09a702521c2f84cf2e344bd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83251/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147950431
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
--- End diff --

Can we use a single function `validateAndTransformSchema` instead? like 
other algorithms


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147950931
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
--- End diff --

it's better to add a checking for duplicated columns, like what `Imputer` 
and `Interaction` do, IMO


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147953998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+  }
+
+  /** Prepares output columns with proper attributes by examining input 
columns. */
+  protected def prepareSchemaWithOutputField(schema: StructType): 
StructType = {
+val inputFields = $(inputCols).map(schema(_))
+val outputColNames = $(outputCols)
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The l

[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15178
  
**[Test build #83255 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83255/testReport)**
 for PR 15178 at commit 
[`a1f2faa`](https://github.com/apache/spark/commit/a1f2faa99f7b6875885820893070c8b039ecddd7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83255/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147959976
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
+  s"The number of input columns ${inputColNames.length} must be the 
same as the number of " +
+s"output columns ${outputColNames.length}.")
+
+inputColNames.zip(outputColNames).map { case (inputColName, 
outputColName) =>
+  require(schema(inputColName).dataType.isInstanceOf[NumericType],
+s"Input column must be of type NumericType but got 
${schema(inputColName).dataType}")
+  require(!existingFields.exists(_.name == outputColName),
+s"Output column $outputColName already exists.")
+}
+  }
+
+  /** Prepares output columns with proper attributes by examining input 
columns. */
+  protected def prepareSchemaWithOutputField(schema: StructType): 
StructType = {
+val inputFields = $(inputCols).map(schema(_))
+val outputColNames = $(outputCols)
+
+val outputFields = inputFields.zip(outputColNames).map { case 
(inputField, outputColName) =>
+  OneHotEncoderCommon.transformOutputColumnSchema(
+inputField, $(dropLast), outputColName)
+}
+StructType(schema.fields ++ outputFields)
+  }
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of 
binary vectors, with
+ * at most a single one-value per row that indicates the input category 
index.
+ * For example with 5 categories, an input value of 2.0 would map to an 
output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last ca

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147960232
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
+val inputColNames = $(inputCols)
+val outputColNames = $(outputCols)
+val existingFields = schema.fields
+
+require(inputColNames.length == outputColNames.length,
--- End diff --

Yeah, based on previous comments 
https://github.com/apache/spark/pull/19527#discussion_r145911522 and 
https://github.com/apache/spark/pull/19527#discussion_r145915847, now I've used 
`withColumns` API which checks duplicated output columns.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r147960630
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params and common methods for OneHotEncoderEstimator 
and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'keep' (invalid data produces a vector of zeros) or 
'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, 
"handleInvalid",
+"How to handle invalid data " +
+"Options are 'keep' (invalid data produces a vector of zeros) or error 
(throw an error).",
+
ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: 
true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+
+  protected def checkParamsValidity(schema: StructType): Unit = {
--- End diff --

I'm ok to follow it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19527
  
Thanks for reviewing @zhengruifeng. I will update it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Ping @hhbyyh, @WeichenXu123, @srowen. 

Seems like the discussion is stuck. Does anybody think that the general 
approach implemented in this PR should be changed? Currently it is filtering 
before sampling with no caching. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19617: [SPARK-22347][PySpark][DOC] Add document to notic...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19617#discussion_r147964546
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2185,6 +2185,12 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even 
be invoked more times than
 it is present in the query.
 
+.. note:: The user-defined functions do not support conditional 
execution by using them with
--- End diff --

Hm, should we maybe clarify the output itself is correct if it does not 
cause the runtime failure by the condition? Maybe I am too much worried but 
think it might mislead the output is incorrect at all.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19617: [SPARK-22347][PySpark][DOC] Add document to notic...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19617#discussion_r147961260
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2185,6 +2185,12 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even 
be invoked more times than
 it is present in the query.
 
+.. note:: The user-defined functions do not support conditional 
execution by using them with
+SQL conditional expressions such `when` or `if`. The functions 
still apply on all rows no
--- End diff --

Looks a tiny typo `` expressions such `when` or `if`. `` -> `` expressions 
such as `when` or `if`. ``.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19580
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19580: [SPARK-11334][CORE] Fix bug in Executor allocation manag...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19580
  
**[Test build #83257 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83257/testReport)**
 for PR 19580 at commit 
[`8abaa2c`](https://github.com/apache/spark/commit/8abaa2cbc4c54dd728b7493ddfedabddd8c8ba7b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

2017-10-31 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19586
  
@ConeyLiu what about the below example, does your implementation support 
this?

```scala

trait Base { val name: String }
case class A(name: String) extends Base
case class B(name: String) extends Base

sc.parallelize(Seq(A("a"), B("b"))).map { i => (i, 1) }.reduceByKey(_ + 
_).collect()
```

Here not all the elements have same class type, does your PR here support 
such scenario?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147935499
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
+val img = ImageIO.read(new ByteArrayInputStream(bytes))
+
+if (img == null) {
+  None
+} else {
+  val isGray = img.getColorModel.getColorSpace.get

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147986087
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema
--- End diff --

Hm, isn't this a JVM object? I think we should do something like:

```python
from pyspark.sql.types import _parse_datatype_json_string
...
jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
return _parse_datatype_json_string(jschema.json())
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147981187
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema
+
+
+def toNDArray(image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image (object): The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+
+def toImage(array, origin="", spark=None):
+"""
+Converts a one-dimensional array to a two-dimensional image.
+
+:param array (array): The array to convert to image
+:param origin (str): Path to the image
+:param spark (SparkSession): The current spark session
+:rtype object: Two dimensional image
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+if array.ndim != 3:
+raise
+height, width, nChannels = array.shape
+ocvTypes = getOcvTypes(spark)
+if nChannels == 1:
+mode = ocvTypes["CV_8UC1"]
+elif nChannels == 3:
+mode = ocvTypes["CV_8UC3"]
+elif nChannels == 4:
+mode = ocvTypes["CV_8UC4"]
+else:
+raise
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147980830
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema
+
+
+def toNDArray(image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image (object): The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+
+def toImage(array, origin="", spark=None):
+"""
+Converts a one-dimensional array to a two-dimensional image.
+
+:param array (array): The array to convert to image
+:param origin (str): Path to the image
+:param spark (SparkSession): The current spark session
+:rtype object: Two dimensional image
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+if array.ndim != 3:
+raise
--- End diff --

Can we throw explicit exception here? Looks an invalid syntax BTW.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147988875
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
--- End diff --

ditto for not passing `spark`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147988776
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
--- End diff --

Hm.. actually, I think we won't need to pass `spark` in this case?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147989197
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
--- End diff --

Would it be difficult to specify what we import here or, are there many 
instances to import here? I think we generally should avoid wildcard import.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147989847
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema
+
+
+def toNDArray(image):
+"""
+Converts an image to a one-dimensional array.
+
+:param image (object): The image to be converted
+:rtype array: The image as a one-dimensional array
+
+.. versionadded:: 2.3.0
+"""
+height = image.height
+width = image.width
+nChannels = image.nChannels
+return np.ndarray(
+shape=(height, width, nChannels),
+dtype=np.uint8,
+buffer=image.data,
+strides=(width * nChannels, nChannels, 1))
+
+
+def toImage(array, origin="", spark=None):
+"""
+Converts a one-dimensional array to a two-dimensional image.
+
+:param array (array): The array to convert to image
+:param origin (str): Path to the image
+:param spark (SparkSession): The current spark session
+:rtype object: Two dimensional image
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+if array.ndim != 3:
+raise
+height, width, nChannels = array.shape
+ocvTypes = getOcvTypes(spark)
+if nChannels == 1:
+mode = ocvTypes["CV_8UC1"]
+elif nChannels == 3:
+mode = ocvTypes["CV_8UC3"]
+elif nChannels == 4:
+mode = ocvTypes["CV_8UC4"]
+else:
+raise
+data = bytearray(array.astype(dtype=np.uint8).ravel())
+# Creating new Row with _create_row(), because Row(name = value, ... )
+# orders fields by name, which conflicts with expected schema order
+# when the new DataFrame is created by UDF
+return _create_row(imageFields,
+   [origin, height, width, nChannels, mode, data])
+
+
+def readImages(path, spark=None, recursive=False, numPartitions=0,
+   dropImageFailures=False, sampleRatio=1.0):
+"""
+Reads the directory of images from the local or remote source.
+
+:param path (str): Path to the image directory
+:param spark (SparkSession): The current spark session
+:param recursive (bool): Recursive search flag
+:param numPartitions (int): Number of DataFrame partitions
+:param dropImageFailures (bool): Drop the files that are not valid 
images
+:param sampleRatio (double): Fraction of the images loaded
+:rtype DataFrame: DataFrame with a single column of "images",
+   see ImageSchema for details
   

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147935560
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala 
---
@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import java.awt.Color
+import java.awt.color.ColorSpace
+import java.io.ByteArrayInputStream
+import javax.imageio.ImageIO
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.input.PortableDataStream
+import org.apache.spark.sql.{DataFrame, Row, SparkSession}
+import org.apache.spark.sql.types._
+
+@Experimental
+@Since("2.3.0")
+object ImageSchema {
+
+  val undefinedImageType = "Undefined"
+
+  val imageFields: Array[String] = Array("origin", "height", "width", 
"nChannels", "mode", "data")
+
+  val ocvTypes: Map[String, Int] = Map(
+undefinedImageType -> -1,
+"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
+  )
+
+  /**
+   * Schema for the image column: Row(String, Int, Int, Int, Int, 
Array[Byte])
+   */
+  val columnSchema = StructType(
+StructField(imageFields(0), StringType, true) ::
+StructField(imageFields(1), IntegerType, false) ::
+StructField(imageFields(2), IntegerType, false) ::
+StructField(imageFields(3), IntegerType, false) ::
+// OpenCV-compatible type: CV_8UC3 in most cases
+StructField(imageFields(4), IntegerType, false) ::
+// Bytes in OpenCV-compatible order: row-wise BGR in most cases
+StructField(imageFields(5), BinaryType, false) :: Nil)
+
+  /**
+   * DataFrame with a single column of images named "image" (nullable)
+   */
+  val imageSchema = StructType(StructField("image", columnSchema, true) :: 
Nil)
+
+  /**
+   * :: Experimental ::
+   * Gets the origin of the image
+   *
+   * @return The origin of the image
+   */
+  def getOrigin(row: Row): String = row.getString(0)
+
+  /**
+   * :: Experimental ::
+   * Gets the height of the image
+   *
+   * @return The height of the image
+   */
+  def getHeight(row: Row): Int = row.getInt(1)
+
+  /**
+   * :: Experimental ::
+   * Gets the width of the image
+   *
+   * @return The width of the image
+   */
+  def getWidth(row: Row): Int = row.getInt(2)
+
+  /**
+   * :: Experimental ::
+   * Gets the number of channels in the image
+   *
+   * @return The number of channels in the image
+   */
+  def getNChannels(row: Row): Int = row.getInt(3)
+
+  /**
+   * :: Experimental ::
+   * Gets the OpenCV representation as an int
+   *
+   * @return The OpenCV representation as an int
+   */
+  def getMode(row: Row): Int = row.getInt(4)
+
+  /**
+   * :: Experimental ::
+   * Gets the image data
+   *
+   * @return The image data
+   */
+  def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5)
+
+  /**
+   * Default values for the invalid image
+   *
+   * @param origin Origin of the invalid image
+   * @return Row with the default values
+   */
+  private def invalidImageRow(origin: String): Row =
+Row(Row(origin, -1, -1, -1, ocvTypes(undefinedImageType), 
Array.ofDim[Byte](0)))
+
+  /**
+   * Convert the compressed image (jpeg, png, etc.) into OpenCV
+   * representation and store it in DataFrame Row
+   *
+   * @param origin Arbitrary string that identifies the image
+   * @param bytes Image bytes (for example, jpeg)
+   * @return DataFrame Row or None (if the decompression fails)
+   */
+  private[spark] def decode(origin: String, bytes: Array[Byte]): 
Option[Row] = {
+
+val img = ImageIO.read(new ByteArrayInputStream(bytes))
+
+if (img == null) {
+  None
+} else {
+  val isGray = img.getColorModel.getColorSpace.get

[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147991609
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
--- End diff --

This looks a JVM object too:

```python
>>> spark.sparkContext._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes()
JavaObject id=o41
>>> 
spark.sparkContext._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes().getClass().toString()
u'class scala.collection.immutable.HashMap$HashTrieMap'
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19617: [SPARK-22347][PySpark][DOC] Add document to notic...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19617#discussion_r147995716
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2185,6 +2185,12 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even 
be invoked more times than
 it is present in the query.
 
+.. note:: The user-defined functions do not support conditional 
execution by using them with
--- End diff --

Yes. I think this is a valid worry. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19617: [SPARK-22347][PySpark][DOC] Add document to notic...

2017-10-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19617#discussion_r147995905
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2185,6 +2185,12 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even 
be invoked more times than
 it is present in the query.
 
+.. note:: The user-defined functions do not support conditional 
execution by using them with
+SQL conditional expressions such `when` or `if`. The functions 
still apply on all rows no
--- End diff --

Oops! Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147997611
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
--- End diff --

Probably, we should convert this one to `java.util.HashMap()` so that 
automatically py4j detects this as `dict`.
Quick workaround I could think is, somewhere:

```scala
import scala.collection.JavaConverters._

val _ocvTypes: util.Map[String, Int] = ImageSchema.ocvTypes.asJava
```

Python side:

```python
spark.sparkContext._jvm.org.apache.spark.ml.image.ImageSchema._ocvTypes()
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147998308
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
+
+
+# DataFrame with a single column of images named "image" (nullable)
+def getImageSchema(spark=None):
+"""
+Returns the image schema
+
+:param spark (SparkSession): The current spark session
+:rtype StructType: The image schema
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema
--- End diff --

I think we should have simple tests if these are supposed to be public APIs 
in Python side in order to to check if they can be called properly at least.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #83258 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83258/testReport)**
 for PR 19527 at commit 
[`4c6cc57`](https://github.com/apache/spark/commit/4c6cc57136a60c577dd0ba2ae90521f7c18ae5d1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r147999393
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,139 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import *
+from pyspark.ml.param.shared import *
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame, SparkSession, SQLContext
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+
+def getOcvTypes(spark=None):
+"""
+Returns the OpenCV type mapping supported
+
+:param sparkSession (SparkSession): The current spark session
+:rtype dict: The OpenCV type mapping supported
+
+.. versionadded:: 2.3.0
+"""
+spark = spark or SparkSession.builder.getOrCreate()
+ctx = spark.sparkContext
+return ctx._jvm.org.apache.spark.ml.image.ImageSchema.ocvTypes
--- End diff --

Or.. just simply define a dict in Python side.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83259/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19617
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83260/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19617
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19617
  
**[Test build #83261 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83261/testReport)**
 for PR 19617 at commit 
[`47bb26e`](https://github.com/apache/spark/commit/47bb26e9fe5a9db8cf666ae16a29d82a3e5e6311).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19617
  
**[Test build #83261 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83261/testReport)**
 for PR 19617 at commit 
[`47bb26e`](https://github.com/apache/spark/commit/47bb26e9fe5a9db8cf666ae16a29d82a3e5e6311).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83261/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19617: [SPARK-22347][PySpark][DOC] Add document to notice users...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19617
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...

2017-10-31 Thread ArtRand
Github user ArtRand commented on a diff in the pull request:

https://github.com/apache/spark/pull/19272#discussion_r148013037
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -233,10 +249,15 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 context.reply(true)
 
   case RetrieveSparkAppConfig =>
+val tokens = if (renewableDelegationTokens.isDefined) {
+  Some(renewableDelegationTokens.get.credentials)
+} else {
+  None
+}
 val reply = SparkAppConfig(
   sparkProperties,
   SparkEnv.get.securityManager.getIOEncryptionKey(),
-  hadoopDelegationCreds)
+  tokens)
--- End diff --

could make `SparkAppConfig` just take a `RenewableDelegationTokens` object? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #83262 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83262/testReport)**
 for PR 19272 at commit 
[`f93c551`](https://github.com/apache/spark/commit/f93c5513178628d6e7c33bd6cc941407484299fb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19621: [SPARK-11215][ml] Add multiple columns support to...

2017-10-31 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/19621

[SPARK-11215][ml] Add multiple columns support to StringIndexer

## What changes were proposed in this pull request?

Add multiple columns support to StringIndexer.

## How was this patch tested?

UT will add soon


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark multi-col-string-indexer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19621


commit faa839052ebe0949882b7b8c0ca3557f26beb8e8
Author: WeichenXu 
Date:   2017-10-31T14:45:34Z

init pr




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #83258 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83258/testReport)**
 for PR 19527 at commit 
[`4c6cc57`](https://github.com/apache/spark/commit/4c6cc57136a60c577dd0ba2ae90521f7c18ae5d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83258/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-10-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19621
  
**[Test build #83263 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83263/testReport)**
 for PR 19621 at commit 
[`faa8390`](https://github.com/apache/spark/commit/faa839052ebe0949882b7b8c0ca3557f26beb8e8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   >