[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3253
  
I'm not aware of a release plan in Zeppelin since I'm just one of the contributors. For the current status, Spark should be downgraded as far as I can tell.


---


[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3253
  
Spark 2.4 support was added at https://github.com/apache/zeppelin/pull/3206, which exactly addresses the issue you faced. This will be available in a new release of Zeppelin.


---


[GitHub] spark issue #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES exceptio...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23226
  
Merged to master.


---




[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Merged to master.


---




[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23270
  
retest this please


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23262
  
Let's remove it. There's no point in keeping an unused method; the code will remain in the commit anyway. Also, there isn't much point in keeping a few-line method that's called in only one place.


---




[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23202
  
A similar discussion is going on at https://github.com/apache/spark/pull/23201#discussion_r240156871. Let me keep tracking them. Sorry for the late response, @MaxGekk.


---




[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23201#discussion_r240156871
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: 
JSONOptions) extends Serializable {
 DecimalType(bigDecimal.precision, bigDecimal.scale)
 }
 decimalTry.getOrElse(StringType)
-  case VALUE_STRING => StringType
+  case VALUE_STRING =>
+val stringValue = parser.getText
--- End diff --

> If we switch the order here, we don't need the length check here, right?

@cloud-fan, that works only if we use the default date/timestamp patterns. Both should do an exact match against the pattern, which unfortunately the current parsing library (SimpleDateFormat) does not allow.

The order here is just to make it look better; neither should depend on the order. I think we should support these inferences after completely switching the library to `java.time.format.*` without a legacy fallback. That should make this change easier and leave no holes.


---




[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23201#discussion_r240153595
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: 
JSONOptions) extends Serializable {
 DecimalType(bigDecimal.precision, bigDecimal.scale)
 }
 decimalTry.getOrElse(StringType)
-  case VALUE_STRING => StringType
+  case VALUE_STRING =>
+val stringValue = parser.getText
--- End diff --

Yea, I tried to match it with CSV a long time ago but kind of gave up due to behaviour changes, IIRC. If that's possible, it would be awesome.

If that's difficult, matching the behaviour within the text-based datasources (meaning CSV and JSON, I guess) should be good enough.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23262
  
LGTM otherwise.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240151214
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -17,51 +17,39 @@
 
 package org.apache.spark.sql.execution
 
+import scala.reflect.runtime.universe.TypeTag
+
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Encoder, Row, SparkSession}
-import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, 
UnknownPartitioning}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.StructType
 
 object RDDConversions {
-  def productToRowRdd[A <: Product](data: RDD[A], outputTypes: 
Seq[DataType]): RDD[InternalRow] = {
+  def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+  outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = 
ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes)
 data.mapPartitions { iterator =>
-  val numColumns = outputTypes.length
-  val mutableRow = new GenericInternalRow(numColumns)
-  val converters = 
outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
   iterator.map { r =>
-var i = 0
-while (i < numColumns) {
-  mutableRow(i) = converters(i)(r.productElement(i))
-  i += 1
-}
-
-mutableRow
+converters.toRow(r)
   }
 }
   }
 
   /**
* Convert the objects inside Row into the types Catalyst expected.
*/
-  def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): 
RDD[InternalRow] = {
+  def rowToRowRdd(data: RDD[Row], outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = RowEncoder(outputSchema)
--- End diff --

I checked each case. Every case looks fine except one:


https://github.com/apache/spark/blob/fa0d4bf69929c5acd676d602e758a969713d19d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L289

It looks like we're going to drop support for `Char` as `StringType`. I think it's trivial, and rather a mistake that we supported this. I don't feel strongly about documenting it in the migration guide, but if anyone does, we'd better do that.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240141142
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -17,51 +17,39 @@
 
 package org.apache.spark.sql.execution
 
+import scala.reflect.runtime.universe.TypeTag
+
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Encoder, Row, SparkSession}
-import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, 
UnknownPartitioning}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.StructType
 
 object RDDConversions {
-  def productToRowRdd[A <: Product](data: RDD[A], outputTypes: 
Seq[DataType]): RDD[InternalRow] = {
+  def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+  outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = 
ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes)
 data.mapPartitions { iterator =>
-  val numColumns = outputTypes.length
-  val mutableRow = new GenericInternalRow(numColumns)
-  val converters = 
outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
   iterator.map { r =>
-var i = 0
-while (i < numColumns) {
-  mutableRow(i) = converters(i)(r.productElement(i))
-  i += 1
-}
-
-mutableRow
+converters.toRow(r)
   }
 }
   }
 
   /**
* Convert the objects inside Row into the types Catalyst expected.
*/
-  def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): 
RDD[InternalRow] = {
+  def rowToRowRdd(data: RDD[Row], outputSchema: StructType): 
RDD[InternalRow] = {
--- End diff --

Let's remove the whole object. `rowToRowRdd` looks like it's only used in one place, and the code here is quite small.


---




[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22683
  
Looks okay to me too, although strictly speaking a UI change requires screenshots of the UI.


---




[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Let's fix the PR description as well. You can mark the comments above as resolved.


---




[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23249#discussion_r240123260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 ---
@@ -262,6 +261,15 @@ case class RangePartitioning(ordering: Seq[SortOrder], 
numPartitions: Int)
 super.satisfies0(required) || {
   required match {
 case OrderedDistribution(requiredOrdering) =>
+  // If `ordering` is a prefix of `requiredOrdering`:
+  //   - Let's say `ordering` is [a, b] and `requiredOrdering` is 
[a, b, c]. If a row is
+  // larger than another row w.r.t. [a, b], it's also larger 
w.r.t. [a, b, c]. So
+  // `RangePartitioning(a, b)` satisfy `OrderedDistribution(a, 
b, c)`.
--- End diff --

nit `satisfy` -> `satisfies`


---




[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23270
  
retest this please


---




[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23132
  
@MaxGekk, mind fixing the PR description accordingly?


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240110126
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

Thanks. It doesn't matter which company it is. All PRs are reviewed equally and reasonably, and merged according to the same criteria.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240107064
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

Ah, it's `:+` not `+:`. Yea, it's confusing.
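
For anyone else skimming this thread, here is a tiny plain-Scala sketch (hypothetical values, not code from the PR) of why the two operators are easy to mix up here:

```scala
val old  = "a,b"          // hypothetical existing value of the conf key
val cols = Seq("c", "d")  // hypothetical columns to append

// `+:` prepends, so the old value comes first (the same order the original loop produced):
cols.+:(old).mkString(",")   // "a,b,c,d"

// `:+` appends, so the old value would end up last instead:
(cols :+ old).mkString(",")  // "c,d,a,b"
```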


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240105070
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

? output is different, no?


---




[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18784
  
Looks inactive. @srowen and @felixcheung, do you know anyone who might be 
interested in this? 


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240101927
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

Right, but I don't think it's more readable. It's a non-critical path, so the performance concern is secondary anyway.


---




[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Yup, the test looks okay that way. Let's file a JIRA and keep only the test case.


---




[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Let's close this one.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240097931
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

Also, it looks like the output would be different after this change.
It was `old + col` before, but the current change looks like it does `col + old`.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240097105
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

I don't think this is more readable. Also, the previous code is more 
performant. I wouldn't change this.


---




[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23266#discussion_r240074711
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java 
---
@@ -20,14 +20,27 @@
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.sources.v2.reader.Scan;
 import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
 
 /**
- * An empty mix-in interface for {@link Table}, to indicate this table 
supports batch scan.
- * 
- * If a {@link Table} implements this interface, its {@link 
Table#newScanBuilder(DataSourceOptions)}
- * must return a {@link ScanBuilder} that builds {@link Scan} with {@link 
Scan#toBatch()}
- * implemented.
- * 
+ * A mix-in interface for {@link Table} to provide data reading ability of 
batch processing.
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends Table {
+
+  /**
+   * Returns the schema of this table.
+   */
+  StructType schema();
--- End diff --

To me, +1 for the current change.


---




[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23266#discussion_r240073831
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java 
---
@@ -20,14 +20,27 @@
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.sources.v2.reader.Scan;
 import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
 
 /**
- * An empty mix-in interface for {@link Table}, to indicate this table 
supports batch scan.
- * 
- * If a {@link Table} implements this interface, its {@link 
Table#newScanBuilder(DataSourceOptions)}
- * must return a {@link ScanBuilder} that builds {@link Scan} with {@link 
Scan#toBatch()}
- * implemented.
- * 
+ * A mix-in interface for {@link Table} to provide data reading ability of 
batch processing.
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends Table {
+
+  /**
+   * Returns the schema of this table.
+   */
+  StructType schema();
+
+  /**
+   * Returns a {@link ScanBuilder} which can be used to build a {@link 
Scan} later. The built
+   * {@link Scan} must implement {@link Scan#toBatch()}. Spark will call 
this method for each data
+   * scanning query.
+   * 
+   * The builder can take some query specific information to do operators 
pushdown, and keep these
+   * information in the created {@link Scan}.
+   * 
+   */
+  ScanBuilder newScanBuilder(DataSourceOptions options);
--- End diff --

+1


---




[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16812
  
I still don't think we need this since the workaround is easy. If other committers find it worthwhile, I won't object.
If there is no interest in this PR afterwards, I would just close it.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
> Visualizing a workflow is nice, but Spark's Pipelines are typically 
pretty straightforward and linear. I could imagine producing a nicer 
visualization than what you get from reading the Spark UI, although of course 
we already have some degree of history and data there.

Another good thing that might have to be considered is that we can connect this with other SQL events, for instance, where the input `Dataset` originated. With current Apache Spark, I can visualise SQL operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine these existing lineages to easily understand where the data comes from and where it goes.

(BTW, I hope it doesn't sound like I'm pushing this for my one specific case - I think this hook-like feature can be useful in many ways. I currently have one explicit example to show, so I'm referring to my case.)

> These are just the hooks, right? someone would have to implement 
something to use these events. I see the value in the API to some degree, but 
with no concrete implementation, does it add anything for Spark users out of 
the box?

Yes, right. It does not add anything for Spark users out of the box. It needs a custom listener implementation. For instance, with the custom listener below:

```scala
class CustomSparkListener extends SparkListener {
  override def onOtherEvent(e: SparkListenerEvent): Unit = e match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```

There are two (existing) ways to use this.

```scala
spark.sparkContext.addSparkListener(new CustomSparkListener)
```

```bash
spark-submit \
  --conf spark.extraListeners=CustomSparkListener \
  ...
```

It's also similar to other existing implementations on the SQL side (the catalog events described above in the PR description).

One actual example I had with a SQL query listener was that I had to close a connection every time after a SQL execution.

> Is that what someone would likely do? or would someone likely have to run 
Atlas to use this? If that's a good example of the use case, and Atlas is 
really about lineage and governance, is that the thrust of this change, to help 
with something to do with model lineage and reproducibility?

There are two reasons.

1. I think people in general would likely utilise this feature like other event listeners. At least, I can see some interest in it outside.

- SQL Listener
  - 
https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
  - 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html

- Streaming Query Listener
  - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
  -  
http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416

2. Someone would likely run this via Atlas so that they could do something about lineage and governance, yes, as you said. I'm trying to show integrated lineages in Apache Spark, but this is a missing piece, and this change is needed to fill it.



---




[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23256
  
Oops,

> tbh, more meaningful dataset as example would be better...

did you expect to fix more instances here, Felix? Sorry, I misread.


---




[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23256
  
Merged to master.


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240004006
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends 
PipelineStage {
* Fits a model to the input data.
*/
   @Since("2.0.0")
-  def fit(dataset: Dataset[_]): M
+  def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) {
+fitImpl(dataset)
+  }
+
+  /**
+   * `fit()` handles events and then calls this method. Subclasses should 
override this
+   * method to implement the actual fiting a model to the input data.
+   */
+  @Since("3.0.0")
+  protected def fitImpl(dataset: Dataset[_]): M = {
+// Keep this default body for backward compatibility.
+throw new UnsupportedOperationException("fitImpl is not implemented.")
--- End diff --

Yes, that was my intention. I wanted to force implementing `fitImpl`, but was thinking that might be too breaking a change (it's going to at least break source compatibility). I am willing to follow other suggestions; I am pretty sure you or others are more familiar with the ML side.
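
For readers following along, here is a small standalone sketch (plain Scala with illustrative names, not the PR's actual classes) of the compatibility trade-off being discussed: making `fitImpl` abstract would break existing subclasses that only override `fit`, while a default body that throws keeps them compiling:

```scala
abstract class Estimator {
  // New entry point: wraps the actual work so events can be sent around it.
  def fit(input: String): String = {
    println(s"FitStart($input)")   // stand-in for posting a start event
    val model = fitImpl(input)
    println(s"FitEnd($model)")     // stand-in for posting an end event
    model
  }

  // The default body keeps source and binary compatibility: subclasses written
  // before fitImpl existed still compile because the method is not abstract.
  protected def fitImpl(input: String): String =
    throw new UnsupportedOperationException("fitImpl is not implemented.")
}

// A pre-existing subclass that overrides fit directly still compiles and works,
// although it bypasses the event wrapper:
class LegacyEstimator extends Estimator {
  override def fit(input: String): String = s"legacy-model($input)"
}

// A new-style subclass implements fitImpl and gets the events for free:
class NewEstimator extends Estimator {
  override protected def fitImpl(input: String): String = s"model($input)"
}
```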


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240003885
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import java.io.File
+
+import scala.collection.mutable
+import scala.concurrent.duration._
+
+import org.apache.hadoop.fs.Path
+import org.mockito.Matchers.{any, eq => meq}
+import org.mockito.Mockito.when
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.concurrent.Eventually
+import org.scalatest.mockito.MockitoSugar.mock
+
+import org.apache.spark.{SparkContext, SparkFunSuite}
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
+import org.apache.spark.sql._
+import org.apache.spark.util.Utils
+
+
+class MLEventsSuite
+extends SparkFunSuite
+with BeforeAndAfterEach
+with DefaultReadWriteTest
+with Eventually {
+
+  private var spark: SparkSession = _
+  private var sc: SparkContext = _
+  private var checkpointDir: String = _
+  private var listener: SparkListener = _
+  private val dirName: String = "pipeline"
+  private val events = mutable.ArrayBuffer.empty[MLEvent]
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+sc = new SparkContext("local[2]", "SparkListenerSuite")
+listener = new SparkListener {
+  override def onOtherEvent(event: SparkListenerEvent): Unit = event 
match {
+case e: FitStart[_] => events.append(e)
+case e: FitEnd[_] => events.append(e)
+case e: TransformStart => events.append(e)
+case e: TransformEnd => events.append(e)
+case e: SaveInstanceStart if e.path.endsWith(dirName) => 
events.append(e)
+case e: SaveInstanceEnd if e.path.endsWith(dirName) => 
events.append(e)
+case _ =>
+  }
+}
+sc.addSparkListener(listener)
+
+spark = SparkSession.builder()
+  .sparkContext(sc)
+  .getOrCreate()
+
+checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, 
"checkpoints").toString
+sc.setCheckpointDir(checkpointDir)
--- End diff --

Let me double check and address this while fixing the test. I just copied 
this from `MLlibTestSparkContext`.


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240003869
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala ---
@@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") (
* @return fitted pipeline
*/
   @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): PipelineModel = {
+  override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset)
--- End diff --

Ah, it's there just to keep the `@Since`. It looks like some classes don't explicitly note that, so I didn't call `super` in other places.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
The tests pass locally for me. I'll fix them shortly.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
cc @srowen, @cloud-fan (since it mimics SQL's event listeners), @jkbradley, @mengxr and @yanboliang. Mind taking a look, please? WDYT about this?


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r23747
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
 }
   }
 
-  protected def transformImpl(dataset: Dataset[_]): DataFrame = {
+  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
--- End diff --

`transformImpl` (in some abstractions) and `saveImpl` already exist.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/23261


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23263

[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track and 
add some actions for them.

## Introduction

This PR proposes to send some ML events, like SQL does. This is quite useful when people want to track and take actions on the corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes on top of this PR, I can visualise an ML pipeline as below:


![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is useful to be able to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now, with no guarantee about stability).

## Implementation Details

### Sends event (but not expose ML specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...StartEvent(caller, input)
@Unstable
case class ...EndEvent(caller, output)

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```
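
To make the shape of those wrappers a bit more concrete, below is a hedged sketch of what one event pair and its wrapper could look like. The exact fields, names and signatures are assumptions for illustration, not the final API in this PR; it assumes the code lives under `org.apache.spark.ml` so that the `private[spark]` listener bus is reachable:

```scala
package org.apache.spark.ml

import org.apache.spark.SparkContext
import org.apache.spark.annotation.Unstable
import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.sql.Dataset

// Events go through the existing listener bus, so they extend SparkListenerEvent.
@Unstable sealed trait MLEvent extends SparkListenerEvent
@Unstable case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent
@Unstable case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent

private[ml] object MLEvents {
  // Wraps the body of fit()/fitImpl() and posts start/end events around it.
  def withFitEvent[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_])(fit: => M): M = {
    val listenerBus = SparkContext.getOrCreate().listenerBus
    listenerBus.post(FitStart(estimator, dataset))  // posted before the body runs
    val model = fit                                 // the actual fitImpl body
    listenerBus.post(FitEnd(estimator, model))      // posted after it completes
    model
  }
}
```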

This way mimics both:

**1. Catalog events (see 
`org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added

- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's too invasive. Also, exposing another ML-specific listener sounds like a bit too much at this stage. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see 
`org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.


### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

  Note that `saveImpl` was already implemented unlike other instances below.


```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { 
transformImpl(...) }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way as below in ML:

**1. `transform` and `transformImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

_This keeps both source and binary backward compatibility_. I was thinking of enforcing `...Impl` by leaving them as abstract methods to force implementation, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, _there's no difference_

[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23261
  
I'm going to rebase and reopen a PR to retrigger the AppVeyor tests. The failure looks unrelated.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23261#discussion_r239998861
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -163,7 +163,7 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
*/
   @Since("1.6.0")
   @throws[IOException]("If the input path already exists but overwrite is 
not enabled.")
-  def save(path: String): Unit = {
+  def save(path: String): Unit = MLEvents.withSaveInstanceEvent(this, 
path) {
 new FileSystemOverwrite().handleOverwrite(path, shouldOverwrite, sc)
 saveImpl(path)
--- End diff --

and `saveImpl` already exists as well.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23261#discussion_r239998855
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
 }
   }
 
-  protected def transformImpl(dataset: Dataset[_]): DataFrame = {
+  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
--- End diff --

For clarification, some of the abstractions already have `transformImpl`.


---




[GitHub] spark pull request #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23261

[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track and add some actions for them.

## Introduction

This PR proposes to send some ML events, like SQL does. This is quite useful when people want to track and take actions on the corresponding ML operations. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:


![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is useful to be able to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Sends event (but not expose ML specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This way mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added
- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's invasive. Also, exposing another ML-specific listener sounds like a bit too much at the current stage. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.


### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

  Note that `saveImpl` was already implemented unlike other instances below.


```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl() }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { 
transformImpl() }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way as below in ML:

**1. `transform` and `transformImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility. I was thinking of enforcing `...Impl` by leaving them as abstract methods, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, there's no difference. `...Impl` methods are protected and not visible to end users.

- From the developer API perspective, if some developers want to override the `...` methods instead of `...Impl`, that's still fine. It only does

[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701916
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

BTW, when I added that to https://spark.apache.org/contributing.html, we also agreed to follow the committer's judgement based on the guide, because the guide mentions:

> The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code.

since we do have a legacy reason, and there is a good reason: consistency and the committer's judgement.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701364
  
--- Diff: R/pkg/tests/fulltests/test_mllib_clustering.R ---
@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
   expect_equal(length(local.posterior), sum(unlist(local.posterior)))
 })
 
+test_that("spark.assignClusters", {
+  df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+ list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+ list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+  clusters <- spark.assignClusters(df, initMode = "degree", weightCol = 
"weight")
+  expected_result <- createDataFrame(list(list(4L, 1L),
+  list(0L, 0L),
+  list(1L, 0L),
+  list(3L, 1L),
+  list(2L, 0L)),
+  schema = c("id", "cluster"))
--- End diff --

ditto for style


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701069
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

There are two separate styles already mixed in the R code, IIRC:

```r
df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
  list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
```

or

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
   list(4L, 0L, 0.1)),
  schema = c("src", "dst", "weight"))
```

Let's avoid mixing styles and go for the latter when possible, because at least that complies better with the code style.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239700846
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

Yea, we do have an indentation rule. See the "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented, but at least there are some examples. I think the correct indentation is:

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
   list(4L, 0L, 0.1)),
  schema = c("src", "dst", "weight"))
``` 


---




[GitHub] spark pull request #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES e...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23226#discussion_r239662927
  
--- Diff: 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 ---
@@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() {
 } catch (IllegalArgumentException e) {
   // expected exception
 }
+
+try {
+  new BytesToBytesMap(
+taskMemoryManager,
+1,
+TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES+1);
--- End diff --

`E_BYTES+1` -> `E_BYTES + 1`


---




[GitHub] spark issue #23248: [SPARK-26293][SQL] Cast exception when having python udf...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23248
  
Thanks, @cloud-fan. I will take a look by tomorrow; don't be blocked by me.


---




[GitHub] spark pull request #23248: [SPARK-26293][SQL] Cast exception when having pyt...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23248#discussion_r239441136
  
--- Diff: python/pyspark/sql/tests/test_udf.py ---
@@ -23,7 +23,7 @@
 
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, Column, Row
-from pyspark.sql.functions import UserDefinedFunction
+from pyspark.sql.functions import UserDefinedFunction, udf
--- End diff --

Ah, yea. It's okay, and I think it's good timing to clean up while we are here, now that it's broken down into multiple test files.


---




[GitHub] spark issue #23245: [SPARK-26060][SQL][FOLLOW-UP] Rename the config name.

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23245
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23246: [SPARK-26292][CORE]Assert statement of currentPage may b...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23246
  
Ditto. Let's close these and focus on fixing actual bugs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23247
  
@wangjiaochun, I think you had better stop fixing trivial stuff in each PR. 
That stuff can be fixed when the code around it is fixed, or other 
people can fix it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23215
  
+1 from me too. It makes sense to me, and having a configuration sounds good.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23243: [branch-2.4][ExternalShuffleService]add initRegisteredEx...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23243
  
A backport from which JIRA, @weixiuli? Usually a fix should go to master 
first and then be backported to other branches when it's needed. If it should 
be fixed in the master branch as well, let's file a JIRA and switch the target 
branch to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23237
  
Looks like some classes, for instance `KafkaUtils`, expose this (I guess 
mistakenly?). Let's not bother with this and close this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23237
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Thanks guys .. I will try to take a look at it as well (although it's 
going to take a relatively long while). Let me get this in for now anyway.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
If a test passes both with and without this change, it doesn't verify the 
regression.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Thanks for a close look, @BryanCutler. I just increased the timeout because 
I was trying to keep the test as is and fix it. The actual test doesn't 
look like it takes so much time, but it looks like it can be flaky when the 
resource usage is heavy. If you feel this can be resolved in other ways, would 
you mind taking a look at this please? I'm happy to close this one. Actually, 
I am not quite familiar with the ML ones ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  

test this please

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23120
  
a late LGTM as well


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23235#discussion_r239049825
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -35,6 +35,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, CSV datasource uses java.time API for parsing and 
generating CSV content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
 
+  - In Spark version 2.4 and earlier, CSV datasource converts a malformed 
CSV string to a row with all `null`s in the PERMISSIVE mode if specified schema 
is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if 
some of CSV column values were parsed and converted to desired types 
successfully.
--- End diff --

Ah, `from_csv` and `to_csv` are added in 3.0, so it's intentionally not 
mentioned. BTW, I think the CSV functionalities can only take a `StructType`, 
so maybe we don't have to mention it.
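
To make the behavior change concrete, here is a minimal PySpark sketch (the 
schema and the malformed value are made up for illustration; only the 
all-nulls vs. partial-result difference comes from the migration note above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Two integer columns; the second field of the input row is malformed on purpose.
schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
])
rdd = spark.sparkContext.parallelize(["1,notAnInt"])

df = spark.read.csv(rdd, schema=schema, mode="PERMISSIVE")
df.show()
# Spark 2.4 and earlier: the whole row becomes (null, null).
# Spark 3.0: the row keeps the successfully parsed value, (1, null).
```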


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
cc @BryanCutler and @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
(cc @squito as well since it's from #23111)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23236: [SPARK-26275][PYTHON][ML] Increases timeout for S...

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23236

[SPARK-26275][PYTHON][ML] Increases timeout for 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test

## What changes were proposed in this pull request?

Looks like this test is flaky:


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

```
==
FAIL: test_training_and_prediction 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 367, in test_training_and_prediction
self._eventually(condition)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 78, in _eventually
% (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last 
condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 
0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74

--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
```

This looks like it started happening after increasing the parallelism in 
Jenkins to speed up the build at https://github.com/apache/spark/pull/23111. I 
am able to reproduce this manually when the resource usage is heavy (with a 
manual decrease of the timeout).

## How was this patch tested?

Manually tested by 

```
cd python
./run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction' 
--python-executables=python
```
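
For context, a rough Python sketch of how an `_eventually`-style helper behaves 
(the actual helper in `pyspark/mllib/tests/test_streaming_algorithms.py` may 
differ in details); increasing the timeout simply gives the condition more 
attempts before the `AssertionError` shown above is raised:

```python
import time

def _eventually(condition, timeout=30.0):
    # Keep re-evaluating `condition` until it returns True or `timeout`
    # seconds have elapsed; on timeout, fail with the last returned value.
    start = time.time()
    last_value = None
    while time.time() - start < timeout:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(0.01)
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last_value))
```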


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26275

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23236.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23236


commit 3c4ee75c4d0585702cd87cc4df9af74e235bb431
Author: Hyukjin Kwon 
Date:   2018-12-05T12:17:21Z

Increases timeout for 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23227
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12 

### What is this PR for?

This is just to update Scala to 2.11.12 to be consistent with Spark 
(SPARK-24418).
This PR takes over and closes #3033

There was a minor conflict which my PR 
(https://github.com/apache/zeppelin/pull/3206) introduced. That change is 
compatible with both Scala 2.11.8 and 2.11.12 so we don't need to change it 
anymore.

### What type of PR is it?

[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?

* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?

- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3253


commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang 
Date:   2018-06-19T07:32:27Z

ZEPPELIN-3551. Upgrade Scala to 2.11.12




---


[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3253


---


[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23229
  
I don't think it's worth changing the variable name in a single PR. 
Let's do that when we fix some code around here, or let other people fix 
it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23224
  
Can we file a JIRA? I think it's not minor.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23230
  
Oh, the original one was for 3.0. Although this doc change could go into 
branch-2.4 on its own as well, let me revert it in branch-2.4 for management 
simplicity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23230
  
Merged to master and branch-2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23227
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23226: [MINOR][TEST] Add MAXIMUM_PAGE_SIZE_BYTES Excepti...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23226#discussion_r238977650
  
--- Diff: 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 ---
@@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() {
 } catch (IllegalArgumentException e) {
   // expected exception
 }
+
+try {
+  new BytesToBytesMap(
+  taskMemoryManager,
--- End diff --

Let's keep the indentation consistent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
Also, it needs a JIRA. It's not a minor one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
How do the existing tests verify whether the empty file is created or not?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3034: [WIP] ZEPPELIN-3552. Support Scala 2.12 of SparkInterp...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3034
  
Hey Jeff, I can take over this one too if you're busy, since I took a look 
at similar code paths.


---


[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12 

### What is this PR for?

This is just to update Scala to 2.11.12 to be consistent with Spark 
(SPARK-24418).
This PR takes over and closes #3033

There was a minor conflict which my PR 
(https://github.com/apache/zeppelin/pull/3206) introduced. That change is 
compatible with both Scala 2.11.8 and 2.11.12 so we don't need to change it 
anymore.

### What type of PR is it?

[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?

* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?

- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3253


commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang 
Date:   2018-06-19T07:32:27Z

ZEPPELIN-3551. Upgrade Scala to 2.11.12




---


[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Thank you @cloud-fan, @viirya, @srowen, and @BryanCutler.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23216#discussion_r238892679
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -56,7 +56,7 @@ private[spark] class ResultTask[T, U](
 stageAttemptId: Int,
 taskBinary: Broadcast[Array[Byte]],
 partition: Partition,
-locs: Seq[TaskLocation],
+@transient private var locs: Seq[TaskLocation],
--- End diff --

why is it `var` BTW?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23203#discussion_r238887812
  
--- Diff: python/run-tests.py ---
@@ -93,17 +93,18 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python):
 "pyspark-shell"
 ]
 env["PYSPARK_SUBMIT_ARGS"] = " ".join(spark_args)
-
-LOGGER.info("Starting test(%s): %s", pyspark_python, test_name)
+str_test_name = " ".join(test_name)
+LOGGER.info("Starting test(%s): %s", pyspark_python, str_test_name)
 start_time = time.time()
 try:
 per_test_output = tempfile.TemporaryFile()
 retcode = subprocess.Popen(
-[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
--- End diff --

Oh, yea. Looks like that's going to reduce the diff. Let me try.
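
For illustration, a hedged sketch of what passing the test names through as a 
list could look like (names are illustrative, not the actual `run-tests.py` 
code):

```python
import os
import subprocess
import tempfile

SPARK_HOME = os.environ.get("SPARK_HOME", ".")

def run_one_test(test_name):
    # `test_name` is assumed to be a list of strings, e.g.
    # ["pyspark.sql.tests.test_udf", "UDFTests.test_udf"]; join it only for
    # logging, and extend the argv with it for the subprocess.
    print("Starting test: %s" % " ".join(test_name))
    per_test_output = tempfile.TemporaryFile()
    proc = subprocess.Popen(
        [os.path.join(SPARK_HOME, "bin/pyspark")] + list(test_name),
        stdout=per_test_output, stderr=per_test_output)
    return proc.wait()
```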


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3033: ZEPPELIN-3551. Upgrade Scala to 2.11.12

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3033
  
Hey @zjffdu, busy? I can take this over - looks like there's only a minor conflict.


---


[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Yea, will update it as well after this one gets merged.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
It's fixed in the upcoming Spark release. Spark 2.4 does not support it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238496050
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

@MaxGekk, BTW, why does this have to be GMT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238495344
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

I think we can add an example that shows the difference. IIRC there is a 
difference regarding exact vs. non-exact pattern matching.
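
As a minimal, hedged illustration of the fallback switch mentioned in the note 
(the concrete example of the parsing difference would still need to be written; 
the input path below is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default in Spark 3.0: java.time-based parsing with ISO 8601 style patterns.
df_new = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("events.json")

# Switch back to the Spark 2.4 and earlier behavior if needed.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
df_legacy = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("events.json")
```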


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22590
  
Hm, let's try to find a way to expose other parse options. I think we 
shouldn't allow every option available in Univocity ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


