[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3253
  
I'm not aware of a release plan in Zeppelin since I'm just one of the contributors. For the current status, Spark should be downgraded as far as I can tell.


---


[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3253
  
Spark 2.4 support was added at https://github.com/apache/zeppelin/pull/3206, which exactly addresses the issue you faced. This will be available in a new release of Zeppelin.


---


[GitHub] spark issue #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES exceptio...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23226
  
Merged to master.


---




[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Merged to master.


---




[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23270
  
retest this please


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23262
  
Let's remove it. There's no point in keeping an unused method; the code will remain in the commit anyway. Also, there isn't much point in keeping a few-line method that's called in only one place.


---




[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23202
  
A similar discussion is going on at https://github.com/apache/spark/pull/23201#discussion_r240156871. Let me keep tracking them. Sorry for the late response, @MaxGekk.


---




[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23201#discussion_r240156871
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: 
JSONOptions) extends Serializable {
 DecimalType(bigDecimal.precision, bigDecimal.scale)
 }
 decimalTry.getOrElse(StringType)
-  case VALUE_STRING => StringType
+  case VALUE_STRING =>
+val stringValue = parser.getText
--- End diff --

> If we switch the order here, we don't need the length check here, right?

@cloud-fan, that works only if we use the default date/timestamp patterns. Both should do an exact match against the pattern, which unfortunately the current parsing library (SimpleDateFormat) does not allow.

The order here is just to make it look better; neither should depend on the order. I think we should support these inferences after completely switching the library to `java.time.format.*` without a legacy fallback. That should make this change easier and leave no holes.


---




[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23201#discussion_r240153595
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: 
JSONOptions) extends Serializable {
 DecimalType(bigDecimal.precision, bigDecimal.scale)
 }
 decimalTry.getOrElse(StringType)
-  case VALUE_STRING => StringType
+  case VALUE_STRING =>
+val stringValue = parser.getText
--- End diff --

Yea, I tried to match it with CSV a long time ago but kind of gave up due to behaviour changes, IIRC. If that's possible, it would be awesome.

If that's difficult, matching the behaviour within the text-based datasources (meaning CSV and JSON, I guess) should be good enough.


---




[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23262
  
LGTM otherwise.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240151214
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -17,51 +17,39 @@
 
 package org.apache.spark.sql.execution
 
+import scala.reflect.runtime.universe.TypeTag
+
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Encoder, Row, SparkSession}
-import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, 
UnknownPartitioning}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.StructType
 
 object RDDConversions {
-  def productToRowRdd[A <: Product](data: RDD[A], outputTypes: 
Seq[DataType]): RDD[InternalRow] = {
+  def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+  outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = 
ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes)
 data.mapPartitions { iterator =>
-  val numColumns = outputTypes.length
-  val mutableRow = new GenericInternalRow(numColumns)
-  val converters = 
outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
   iterator.map { r =>
-var i = 0
-while (i < numColumns) {
-  mutableRow(i) = converters(i)(r.productElement(i))
-  i += 1
-}
-
-mutableRow
+converters.toRow(r)
   }
 }
   }
 
   /**
* Convert the objects inside Row into the types Catalyst expected.
*/
-  def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): 
RDD[InternalRow] = {
+  def rowToRowRdd(data: RDD[Row], outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = RowEncoder(outputSchema)
--- End diff --

I checked each case. Every case looks fine except one:


https://github.com/apache/spark/blob/fa0d4bf69929c5acd676d602e758a969713d19d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L289

It looks like we're going to drop support for `Char` as `StringType`. I think it's trivial, and rather a mistake that we supported this. I don't feel strongly about documenting it in the migration guide, but if anyone does, we'd better do that.


---




[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23262#discussion_r240141142
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -17,51 +17,39 @@
 
 package org.apache.spark.sql.execution
 
+import scala.reflect.runtime.universe.TypeTag
+
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Encoder, Row, SparkSession}
-import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, 
RowEncoder}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, 
UnknownPartitioning}
 import org.apache.spark.sql.catalyst.util.truncatedString
 import org.apache.spark.sql.execution.metric.SQLMetrics
-import org.apache.spark.sql.types.DataType
+import org.apache.spark.sql.types.StructType
 
 object RDDConversions {
-  def productToRowRdd[A <: Product](data: RDD[A], outputTypes: 
Seq[DataType]): RDD[InternalRow] = {
+  def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+  outputSchema: StructType): 
RDD[InternalRow] = {
+val converters = 
ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes)
 data.mapPartitions { iterator =>
-  val numColumns = outputTypes.length
-  val mutableRow = new GenericInternalRow(numColumns)
-  val converters = 
outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
   iterator.map { r =>
-var i = 0
-while (i < numColumns) {
-  mutableRow(i) = converters(i)(r.productElement(i))
-  i += 1
-}
-
-mutableRow
+converters.toRow(r)
   }
 }
   }
 
   /**
* Convert the objects inside Row into the types Catalyst expected.
*/
-  def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): 
RDD[InternalRow] = {
+  def rowToRowRdd(data: RDD[Row], outputSchema: StructType): 
RDD[InternalRow] = {
--- End diff --

Let's remove the whole object. `rowToRowRdd` looks like it's only used in one place, and the code here is quite small.


---




[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22683
  
Looks okay to me too, although strictly speaking a UI change requires screenshots of the UI.


---




[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Let's fix the PR description as well. You can mark the comments above as resolved.


---




[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23249#discussion_r240123260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 ---
@@ -262,6 +261,15 @@ case class RangePartitioning(ordering: Seq[SortOrder], 
numPartitions: Int)
 super.satisfies0(required) || {
   required match {
 case OrderedDistribution(requiredOrdering) =>
+  // If `ordering` is a prefix of `requiredOrdering`:
+  //   - Let's say `ordering` is [a, b] and `requiredOrdering` is 
[a, b, c]. If a row is
+  // larger than another row w.r.t. [a, b], it's also larger 
w.r.t. [a, b, c]. So
+  // `RangePartitioning(a, b)` satisfy `OrderedDistribution(a, 
b, c)`.
--- End diff --

nit `satisfy` -> `satisfies`


---




[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23270
  
retest this please


---




[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...

2018-12-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23132
  
@MaxGekk, mind fixing the PR description accordingly?


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240110126
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

Thanks. It doesn't matter which company it is. All PRs are reviewed equally and reasonably, and merged according to the same criteria.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240107064
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

Ah, it's `:+` not `+:`. Yea, it's confusing.
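
For anyone else skimming this thread, here is a tiny plain-Scala sketch (hypothetical values, not code from the PR) of why the two operators are easy to mix up here:

```scala
val old  = "a,b"          // hypothetical existing value of the conf key
val cols = Seq("c", "d")  // hypothetical columns to append

// `+:` prepends, so the old value comes first (the same order the original loop produced):
cols.+:(old).mkString(",")   // "a,b,c,d"

// `:+` appends, so the old value would end up last instead:
(cols :+ old).mkString(",")  // "c,d,a,b"
```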


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240105070
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

? output is different, no?


---




[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18784
  
Looks inactive. @srowen and @felixcheung, do you know anyone who might be 
interested in this? 


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240101927
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

Right, but I don't think it's more readable. It's a non-critical path, so the performance concern is secondary anyway.


---




[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Yup, the test looks okay that way. Let's file a JIRA and keep only the test case.


---




[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23268
  
Let's close this one.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240097931
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
--- End diff --

Also, it looks like the output would be different after this change.
It was `old + col` before, but the current change looks like it does `col + old`.


---




[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23268#discussion_r240097105
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala 
---
@@ -53,19 +53,12 @@ private[hive] object HiveShim {
* This function in hive-0.13 become private, but we have to do this to 
work around hive bug
*/
   private def appendReadColumnNames(conf: Configuration, cols: 
Seq[String]) {
-val old: String = 
conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
-val result: StringBuilder = new StringBuilder(old)
-var first: Boolean = old.isEmpty
-
-for (col <- cols) {
-  if (first) {
-first = false
-  } else {
-result.append(',')
-  }
-  result.append(col)
-}
-conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, 
result.toString)
+val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR
+val value = Option(conf.get(key, null))
+  .map(old => cols.+:(old))
+  .getOrElse(cols)
+  .mkString(",")
--- End diff --

I don't think this is more readable. Also, the previous code is more 
performant. I wouldn't change this.


---




[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23266#discussion_r240074711
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java 
---
@@ -20,14 +20,27 @@
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.sources.v2.reader.Scan;
 import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
 
 /**
- * An empty mix-in interface for {@link Table}, to indicate this table 
supports batch scan.
- * 
- * If a {@link Table} implements this interface, its {@link 
Table#newScanBuilder(DataSourceOptions)}
- * must return a {@link ScanBuilder} that builds {@link Scan} with {@link 
Scan#toBatch()}
- * implemented.
- * 
+ * A mix-in interface for {@link Table} to provide data reading ability of 
batch processing.
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends Table {
+
+  /**
+   * Returns the schema of this table.
+   */
+  StructType schema();
--- End diff --

To me, +1 for the current change.


---




[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23266#discussion_r240073831
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java 
---
@@ -20,14 +20,27 @@
 import org.apache.spark.annotation.Evolving;
 import org.apache.spark.sql.sources.v2.reader.Scan;
 import org.apache.spark.sql.sources.v2.reader.ScanBuilder;
+import org.apache.spark.sql.types.StructType;
 
 /**
- * An empty mix-in interface for {@link Table}, to indicate this table 
supports batch scan.
- * 
- * If a {@link Table} implements this interface, its {@link 
Table#newScanBuilder(DataSourceOptions)}
- * must return a {@link ScanBuilder} that builds {@link Scan} with {@link 
Scan#toBatch()}
- * implemented.
- * 
+ * A mix-in interface for {@link Table} to provide data reading ability of 
batch processing.
  */
 @Evolving
-public interface SupportsBatchRead extends Table { }
+public interface SupportsBatchRead extends Table {
+
+  /**
+   * Returns the schema of this table.
+   */
+  StructType schema();
+
+  /**
+   * Returns a {@link ScanBuilder} which can be used to build a {@link 
Scan} later. The built
+   * {@link Scan} must implement {@link Scan#toBatch()}. Spark will call 
this method for each data
+   * scanning query.
+   * 
+   * The builder can take some query specific information to do operators 
pushdown, and keep these
+   * information in the created {@link Scan}.
+   * 
+   */
+  ScanBuilder newScanBuilder(DataSourceOptions options);
--- End diff --

+1


---




[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16812
  
I still don't think we need this since the workaround is easy. If other committers find it worthwhile, I won't object.
If there is no interest in this PR afterwards, I would just close it.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
> Visualizing a workflow is nice, but Spark's Pipelines are typically 
pretty straightforward and linear. I could imagine producing a nicer 
visualization than what you get from reading the Spark UI, although of course 
we already have some degree of history and data there.

Another good thing that might have to be considered is that we can connect this with other SQL events, for instance, where the input `Dataset` originated. With current Apache Spark, I can visualise SQL operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine these existing lineages to easily understand where the data comes from and where it goes.

(BTW, I hope it doesn't sound like I'm pushing this for my one specific case - I think this hook-like feature can be useful in many ways. I currently have one explicit example to show, so I'm referring to my case.)

> These are just the hooks, right? someone would have to implement 
something to use these events. I see the value in the API to some degree, but 
with no concrete implementation, does it add anything for Spark users out of 
the box?

Yes, right. It does not add anything for Spark users out of the box. It needs a custom listener implementation. For instance, with the custom listener below:

```scala
class CustomSparkListener extends SparkListener {
  override def onOtherEvent(e: SparkListenerEvent): Unit = e match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```

There are two (existing) ways to use this.

```scala
spark.sparkContext.addSparkListener(new CustomSparkListener)
```

```bash
spark-submit \
  --conf spark.extraListeners=CustomSparkListener \
  ...
```

It's also similar to other existing implementations on the SQL side (the catalog events described above in the PR description).

One actual example I had with a SQL query listener was that I had to close a connection every time after a SQL execution.

> Is that what someone would likely do? or would someone likely have to run 
Atlas to use this? If that's a good example of the use case, and Atlas is 
really about lineage and governance, is that the thrust of this change, to help 
with something to do with model lineage and reproducibility?

There are two reasons.

1. I think people in general would likely utilise this feature like other event listeners. At least, I can see some interest in it outside.

- SQL Listener
  - 
https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
  - 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html

- Streaming Query Listener
  - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
  -  
http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416

2. Someone would likely run this via Atlas so that they could do something about lineage and governance, yes, as you said. I'm trying to show integrated lineages in Apache Spark, but this is a missing piece, and this change is needed to fill it.



---




[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23256
  
Oops,

> tbh, more meaningful dataset as example would be better...

did you expect to fix more instances here, Felix? Sorry, I misread.


---




[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23256
  
Merged to master.


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240004006
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends 
PipelineStage {
* Fits a model to the input data.
*/
   @Since("2.0.0")
-  def fit(dataset: Dataset[_]): M
+  def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) {
+fitImpl(dataset)
+  }
+
+  /**
+   * `fit()` handles events and then calls this method. Subclasses should 
override this
+   * method to implement the actual fiting a model to the input data.
+   */
+  @Since("3.0.0")
+  protected def fitImpl(dataset: Dataset[_]): M = {
+// Keep this default body for backward compatibility.
+throw new UnsupportedOperationException("fitImpl is not implemented.")
--- End diff --

Yes, that was my intention. I wanted to force implementing `fitImpl`, but was thinking that might be too breaking a change (it's going to at least break source compatibility). I am willing to follow other suggestions; I am pretty sure you or others are more familiar with the ML side.
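
For readers following along, here is a small standalone sketch (plain Scala with illustrative names, not the PR's actual classes) of the compatibility trade-off being discussed: making `fitImpl` abstract would break existing subclasses that only override `fit`, while a default body that throws keeps them compiling:

```scala
abstract class Estimator {
  // New entry point: wraps the actual work so events can be sent around it.
  def fit(input: String): String = {
    println(s"FitStart($input)")   // stand-in for posting a start event
    val model = fitImpl(input)
    println(s"FitEnd($model)")     // stand-in for posting an end event
    model
  }

  // The default body keeps source and binary compatibility: subclasses written
  // before fitImpl existed still compile because the method is not abstract.
  protected def fitImpl(input: String): String =
    throw new UnsupportedOperationException("fitImpl is not implemented.")
}

// A pre-existing subclass that overrides fit directly still compiles and works,
// although it bypasses the event wrapper:
class LegacyEstimator extends Estimator {
  override def fit(input: String): String = s"legacy-model($input)"
}

// A new-style subclass implements fitImpl and gets the events for free:
class NewEstimator extends Estimator {
  override protected def fitImpl(input: String): String = s"model($input)"
}
```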


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240003885
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import java.io.File
+
+import scala.collection.mutable
+import scala.concurrent.duration._
+
+import org.apache.hadoop.fs.Path
+import org.mockito.Matchers.{any, eq => meq}
+import org.mockito.Mockito.when
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.concurrent.Eventually
+import org.scalatest.mockito.MockitoSugar.mock
+
+import org.apache.spark.{SparkContext, SparkFunSuite}
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
+import org.apache.spark.sql._
+import org.apache.spark.util.Utils
+
+
+class MLEventsSuite
+extends SparkFunSuite
+with BeforeAndAfterEach
+with DefaultReadWriteTest
+with Eventually {
+
+  private var spark: SparkSession = _
+  private var sc: SparkContext = _
+  private var checkpointDir: String = _
+  private var listener: SparkListener = _
+  private val dirName: String = "pipeline"
+  private val events = mutable.ArrayBuffer.empty[MLEvent]
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+sc = new SparkContext("local[2]", "SparkListenerSuite")
+listener = new SparkListener {
+  override def onOtherEvent(event: SparkListenerEvent): Unit = event 
match {
+case e: FitStart[_] => events.append(e)
+case e: FitEnd[_] => events.append(e)
+case e: TransformStart => events.append(e)
+case e: TransformEnd => events.append(e)
+case e: SaveInstanceStart if e.path.endsWith(dirName) => 
events.append(e)
+case e: SaveInstanceEnd if e.path.endsWith(dirName) => 
events.append(e)
+case _ =>
+  }
+}
+sc.addSparkListener(listener)
+
+spark = SparkSession.builder()
+  .sparkContext(sc)
+  .getOrCreate()
+
+checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, 
"checkpoints").toString
+sc.setCheckpointDir(checkpointDir)
--- End diff --

Let me double check and address this while fixing the test. I just copied 
this from `MLlibTestSparkContext`.


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r240003869
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala ---
@@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") (
* @return fitted pipeline
*/
   @Since("2.0.0")
-  override def fit(dataset: Dataset[_]): PipelineModel = {
+  override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset)
--- End diff --

Ah, it's there just to keep the `@Since`. It looks like some classes don't explicitly note that, so I didn't call `super` in other places.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
The tests pass locally for me. I'll fix them shortly.


---




[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23263
  
cc @srowen, @cloud-fan (since it mimics SQL's event listeners), @jkbradley, @mengxr and @yanboliang. Mind taking a look, please? WDYT about this?


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23263#discussion_r23747
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
 }
   }
 
-  protected def transformImpl(dataset: Dataset[_]): DataFrame = {
+  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
--- End diff --

`transformImpl` (in some abstractions) and `saveImpl` already exist.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/23261


---




[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23263

[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track and 
add some actions for them.

## Introduction

This PR proposes to send some ML events, like SQL does. This is quite useful when people want to track and take actions on the corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes on top of this PR, I can visualise an ML pipeline as below:


![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is useful to be able to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now, with no guarantee about stability).

## Implementation Details

### Sends event (but not expose ML specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...StartEvent(caller, input)
@Unstable
case class ...EndEvent(caller, output)

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```
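
To make the shape of those wrappers a bit more concrete, below is a hedged sketch of what one event pair and its wrapper could look like. The exact fields, names and signatures are assumptions for illustration, not the final API in this PR; it assumes the code lives under `org.apache.spark.ml` so that the `private[spark]` listener bus is reachable:

```scala
package org.apache.spark.ml

import org.apache.spark.SparkContext
import org.apache.spark.annotation.Unstable
import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.sql.Dataset

// Events go through the existing listener bus, so they extend SparkListenerEvent.
@Unstable sealed trait MLEvent extends SparkListenerEvent
@Unstable case class FitStart[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_]) extends MLEvent
@Unstable case class FitEnd[M <: Model[M]](estimator: Estimator[M], model: M) extends MLEvent

private[ml] object MLEvents {
  // Wraps the body of fit()/fitImpl() and posts start/end events around it.
  def withFitEvent[M <: Model[M]](estimator: Estimator[M], dataset: Dataset[_])(fit: => M): M = {
    val listenerBus = SparkContext.getOrCreate().listenerBus
    listenerBus.post(FitStart(estimator, dataset))  // posted before the body runs
    val model = fit                                 // the actual fitImpl body
    listenerBus.post(FitEnd(estimator, model))      // posted after it completes
    model
  }
}
```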

This way mimics both:

**1. Catalog events (see 
`org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added

- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's too invasive. Also, exposing another ML-specific listener sounds like a bit too much at this stage. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see 
`org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.


### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

  Note that `saveImpl` was already implemented unlike other instances below.


```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { 
transformImpl(...) }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way as below in ML:

**1. `transform` and `transformImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

_This keeps both source and binary backward compatibility_. I was thinking of enforcing `...Impl` by leaving them as abstract methods to force implementation, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, _there's no difference_

[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23261
  
I'm going to rebase and reopen a PR to retrigger the AppVeyor tests. The failure looks unrelated.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23261#discussion_r239998861
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -163,7 +163,7 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
*/
   @Since("1.6.0")
   @throws[IOException]("If the input path already exists but overwrite is 
not enabled.")
-  def save(path: String): Unit = {
+  def save(path: String): Unit = MLEvents.withSaveInstanceEvent(this, 
path) {
 new FileSystemOverwrite().handleOverwrite(path, shouldOverwrite, sc)
 saveImpl(path)
--- End diff --

and `saveImpl` already exists as well.


---




[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events

2018-12-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23261#discussion_r239998855
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: 
PredictionModel[FeaturesType,
 }
   }
 
-  protected def transformImpl(dataset: Dataset[_]): DataFrame = {
+  override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
--- End diff --

For clarification, some of the abstractions already have `transformImpl`.


---




[GitHub] spark pull request #23261: [SPARK-23674][ML] Adds Spark ML Events

2018-12-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23261

[SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track and add some actions for them.

## Introduction

This PR proposes to send some ML events, like SQL does. This is quite useful when people want to track and take actions on the corresponding ML operations. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:


![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is useful to be able to track SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Sends event (but not expose ML specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This way mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener, `ExternalCatalogEventListener`, to be added
- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) are created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's invasive. Also, exposing another ML-specific listener sounds like a bit too much at the current stage. Therefore, I simply borrowed the file name and structure here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.


### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

  Note that `saveImpl` was already implemented unlike other instances below.


```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl() }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { 
transformImpl() }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way as below in ML:

**1. `transform` and `transformImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**


https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility. I was thinking of enforcing `...Impl` by leaving them as abstract methods, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, there's no difference. `...Impl` methods are protected and not visible to end users.

- From the developer API perspective, if some developers want to override the `...` methods instead of `...Impl`, that's still fine. It only does

[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701916
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

BTW, when I added that to https://spark.apache.org/contributing.html, we also agreed to follow the committer's judgement based on the guide, because the guide mentions:

> The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code.

since we do have a legacy reason, and there is a good reason: consistency and the committer's judgement.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701364
  
--- Diff: R/pkg/tests/fulltests/test_mllib_clustering.R ---
@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
   expect_equal(length(local.posterior), sum(unlist(local.posterior)))
 })
 
+test_that("spark.assignClusters", {
+  df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+ list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+ list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+  clusters <- spark.assignClusters(df, initMode = "degree", weightCol = 
"weight")
+  expected_result <- createDataFrame(list(list(4L, 1L),
+  list(0L, 0L),
+  list(1L, 0L),
+  list(3L, 1L),
+  list(2L, 0L)),
+  schema = c("id", "cluster"))
--- End diff --

ditto for style


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701069
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

There are two separate styles already mixed in the R code, IIRC:

```r
df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
  list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
```

or

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
   list(4L, 0L, 0.1)),
  schema = c("src", "dst", "weight"))
```

Let's avoid mixing styles and go for the latter when possible, because at least that complies better with the code style.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239700846
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

Yea, we do have an indentation rule. See the "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented, but at least there are some examples. I think the correct indentation is:

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
   list(4L, 0L, 0.1)),
  schema = c("src", "dst", "weight"))
``` 


---




[GitHub] spark pull request #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES e...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23226#discussion_r239662927
  
--- Diff: 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 ---
@@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() {
 } catch (IllegalArgumentException e) {
   // expected exception
 }
+
+try {
+  new BytesToBytesMap(
+taskMemoryManager,
+1,
+TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES+1);
--- End diff --

`E_BYTES+1` -> `E_BYTES + 1`


---




[GitHub] spark issue #23248: [SPARK-26293][SQL] Cast exception when having python udf...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23248
  
Thanks, @cloud-fan. I will take a look by tomorrow; don't be blocked by me.


---




[GitHub] spark pull request #23248: [SPARK-26293][SQL] Cast exception when having pyt...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23248#discussion_r239441136
  
--- Diff: python/pyspark/sql/tests/test_udf.py ---
@@ -23,7 +23,7 @@
 
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, Column, Row
-from pyspark.sql.functions import UserDefinedFunction
+from pyspark.sql.functions import UserDefinedFunction, udf
--- End diff --

Ah, yea. It's okay, and I think it's good timing to clean up while we are here, now that it's broken down into multiple test files.


---




[GitHub] spark issue #23245: [SPARK-26060][SQL][FOLLOW-UP] Rename the config name.

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23245
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23246: [SPARK-26292][CORE]Assert statement of currentPage may b...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23246
  
Ditto. Let's close these and focus on fixing actual bugs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23247
  
@wangjiaochun, I think you had better stop fixing trivial stuff in each PR. 
That stuff can be fixed when the code around it is fixed, or other 
people can fix it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23215
  
+1 from me too. It makes sense to me, and having a configuration sounds good.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23243: [branch-2.4][ExternalShuffleService]add initRegisteredEx...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23243
  
A backport from which JIRA, @weixiuli? Usually a fix should go to master 
first and then be backported to other branches when it's needed. If it should 
be fixed in the master branch as well, let's file a JIRA and switch the target 
branch to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23237
  
Looks like some classes, for instance `KafkaUtils`, expose this (I guess 
mistakenly?). Let's not bother with this and close this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23237
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Thanks guys .. I will try to take a look at it as well (although it's 
going to take a relatively long while). Let me get this in for now anyway.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
If a test passes both with and without this change, it doesn't verify the 
regression.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
Thanks for a close look, @BryanCutler. I just increased the timeout because 
I was trying to keep the test as is and fix it. The actual test doesn't 
look like it takes so much time, but it looks like it can be flaky when the 
resource usage is heavy. If you feel this can be resolved in other ways, would 
you mind taking a look at this please? I'm happy to close this one. Actually, 
I am not quite familiar with the ML ones ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  

test this please

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23120
  
a late LGTM as well


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23235#discussion_r239049825
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -35,6 +35,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Since Spark 3.0, CSV datasource uses java.time API for parsing and 
generating CSV content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
 
+  - In Spark version 2.4 and earlier, CSV datasource converts a malformed 
CSV string to a row with all `null`s in the PERMISSIVE mode if specified schema 
is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if 
some of CSV column values were parsed and converted to desired types 
successfully.
--- End diff --

Ah, `from_csv` and `to_csv` are added in 3.0, so it's intentionally not 
mentioned. BTW, I think the CSV functionalities can only take a `StructType`, 
so maybe we don't have to mention it.
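
To make the behavior change concrete, here is a minimal PySpark sketch (the 
schema and the malformed value are made up for illustration; only the 
all-nulls vs. partial-result difference comes from the migration note above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# Two integer columns; the second field of the input row is malformed on purpose.
schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
])
rdd = spark.sparkContext.parallelize(["1,notAnInt"])

df = spark.read.csv(rdd, schema=schema, mode="PERMISSIVE")
df.show()
# Spark 2.4 and earlier: the whole row becomes (null, null).
# Spark 3.0: the row keeps the successfully parsed value, (1, null).
```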


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
cc @BryanCutler and @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
(cc @squito as well since it's from #23111)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23236
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23236: [SPARK-26275][PYTHON][ML] Increases timeout for S...

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23236

[SPARK-26275][PYTHON][ML] Increases timeout for 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test

## What changes were proposed in this pull request?

Looks like this test is flaky:


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

```
==
FAIL: test_training_and_prediction 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 367, in test_training_and_prediction
self._eventually(condition)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 78, in _eventually
% (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last 
condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 
0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74

--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
```

This looks like it started happening after increasing the parallelism in 
Jenkins to speed up the build at https://github.com/apache/spark/pull/23111. I 
am able to reproduce this manually when the resource usage is heavy (with a 
manual decrease of the timeout).

## How was this patch tested?

Manually tested by 

```
cd python
./run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction' 
--python-executables=python
```
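
For context, a rough Python sketch of how an `_eventually`-style helper behaves 
(the actual helper in `pyspark/mllib/tests/test_streaming_algorithms.py` may 
differ in details); increasing the timeout simply gives the condition more 
attempts before the `AssertionError` shown above is raised:

```python
import time

def _eventually(condition, timeout=30.0):
    # Keep re-evaluating `condition` until it returns True or `timeout`
    # seconds have elapsed; on timeout, fail with the last returned value.
    start = time.time()
    last_value = None
    while time.time() - start < timeout:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(0.01)
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last_value))
```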


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26275

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23236.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23236


commit 3c4ee75c4d0585702cd87cc4df9af74e235bb431
Author: Hyukjin Kwon 
Date:   2018-12-05T12:17:21Z

Increases timeout for 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23227
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon reopened a pull request:

https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12 

### What is this PR for?

This is just to update Scala to 2.11.12 to be consistent with Spark 
(SPARK-24418).
This PR takes over and closes #3033

There was a minor conflict which my PR 
(https://github.com/apache/zeppelin/pull/3206) introduced. That change is 
compatible with both Scala 2.11.8 and 2.11.12 so we don't need to change it 
anymore.

### What type of PR is it?

[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?

* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?

- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3253


commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang 
Date:   2018-06-19T07:32:27Z

ZEPPELIN-3551. Upgrade Scala to 2.11.12




---


[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/zeppelin/pull/3253


---


[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23229
  
I don't think it's worth changing the variable name in a single PR. 
Let's do that when we fix some code around here, or let other people fix 
it later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23224
  
Can we file a JIRA? I think it's not minor.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23230
  
Oh, the original one was for 3.0. Although this doc change could go into 
branch-2.4 on its own as well, let me revert it in branch-2.4 for management 
simplicity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23230
  
Merged to master and branch-2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23227
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23226: [MINOR][TEST] Add MAXIMUM_PAGE_SIZE_BYTES Excepti...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23226#discussion_r238977650
  
--- Diff: 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 ---
@@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() {
 } catch (IllegalArgumentException e) {
   // expected exception
 }
+
+try {
+  new BytesToBytesMap(
+  taskMemoryManager,
--- End diff --

Let's keep the indentation consistent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
Also, it needs a JIRA. It's not a minor one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23225
  
How do the existing tests verify whether the empty file is created or not?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3034: [WIP] ZEPPELIN-3552. Support Scala 2.12 of SparkInterp...

2018-12-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3034
  
Hey Jeff, I can take over this one too if you're busy, since I took a look 
at similar code paths.


---


[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12

2018-12-05 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12 

### What is this PR for?

This is just to update Scala to 2.11.12 to be consistent with Spark 
(SPARK-24418).
This PR takes over and closes #3033

There was a minor conflict which my PR 
(https://github.com/apache/zeppelin/pull/3206) introduced. That change is 
compatible with both Scala 2.11.8 and 2.11.12 so we don't need to change it 
anymore.

### What type of PR is it?

[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?

* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?

- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3253


commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang 
Date:   2018-06-19T07:32:27Z

ZEPPELIN-3551. Upgrade Scala to 2.11.12




---


[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Thank you @cloud-fan, @viirya, @srowen, and @BryanCutler.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23216#discussion_r238892679
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -56,7 +56,7 @@ private[spark] class ResultTask[T, U](
 stageAttemptId: Int,
 taskBinary: Broadcast[Array[Byte]],
 partition: Partition,
-locs: Seq[TaskLocation],
+@transient private var locs: Seq[TaskLocation],
--- End diff --

why is it `var` BTW?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23203#discussion_r238887812
  
--- Diff: python/run-tests.py ---
@@ -93,17 +93,18 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python):
 "pyspark-shell"
 ]
 env["PYSPARK_SUBMIT_ARGS"] = " ".join(spark_args)
-
-LOGGER.info("Starting test(%s): %s", pyspark_python, test_name)
+str_test_name = " ".join(test_name)
+LOGGER.info("Starting test(%s): %s", pyspark_python, str_test_name)
 start_time = time.time()
 try:
 per_test_output = tempfile.TemporaryFile()
 retcode = subprocess.Popen(
-[os.path.join(SPARK_HOME, "bin/pyspark"), test_name],
--- End diff --

Oh, yea. Looks like that's going to reduce the diff. Let me try.
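
For illustration, a hedged sketch of what passing the test names through as a 
list could look like (names are illustrative, not the actual `run-tests.py` 
code):

```python
import os
import subprocess
import tempfile

SPARK_HOME = os.environ.get("SPARK_HOME", ".")

def run_one_test(test_name):
    # `test_name` is assumed to be a list of strings, e.g.
    # ["pyspark.sql.tests.test_udf", "UDFTests.test_udf"]; join it only for
    # logging, and extend the argv with it for the subprocess.
    print("Starting test: %s" % " ".join(test_name))
    per_test_output = tempfile.TemporaryFile()
    proc = subprocess.Popen(
        [os.path.join(SPARK_HOME, "bin/pyspark")] + list(test_name),
        stdout=per_test_output, stderr=per_test_output)
    return proc.wait()
```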


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] zeppelin issue #3033: ZEPPELIN-3551. Upgrade Scala to 2.11.12

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/zeppelin/pull/3033
  
Hey @zjffdu, busy? I can take this over - looks like there's only a minor conflict.


---


[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
Yea, will update it as well after this one gets merged.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
It's fixed in the upcoming Spark release. Spark 2.4 does not support it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238496050
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
   override def beforeAll() {
 super.beforeAll()
 TestHive.setCacheTables(true)
-// Timezone is fixed to America/Los_Angeles for those timezone 
sensitive tests (timestamp_*)
-TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+// Timezone is fixed to GMT for those timezone sensitive tests 
(timestamp_*)
--- End diff --

@MaxGekk, BTW, why does this have to be GMT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23196#discussion_r238495344
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - Spark applications which are built with Spark version 2.4 and prior, 
and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 
3.0, as they are not binary compatible with Spark 3.0.
 
+  - Since Spark 3.0, JSON datasource uses java.time API for parsing and 
generating JSON content. New formatting implementation supports date/timestamp 
patterns conformed to ISO 8601. To switch back to the implementation used in 
Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
--- End diff --

I think we can add an example that shows the difference. IIRC there is a 
difference regarding exact vs. non-exact pattern matching.
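
As a minimal, hedged illustration of the fallback switch mentioned in the note 
(the concrete example of the parsing difference would still need to be written; 
the input path below is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default in Spark 3.0: java.time-based parsing with ISO 8601 style patterns.
df_new = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("events.json")

# Switch back to the Spark 2.4 and earlier behavior if needed.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
df_legacy = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("events.json")
```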


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22590
  
Hm, let's try to find a way to expose other parse options. I think we 
shouldn't allow every option available in Univocity ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


