[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3253 I'm not aware of the release plan in Zeppelin since I'm just one of the contributors. For the current status, Spark should be downgraded as far as I can tell.
[GitHub] zeppelin issue #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3253 Spark 2.4 support was added at https://github.com/apache/zeppelin/pull/3206, which addresses exactly the issue you faced. This will be available in a new release of Zeppelin.
[GitHub] spark issue #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES exceptio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23226 Merged to master.
[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Merged to master.
[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23270 retest this please
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23262 Let's remove it. There's no point in keeping an unused method; the code will remain in the commit history anyway. Also, there's not much point in keeping a few-line method that's called in only one place.
[GitHub] spark issue #23202: [SPARK-26248][SQL] Infer date type from CSV
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23202 A similar discussion is going on at https://github.com/apache/spark/pull/23201#discussion_r240156871. Let me keep tracking them. Sorry for the late response, @MaxGekk
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240156871 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala --- @@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { DecimalType(bigDecimal.precision, bigDecimal.scale) } decimalTry.getOrElse(StringType) - case VALUE_STRING => StringType + case VALUE_STRING => +val stringValue = parser.getText --- End diff -- > If we switch the order here, we don't need the length check here, right? @cloud-fan, that works only if we use the default date/timestamp patterns. Both should do an exact match with the pattern, which unfortunately the current parsing library (SimpleDateFormat) does not allow. The order here is just to make it look better; neither should depend on the order. I think we should support those inferences after completely switching the library to `java.time.format.*` without a legacy fallback. That should make this change easier without leaving a hole.
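For reference, a small standalone sketch of the parsing gap mentioned here (plain JDK calls, not Spark code): `SimpleDateFormat` happily parses a matching prefix and silently ignores trailing text, so it cannot do the exact-match check the inference needs, while a strict `java.time` parser rejects the leftover text.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val legacy = new SimpleDateFormat("yyyy-MM-dd")
// Succeeds even though the value is not a clean date: the trailing "-10" is ignored.
legacy.parse("2018-12-02-10")

val strict = DateTimeFormatter.ofPattern("yyyy-MM-dd")
// Throws DateTimeParseException: unparsed text found at index 10.
LocalDate.parse("2018-12-02-10", strict)
```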
[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23201#discussion_r240153595 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala --- @@ -121,7 +122,26 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { DecimalType(bigDecimal.precision, bigDecimal.scale) } decimalTry.getOrElse(StringType) - case VALUE_STRING => StringType + case VALUE_STRING => +val stringValue = parser.getText --- End diff -- Yea, I tried to match it with CSV a long, long time ago but I kind of gave up due to behaviour changes, IIRC. If that's possible, it would be awesome. If that's difficult, matching the behaviour within the text-based datasources (meaning CSV and JSON, I guess) should be good enough.
[GitHub] spark issue #23262: [SPARK-26312][SQL]Converting converters in RDDConversion...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23262 LGTM otherwise
[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23262#discussion_r240151214 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala --- @@ -17,51 +17,39 @@ package org.apache.spark.sql.execution +import scala.reflect.runtime.universe.TypeTag + import org.apache.spark.rdd.RDD import org.apache.spark.sql.{Encoder, Row, SparkSession} -import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow} +import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning} import org.apache.spark.sql.catalyst.util.truncatedString import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.DataType +import org.apache.spark.sql.types.StructType object RDDConversions { - def productToRowRdd[A <: Product](data: RDD[A], outputTypes: Seq[DataType]): RDD[InternalRow] = { + def productToRowRdd[A <: Product : TypeTag](data: RDD[A], + outputSchema: StructType): RDD[InternalRow] = { +val converters = ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes) data.mapPartitions { iterator => - val numColumns = outputTypes.length - val mutableRow = new GenericInternalRow(numColumns) - val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter) iterator.map { r => -var i = 0 -while (i < numColumns) { - mutableRow(i) = converters(i)(r.productElement(i)) - i += 1 -} - -mutableRow +converters.toRow(r) } } } /** * Convert the objects inside Row into the types Catalyst expected. */ - def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = { + def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = { +val converters = RowEncoder(outputSchema) --- End diff -- I checked each case. Every case looks fine except one: https://github.com/apache/spark/blob/fa0d4bf69929c5acd676d602e758a969713d19d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L289 Looks like we're going to drop support for `Char` as `StringType`. I think it's trivial, and rather a mistake that we supported it. I don't feel strongly about documenting it in the migration guide, but if anyone does, we'd better do that.
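To make the behaviour difference concrete, here is a purely illustrative, hypothetical helper - not the real `CatalystTypeConverters` code - showing the extra case that the encoder-based path no longer accepts:

```scala
// Hypothetical stand-in for the old converter behaviour: a Char was accepted
// for a StringType column; an encoder-based converter only accepts String.
def oldStringConverter(value: Any): String = value match {
  case s: String => s
  case c: Char   => c.toString  // this is the case being dropped
}
```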
[GitHub] spark pull request #23262: [SPARK-26312][SQL]Converting converters in RDDCon...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23262#discussion_r240141142 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala --- @@ -17,51 +17,39 @@ package org.apache.spark.sql.execution +import scala.reflect.runtime.universe.TypeTag + import org.apache.spark.rdd.RDD import org.apache.spark.sql.{Encoder, Row, SparkSession} -import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow} +import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning} import org.apache.spark.sql.catalyst.util.truncatedString import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.DataType +import org.apache.spark.sql.types.StructType object RDDConversions { - def productToRowRdd[A <: Product](data: RDD[A], outputTypes: Seq[DataType]): RDD[InternalRow] = { + def productToRowRdd[A <: Product : TypeTag](data: RDD[A], + outputSchema: StructType): RDD[InternalRow] = { +val converters = ExpressionEncoder[A].resolveAndBind(outputSchema.toAttributes) data.mapPartitions { iterator => - val numColumns = outputTypes.length - val mutableRow = new GenericInternalRow(numColumns) - val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter) iterator.map { r => -var i = 0 -while (i < numColumns) { - mutableRow(i) = converters(i)(r.productElement(i)) - i += 1 -} - -mutableRow +converters.toRow(r) } } } /** * Convert the objects inside Row into the types Catalyst expected. */ - def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = { + def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = { --- End diff -- Let's remove the whole object. `rowToRowRdd` looks like it's used in only one place, and the code here is quite small.
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22683 Looks okay to me too. Strictly speaking, though, a UI change requires screenshots of the UI.
[GitHub] spark issue #23268: [SPARK-26319][SQL][Test] Add appendReadColumns Unit Test...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Let's fix the PR description as well. You can leave the comments above resolved.
[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23249#discussion_r240123260 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala --- @@ -262,6 +261,15 @@ case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int) super.satisfies0(required) || { required match { case OrderedDistribution(requiredOrdering) => + // If `ordering` is a prefix of `requiredOrdering`: + // - Let's say `ordering` is [a, b] and `requiredOrdering` is [a, b, c]. If a row is + // larger than another row w.r.t. [a, b], it's also larger w.r.t. [a, b, c]. So + // `RangePartitioning(a, b)` satisfy `OrderedDistribution(a, b, c)`. --- End diff -- nit `satisfy` -> `satisfies`
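The prefix-ordering property stated in that doc comment can be checked with a tiny standalone snippet (plain Scala, not Spark code): if two tuples already differ on the `(a, b)` prefix, extending the ordering to `(a, b, c)` cannot flip their relative order.

```scala
// Order only by the first two fields vs. by all three fields.
val byAB  = Ordering.by[(Int, Int, Int), (Int, Int)](t => (t._1, t._2))
val byABC = Ordering[(Int, Int, Int)]

val x = (1, 2, 9)
val y = (1, 3, 0)
assert(byAB.lt(x, y))   // x < y w.r.t. (a, b)
assert(byABC.lt(x, y))  // ... and therefore also w.r.t. (a, b, c)
```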
[GitHub] spark issue #23270: [SPARK-26317][BUILD] Upgrade SBT to 0.13.18
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23270 retest this please
[GitHub] spark issue #23132: [SPARK-26163][SQL] Parsing decimals from JSON using loca...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23132 @MaxGekk, mind fixing the PR description accordingly?
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240110126 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) + .getOrElse(cols) + .mkString(",") --- End diff -- Thanks. It doesn't matter which company it is. All PRs are reviewed equally and reasonably, and merged per the same criteria.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240107064 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) --- End diff -- Ah, it's `:+` not `+:`. Yea, it's confusing.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240105070 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) --- End diff -- ? output is different, no?
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18784 Looks inactive. @srowen and @felixcheung, do you know anyone who might be interested in this?
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240101927 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) + .getOrElse(cols) + .mkString(",") --- End diff -- Right, but I don't think it's more readable. It's a non-critical path, so the performance concern is secondary anyway.
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Yup, the test looks okay that way. Let's file a JIRA and only leave the test case.
[GitHub] spark issue #23268: [Hive][Minor] Refactor on HiveShim and Add Unit Tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23268 Let's close this one.
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240097931 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) --- End diff -- Also, it looks like the output would be different after this change. It looks like it was `old + col`, but the current change looks like it does `col + old`.
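For anyone following along, a quick standalone illustration of the two operators behind the confusion (using an assumed existing value `"a,b"` and new columns `c` and `d`):

```scala
val cols = Seq("c", "d")
val old  = "a,b"

old +: cols                  // Seq("a,b", "c", "d")  -- `+:` prepends the element
cols :+ old                  // Seq("c", "d", "a,b")  -- `:+` appends the element
cols.+:(old)                 // same as `old +: cols`, despite how the call reads
(old +: cols).mkString(",")  // "a,b,c,d" -- the existing value stays first
```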
[GitHub] spark pull request #23268: [Hive][Minor] Refactor on HiveShim and Add Unit T...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23268#discussion_r240097105 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala --- @@ -53,19 +53,12 @@ private[hive] object HiveShim { * This function in hive-0.13 become private, but we have to do this to work around hive bug */ private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { -val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") -val result: StringBuilder = new StringBuilder(old) -var first: Boolean = old.isEmpty - -for (col <- cols) { - if (first) { -first = false - } else { -result.append(',') - } - result.append(col) -} -conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) +val key = ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR +val value = Option(conf.get(key, null)) + .map(old => cols.+:(old)) + .getOrElse(cols) + .mkString(",") --- End diff -- I don't think this is more readable. Also, the previous code is more performant. I wouldn't change this.
[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23266#discussion_r240074711 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java --- @@ -20,14 +20,27 @@ import org.apache.spark.annotation.Evolving; import org.apache.spark.sql.sources.v2.reader.Scan; import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; /** - * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan. - * - * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)} - * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()} - * implemented. - * + * A mix-in interface for {@link Table} to provide data reading ability of batch processing. */ @Evolving -public interface SupportsBatchRead extends Table { } +public interface SupportsBatchRead extends Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); --- End diff -- To me, +1 for the current change.
[GitHub] spark pull request #23266: [SPARK-26313][SQL] move read related methods from...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23266#discussion_r240073831 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchRead.java --- @@ -20,14 +20,27 @@ import org.apache.spark.annotation.Evolving; import org.apache.spark.sql.sources.v2.reader.Scan; import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; /** - * An empty mix-in interface for {@link Table}, to indicate this table supports batch scan. - * - * If a {@link Table} implements this interface, its {@link Table#newScanBuilder(DataSourceOptions)} - * must return a {@link ScanBuilder} that builds {@link Scan} with {@link Scan#toBatch()} - * implemented. - * + * A mix-in interface for {@link Table} to provide data reading ability of batch processing. */ @Evolving -public interface SupportsBatchRead extends Table { } +public interface SupportsBatchRead extends Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); + + /** + * Returns a {@link ScanBuilder} which can be used to build a {@link Scan} later. The built + * {@link Scan} must implement {@link Scan#toBatch()}. Spark will call this method for each data + * scanning query. + * + * The builder can take some query specific information to do operators pushdown, and keep these + * information in the created {@link Scan}. + * + */ + ScanBuilder newScanBuilder(DataSourceOptions options); --- End diff -- +1
[GitHub] spark issue #16812: [SPARK-19465][SQL] Added options for custom boolean valu...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16812 I still don't think we need this since the workaround is easy. If other committers find it worthwhile, I won't object. If there is no interest in this PR afterwards, I would just close it.
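For context, the kind of workaround referred to is roughly the following sketch: read the custom boolean column as a string and derive a real boolean from it. The column name `flag`, the truthy value `"yes"`, and the file path are made up for illustration, and it assumes an active `spark` session.

```scala
import org.apache.spark.sql.functions._

// Read the CSV as-is, then map the custom values to a proper boolean column.
val df = spark.read.option("header", "true").csv("/tmp/flags.csv")
val fixed = df.withColumn("flag", lower(col("flag")) === "yes")
```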
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23263

> Visualizing a workflow is nice, but Spark's Pipelines are typically pretty straightforward and linear. I could imagine producing a nicer visualization than what you get from reading the Spark UI, although of course we already have some degree of history and data there.

Another good thing to consider is that we can combine this with other SQL events, for instance to see where the input `Dataset` originated. With current Apache Spark, I can visualise SQL operations as below:

![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)

I think we can combine those existing lineages together to easily understand where the data comes from and goes to. (BTW, I hope it doesn't sound like I'm pushing this for my one specific case - I think this hook-like feature can be useful in many ways. I currently have one explicit example to show, so I'm referring to my case.)

> These are just the hooks, right? someone would have to implement something to use these events. I see the value in the API to some degree, but with no concrete implementation, does it add anything for Spark users out of the box?

Yes, right. It does not add anything for Spark users out of the box. It needs a custom listener implementation, for instance the one below:

```scala
class CustomSparkListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: MLEvent => // do something
    case _ => // pass
  }
}
```

There are two (existing) ways to use this:

```scala
spark.sparkContext.addSparkListener(new CustomSparkListener)
```

```bash
spark-submit --conf spark.extraListeners=CustomSparkListener\
  ...
```

It's also similar to other existing implementations on the SQL side (the catalog events described above in the PR description). One actual example I had with a SQL query listener was that I had to close one connection every time after SQL execution.

> Is that what someone would likely do? or would someone likely have to run Atlas to use this? If that's a good example of the use case, and Atlas is really about lineage and governance, is that the thrust of this change, to help with something to do with model lineage and reproducibility?

There are two reasons:

1. I think someone in general would likely utilise this feature like the other event listeners. At least, I can see some interest going on outside.
   - SQL Listener
     - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
     - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html
   - Streaming Query Listener
     - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
     - http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416
2. Someone would likely run this via Atlas so that they could do something about lineage and governance, yes, as you said. Yes, I'm trying to show integrated lineages in Apache Spark, but this is a missing hole. This change is needed for that.
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23256 Ooops, > tbh, more meaningful dataset as example would be better... Did you expect to fix more instances here, Felix? Sorry, I misread.
[GitHub] spark issue #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to fix cod...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23256 Merged to master.
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240004006 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -65,7 +65,19 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { * Fits a model to the input data. */ @Since("2.0.0") - def fit(dataset: Dataset[_]): M + def fit(dataset: Dataset[_]): M = MLEvents.withFitEvent(this, dataset) { +fitImpl(dataset) + } + + /** + * `fit()` handles events and then calls this method. Subclasses should override this + * method to implement the actual fiting a model to the input data. + */ + @Since("3.0.0") + protected def fitImpl(dataset: Dataset[_]): M = { +// Keep this default body for backward compatibility. +throw new UnsupportedOperationException("fitImpl is not implemented.") --- End diff -- Yes, that was my intention. I wanted to force implementing `fitImpl` but was thinking that might be too breaking a change (it would at least break source compatibility). I am willing to follow other suggestions - I am pretty sure you or the other guys are more familiar with the ML side.
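The pattern being discussed - a public `fit` that wraps a protected `fitImpl`, with a throwing default body kept for compatibility - can be sketched in isolation as below. All names here are hypothetical, and `println` stands in for posting events to the listener bus.

```scala
abstract class MyEstimator {
  // Public entry point: wraps the real work with start/end events.
  def fit(data: Seq[Double]): Double = withFitEvent {
    fitImpl(data)
  }

  // Subclasses override this; the default body keeps source/binary compatibility
  // for existing subclasses that only override fit().
  protected def fitImpl(data: Seq[Double]): Double =
    throw new UnsupportedOperationException("fitImpl is not implemented.")

  private def withFitEvent[T](body: => T): T = {
    println(s"FitStart($this)") // stand-in for listenerBus.post(FitStart(...))
    val result = body
    println(s"FitEnd($this)")   // stand-in for listenerBus.post(FitEnd(...))
    result
  }
}
```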
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003885 --- Diff: mllib/src/test/scala/org/apache/spark/ml/MLEventsSuite.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import java.io.File + +import scala.collection.mutable +import scala.concurrent.duration._ + +import org.apache.hadoop.fs.Path +import org.mockito.Matchers.{any, eq => meq} +import org.mockito.Mockito.when +import org.scalatest.BeforeAndAfterEach +import org.scalatest.concurrent.Eventually +import org.scalatest.mockito.MockitoSugar.mock + +import org.apache.spark.{SparkContext, SparkFunSuite} +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util._ +import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent} +import org.apache.spark.sql._ +import org.apache.spark.util.Utils + + +class MLEventsSuite +extends SparkFunSuite +with BeforeAndAfterEach +with DefaultReadWriteTest +with Eventually { + + private var spark: SparkSession = _ + private var sc: SparkContext = _ + private var checkpointDir: String = _ + private var listener: SparkListener = _ + private val dirName: String = "pipeline" + private val events = mutable.ArrayBuffer.empty[MLEvent] + + override def beforeAll(): Unit = { +super.beforeAll() +sc = new SparkContext("local[2]", "SparkListenerSuite") +listener = new SparkListener { + override def onOtherEvent(event: SparkListenerEvent): Unit = event match { +case e: FitStart[_] => events.append(e) +case e: FitEnd[_] => events.append(e) +case e: TransformStart => events.append(e) +case e: TransformEnd => events.append(e) +case e: SaveInstanceStart if e.path.endsWith(dirName) => events.append(e) +case e: SaveInstanceEnd if e.path.endsWith(dirName) => events.append(e) +case _ => + } +} +sc.addSparkListener(listener) + +spark = SparkSession.builder() + .sparkContext(sc) + .getOrCreate() + +checkpointDir = Utils.createDirectory(tempDir.getCanonicalPath, "checkpoints").toString +sc.setCheckpointDir(checkpointDir) --- End diff -- Let me double check and address this while fixing the test. I just copied this from `MLlibTestSparkContext`.
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r240003869 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala --- @@ -132,7 +132,8 @@ class Pipeline @Since("1.4.0") ( * @return fitted pipeline */ @Since("2.0.0") - override def fit(dataset: Dataset[_]): PipelineModel = { + override def fit(dataset: Dataset[_]): PipelineModel = super.fit(dataset) --- End diff -- Ah, it's there just to keep the `@Since`. Looks like some classes don't explicitly note that, so I didn't call `super` in other places.
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23263 The tests pass locally. I'll fix them shortly.
[GitHub] spark issue #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23263 cc @srowen, @cloud-fan (since it mimics SQL's event listener), @jkbradley, @mengxr and @yanboliang. Mind taking a look, please? WDYT about this?
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23263#discussion_r23747 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, } } - protected def transformImpl(dataset: Dataset[_]): DataFrame = { + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { --- End diff -- `transformImpl` (for some abstractions) and `saveImpl` already exist.
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/23261
[GitHub] spark pull request #23263: [SPARK-23674][ML] Adds Spark ML Events
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23263 [SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track them and add some actions for them.

## Introduction

This PR proposes to send some ML events like SQL does. This is quite useful when people want to track and take some actions for the corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes on top of this PR, I can visualise an ML pipeline as below:

![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is very useful to be able to track the SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now - no guarantee about stability).

## Implementation Details

### Send events (but do not expose an ML-specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...StartEvent(caller, input)
@Unstable
case class ...EndEvent(caller, output)

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener to be added, `ExternalCatalogEventListener`
- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) will be created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's too invasive. Also, exposing another ML-specific listener sounds a bit too much at this stage. Therefore, I simply borrowed the file name and structures here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.

### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

Note that `saveImpl` was already implemented, unlike the other instances below.

```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl(...) }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl(...) }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way in ML, as below:

**1. `transform` and `transformImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L191-L196
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

_This keeps both source and binary backward compatibility_. I was thinking of enforcing `...Impl` by leaving them as abstract methods to force implementations, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, _there's no difference_
[GitHub] spark issue #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23261 I'm going to rebase and reopen a PR to retrigger the AppVeyor tests. The failure looks unrelated.
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23261#discussion_r239998861 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -163,7 +163,7 @@ abstract class MLWriter extends BaseReadWrite with Logging { */ @Since("1.6.0") @throws[IOException]("If the input path already exists but overwrite is not enabled.") - def save(path: String): Unit = { + def save(path: String): Unit = MLEvents.withSaveInstanceEvent(this, path) { new FileSystemOverwrite().handleOverwrite(path, shouldOverwrite, sc) saveImpl(path) --- End diff -- and `saveImpl` already exists as well.
[GitHub] spark pull request #23261: [WIP][SPARK-23674][ML] Adds Spark ML Events
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23261#discussion_r239998855 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -210,7 +214,7 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType, } } - protected def transformImpl(dataset: Dataset[_]): DataFrame = { + override protected def transformImpl(dataset: Dataset[_]): DataFrame = { --- End diff -- For clarification, some of the abstractions already have `transformImpl`.
[GitHub] spark pull request #23261: [SPARK-23674][ML] Adds Spark ML Events
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23261 [SPARK-23674][ML] Adds Spark ML Events

## What changes were proposed in this pull request?

This PR proposes to add ML events so that other developers can track them and add some actions for them.

## Introduction

This PR proposes to send some ML events like SQL does. This is quite useful when people want to track and take some actions for the corresponding ML operations. For instance, I have been working on integrating Spark with Atlas, and with some custom changes on top of this PR, I can visualise an ML pipeline as below:

![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png)

Needless to say, it is very useful to be able to track the SQL operations. Likewise, I would like to propose ML events as well (as lowest-stability `@Unstable` APIs for now).

## Implementation Details

### Send events (but do not expose an ML-specific listener)

In `events.scala`, it adds:

```scala
@Unstable
case class ...Events

object MLEvents {
  // Wrappers to send events:
  // def with...Event(body) = {
  //   body()
  //   SparkContext.getOrCreate().listenerBus.post(event)
  // }
}
```

This mimics both:

**1. Catalog events (see `org.apache.spark.sql.catalyst.catalog.events.scala`)**

- This allows a Catalog-specific listener to be added, `ExternalCatalogEventListener`
- It's implemented by wrapping the whole `ExternalCatalog` in `ExternalCatalogWithListener`, which delegates the operations to `ExternalCatalog`

This is not quite possible in this case because most instances (like `Pipeline`) will be created directly in most cases. We might be able to do that by extending `ListenerBus` for all possible instances, but IMHO it's invasive. Also, exposing another ML-specific listener sounds a bit too much for the current status. Therefore, I simply borrowed the file name and structures here.

**2. SQL execution events (see `org.apache.spark.sql.execution.SQLExecution.scala`)**

- Add an object that wraps a body to send events

The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent.

### Add `...Impl` methods to wrap each to send events

**1. `mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala`**

```diff
- def save(...) = { saveImpl(...) }
+ def save(...) = MLEvents.withSaveInstanceEvent { saveImpl(...) }
  def saveImpl(...): Unit = ...
```

Note that `saveImpl` was already implemented, unlike the other instances below.

```diff
- def load(...): T
+ def load(...): T = MLEvents.withLoadInstanceEvent { loadImpl(...) }
+ def loadImpl(...): T
```

**2. `mllib/src/main/scala/org/apache/spark/ml/Estimator.scala`**

```diff
- def fit(...): Model
+ def fit(...): Model = MLEvents.withFitEvent { fitImpl() }
+ def fitImpl(...): Model
```

**3. `mllib/src/main/scala/org/apache/spark/ml/Transformer.scala`**

```diff
- def transform(...): DataFrame
+ def transform(...): DataFrame = MLEvents.withTransformEvent { transformImpl() }
+ def transformImpl(...): DataFrame
```

This approach follows the existing way in ML, as below:

**1. `transform` and `transformImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L202-L213
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala#L190-L196
https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala#L1037-L1042

**2. `save` and `saveImpl`**

https://github.com/apache/spark/blob/9b1f6c8bab5401258c653d4e2efb50e97c6d282f/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L166-L176

Inherited ones are intentionally omitted here for simplicity. They are inherited and implemented in multiple places.

## Backward Compatibility

This keeps both source and binary backward compatibility. I was thinking of enforcing `...Impl` by leaving them as abstract methods, but decided to leave a body that throws `UnsupportedOperationException` so that we can keep full source and binary compatibility.

- From the user-facing API perspective, there's no difference. `...Impl` methods are protected and not visible to end users.
- From the developer API perspective, if some developers want to use the `...` methods instead of `...Impl`, that's still fine. It only does
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239701916 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -968,6 +970,17 @@ predicted <- predict(model, df) head(predicted) ``` + Power Iteration Clustering + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex. + +```{r} +df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), --- End diff -- BTW, when I added that into https://spark.apache.org/contributing.html, we also agreed to follow the committer's judgement based upon the guide, because the guide mentions: > The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code. since we do have a legacy reason, and there is a good reason - consistency and the committer's judgement.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239701364 --- Diff: R/pkg/tests/fulltests/test_mllib_clustering.R --- @@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", { expect_equal(length(local.posterior), sum(unlist(local.posterior))) }) +test_that("spark.assignClusters", { + df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), + list(4L, 0L, 0.1)), schema = c("src", "dst", "weight")) + clusters <- spark.assignClusters(df, initMode = "degree", weightCol = "weight") + expected_result <- createDataFrame(list(list(4L, 1L), + list(0L, 0L), + list(1L, 0L), + list(3L, 1L), + list(2L, 0L)), + schema = c("id", "cluster")) --- End diff -- ditto for style
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239701069 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -968,6 +970,17 @@ predicted <- predict(model, df) head(predicted) ``` + Power Iteration Clustering + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex. + +```{r} +df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), --- End diff -- There are two separate styles already mixed in the R code IIRC:

```r
df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
    list(1L, 2L, 1.0), list(3L, 4L, 1.0),
    list(4L, 0L, 0.1)),
  schema = c("src", "dst", "weight"))
```

or

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
```

Let's avoid mixing styles, and let's go for the latter one when possible because at least that looks more compliant with the code style.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239700846 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -968,6 +970,17 @@ predicted <- predict(model, df) head(predicted) ``` + Power Iteration Clustering + +Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex. + +```{r} +df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0), + list(1L, 2L, 1.0), list(3L, 4L, 1.0), --- End diff -- Yea, we do have an indentation rule: see the "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented, but at least there are some examples. I think the correct indentation is:

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
```
[GitHub] spark pull request #23226: [SPARK-26286][TEST] Add MAXIMUM_PAGE_SIZE_BYTES e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23226#discussion_r239662927 --- Diff: core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java --- @@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() { } catch (IllegalArgumentException e) { // expected exception } + +try { + new BytesToBytesMap( +taskMemoryManager, +1, +TaskMemoryManager.MAXIMUM_PAGE_SIZE_BYTES+1); --- End diff -- `E_BYTES+1` -> `E_BYTES + 1`
[GitHub] spark issue #23248: [SPARK-26293][SQL] Cast exception when having python udf...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23248 Thanks, @cloud-fan. I will take a look by tomorrow - don't be blocked by me.
[GitHub] spark pull request #23248: [SPARK-26293][SQL] Cast exception when having pyt...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23248#discussion_r239441136 --- Diff: python/pyspark/sql/tests/test_udf.py --- @@ -23,7 +23,7 @@ from pyspark import SparkContext from pyspark.sql import SparkSession, Column, Row -from pyspark.sql.functions import UserDefinedFunction +from pyspark.sql.functions import UserDefinedFunction, udf --- End diff -- Ah, yea. It's okay, and I think it's good timing to clean up while we are here, now that it's broken down into multiple test files.
[GitHub] spark issue #23245: [SPARK-26060][SQL][FOLLOW-UP] Rename the config name.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23245 retest this please
[GitHub] spark issue #23246: [SPARK-26292][CORE]Assert statement of currentPage may b...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23246 Ditto. Let's close these and focus on fixing actual bugs.
[GitHub] spark issue #23247: [SPARK-26294][CORE]Delete Unnecessary If statement
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23247 @wangjiaochun, I think you'd better stop fixing trivial stuff in separate PRs. That stuff can be fixed when the code around it is changed, or other people can fix it later.
[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23215 +1 from me too. It makes sense to me, and having a configuration sounds good.
[GitHub] spark issue #23243: [branch-2.4][ExternalShuffleService]add initRegisteredEx...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23243 A backport from which JIRA, @weixiuli? Usually a fix should go to master first and then be backported to other branches when needed. If it should be fixed in the master branch as well, let's file a JIRA and switch this PR's branch to master.
[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23237 Looks like some classes, for instance `KafkaUtils`, expose this (I guess mistakenly?). Let's not bother with this and close this PR.
[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23237 retest this please
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 Merged to master. ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 Thanks, guys. I will try to take a look at it as well (although it's going to take a relatively long while). Let me get this in for now anyway. ---
[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23225 If a test passes both with and without this change, it doesn't verify the regression. ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 Thanks for the close look, @BryanCutler. I increased the timeout because I was trying to keep the test as is while fixing it. The test itself doesn't look like it takes much time, but it can be flaky when resource usage is heavy. If you feel this can be resolved in another way, would you mind taking a look at it? I'm happy to close this one. Actually, I'm not very familiar with the ML tests. ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 retest this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 retest this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 test this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 test this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 test this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 test this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 test this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 retest this please ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 retest this please ---
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23120 a late LGTM as well ---
[GitHub] spark pull request #23235: [SPARK-26151][SQL][FOLLOWUP] Return partial resul...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23235#discussion_r239049825 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -35,6 +35,8 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, CSV datasource uses java.time API for parsing and generating CSV content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. + - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV string to a row with all `null`s in the PERMISSIVE mode if specified schema is `StructType`. Since Spark 3.0, returned row can contain non-`null` fields if some of CSV column values were parsed and converted to desired types successfully. --- End diff -- Ah, `from_csv` and `to_csv` are added in 3.0, so they are intentionally not mentioned. BTW, I think the CSV functionality can only take a `StructType`, so maybe we don't have to mention it. ---
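For illustration, here is a minimal, self-contained sketch of the PERMISSIVE-mode behavior change quoted above. It assumes a local `SparkSession`; the object name, column names, and the input row are made up for the example.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructType}

object PermissiveCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("permissive-csv").getOrCreate()
    import spark.implicits._

    // Two integer columns; the second value of the input row is malformed.
    val schema = new StructType().add("a", IntegerType).add("b", IntegerType)
    val input = Seq("1,abc").toDS()

    // PERMISSIVE is the default parse mode for the CSV datasource.
    spark.read.schema(schema).csv(input).show()
    // Spark 2.4 and earlier: the whole row is returned as [null, null].
    // Since Spark 3.0: the partially parsed row [1, null] is returned.

    spark.stop()
  }
}
```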
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 cc @BryanCutler and @viirya ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 (cc @squito as well since it's from #23111) ---
[GitHub] spark issue #23236: [SPARK-26275][PYTHON][ML] Increases timeout for Streamin...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23236 retest this please ---
[GitHub] spark pull request #23236: [SPARK-26275][PYTHON][ML] Increases timeout for S...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/23236

[SPARK-26275][PYTHON][ML] Increases timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test

## What changes were proposed in this pull request?

Looks this test is flaky

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

```
==
FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
```

This looks happening after increasing the parallelism in Jenkins to speed up at https://github.com/apache/spark/pull/23111. I am able to reproduce this manually when the resource usage is heavy (with manual decrease of timeout).

## How was this patch tested?

Manually tested by

```
cd python
./run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction' --python-executables=python
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-26275

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23236.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23236

commit 3c4ee75c4d0585702cd87cc4df9af74e235bb431
Author: Hyukjin Kwon
Date: 2018-12-05T12:17:21Z

    Increases timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test

---
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23227 retest this please ---
[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12
GitHub user HyukjinKwon reopened a pull request: https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12

### What is this PR for?
This just updates Scala to 2.11.12 to be consistent with Spark (SPARK-24418). This PR takes over and closes #3033. There was a minor conflict which my PR (https://github.com/apache/zeppelin/pull/3206) introduced; that change is compatible with both Scala 2.11.8 and 2.11.12, so we don't need to change it anymore.

### What type of PR is it?
[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?
- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3253

commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang
Date: 2018-06-19T07:32:27Z

    ZEPPELIN-3551. Upgrade Scala to 2.11.12

---
[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12
Github user HyukjinKwon closed the pull request at: https://github.com/apache/zeppelin/pull/3253 ---
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23229 I don't think it's worth changing a variable name in its own PR. Let's do that when we fix some code around here, or let other people fix it later. ---
[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23224 Can we file a JIRA? I think it's not minor. ---
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23230 Oh, the original one targeted 3.0. Although this doc change could go into branch-2.4 on its own as well, let me revert it in branch-2.4 to keep the management simple. ---
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23230 Merged to master and branch-2.4. ---
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23227 ok to test ---
[GitHub] spark pull request #23226: [MINOR][TEST] Add MAXIMUM_PAGE_SIZE_BYTES Excepti...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23226#discussion_r238977650 --- Diff: core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java --- @@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() { } catch (IllegalArgumentException e) { // expected exception } + +try { + new BytesToBytesMap( + taskMemoryManager, --- End diff -- Let's keep the indentation consistent ---
[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23225 Also, it needs a JIRA; it's not a minor one. ---
[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23225 How do the existing tests cover whether the empty file is created or not? ---
[GitHub] zeppelin issue #3034: [WIP] ZEPPELIN-3552. Support Scala 2.12 of SparkInterp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3034 Hey Jeff, I can take over this one too if you're busy, since I took a look at similar code paths. ---
[GitHub] zeppelin pull request #3253: [ZEPPELIN-3551] Upgrade Scala to 2.11.12
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/zeppelin/pull/3253

[ZEPPELIN-3551] Upgrade Scala to 2.11.12

### What is this PR for?
This just updates Scala to 2.11.12 to be consistent with Spark (SPARK-24418). This PR takes over and closes #3033. There was a minor conflict which my PR (https://github.com/apache/zeppelin/pull/3206) introduced; that change is compatible with both Scala 2.11.8 and 2.11.12, so we don't need to change it anymore.

### What type of PR is it?
[Improvement]

### Todos
* [ ] - None

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3551

### How should this be tested?
- CI pass

### Screenshots (if appropriate)

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/zeppelin scala-2.11.12

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/3253.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3253

commit 7d00314f8ec48556f75fe9877dcbee0d6e7ce2fe
Author: Jeff Zhang
Date: 2018-06-19T07:32:27Z

    ZEPPELIN-3551. Upgrade Scala to 2.11.12

---
[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23203 Thank you @cloud-fan, @viirya, @srowen, and @BryanCutler. ---
[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23203 Merged to master. ---
[GitHub] spark pull request #23216: [SPARK-26264][CORE]It is better to add @transient...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23216#discussion_r238892679 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala --- @@ -56,7 +56,7 @@ private[spark] class ResultTask[T, U]( stageAttemptId: Int, taskBinary: Broadcast[Array[Byte]], partition: Partition, -locs: Seq[TaskLocation], +@transient private var locs: Seq[TaskLocation], --- End diff -- why is it `var` BTW? ---
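For readers following the `@transient` discussion above, here is a minimal, self-contained sketch showing why a `@transient` constructor field (such as preferred locations that only matter on the driver) needs a null guard after deserialization. The class and field names are made up; this is not the actual `ResultTask` code.

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustrative only: the locations are needed only on the driver for scheduling,
// so they are excluded from the serialized task with @transient.
class DemoTask(
    val stageId: Int,
    @transient private var locs: Seq[String])
  extends Serializable {

  // After deserialization on an executor, `locs` is null, hence the guard.
  def preferredLocations: Seq[String] = if (locs == null) Nil else locs
}

object DemoTask {
  def main(args: Array[String]): Unit = {
    val out = new ByteArrayOutputStream()
    new ObjectOutputStream(out).writeObject(new DemoTask(1, Seq("host1", "host2")))
    val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
    val copy = in.readObject().asInstanceOf[DemoTask]
    println(copy.preferredLocations) // List(): the @transient field was not serialized
  }
}
```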
[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23203#discussion_r238887812 --- Diff: python/run-tests.py --- @@ -93,17 +93,18 @@ def run_individual_python_test(target_dir, test_name, pyspark_python): "pyspark-shell" ] env["PYSPARK_SUBMIT_ARGS"] = " ".join(spark_args) - -LOGGER.info("Starting test(%s): %s", pyspark_python, test_name) +str_test_name = " ".join(test_name) +LOGGER.info("Starting test(%s): %s", pyspark_python, str_test_name) start_time = time.time() try: per_test_output = tempfile.TemporaryFile() retcode = subprocess.Popen( -[os.path.join(SPARK_HOME, "bin/pyspark"), test_name], --- End diff -- Oh, yeah. It looks like that's going to reduce the diff. Let me try. ---
[GitHub] zeppelin issue #3033: ZEPPELIN-3551. Upgrade Scala to 2.11.12
Github user HyukjinKwon commented on the issue: https://github.com/apache/zeppelin/pull/3033 Hey @zjffdu, busy? I can take this over - looks there's only minor conflict. ---
[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23203 Yeah, I will update it as well after this one gets merged. ---
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 It's fixed in the upcoming Spark release; Spark 2.4 does not support it. ---
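For context, a usage sketch of the option in question. It assumes the `lineSep` CSV read option (SPARK-26108) is available, i.e. a release newer than 2.4; the object name, path, and separator are made up.

```
import org.apache.spark.sql.SparkSession

object LineSepCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("linesep-csv").getOrCreate()

    // Read records separated by a custom character instead of the default newline.
    val df = spark.read
      .option("lineSep", "\u0001")
      .option("header", "true")
      .csv("/tmp/data.csv")

    df.show()
    spark.stop()
  }
}
```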
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238496050 --- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala --- @@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter { override def beforeAll() { super.beforeAll() TestHive.setCacheTables(true) -// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*) -TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) +// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*) --- End diff -- @MaxGekk, BTW, why does this have to be GMT? ---
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r238495344 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0. + - Since Spark 3.0, JSON datasource uses java.time API for parsing and generating JSON content. New formatting implementation supports date/timestamp patterns conformed to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`. --- End diff -- I think we can add an example that shows the difference. IIRC there is a difference regarding exact versus non-exact pattern matching. ---
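A small sketch of the exact-match difference mentioned above: the legacy `SimpleDateFormat` accepts any input whose prefix matches the pattern and silently ignores the rest, while `java.time`'s `DateTimeFormatter` requires the whole input to match. The object name and input strings are made up.

```
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object ExactMatchSketch {
  def main(args: Array[String]): Unit = {
    // Legacy parser: stops at the first character it cannot parse,
    // so the trailing "T12:00:00" is silently ignored.
    val legacy = new SimpleDateFormat("yyyy-MM-dd")
    println(legacy.parse("2018-12-05T12:00:00")) // parsed as 2018-12-05

    // java.time parser: the whole input must match the pattern.
    val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    println(LocalDate.parse("2018-12-05", modern)) // 2018-12-05
    // LocalDate.parse("2018-12-05T12:00:00", modern) would throw DateTimeParseException.
  }
}
```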
[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22590 Hm, let's try to find a way to expose other parse options. I think we shouldn't expose every option available in Univocity ... ---