[GitHub] [spark] wangyum commented on issue #25018: [SPARK-26321][SQL] Port HIVE-15297: Hive should not split semicolon within quoted string literals
URL: https://github.com/apache/spark/pull/25018#issuecomment-540369275

Thank you, @HyukjinKwon. I added another test case.
[GitHub] [spark] HyukjinKwon removed a comment on issue #25820: [SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected
URL: https://github.com/apache/spark/pull/25820#issuecomment-540358900

This issue exists in Spark 2.4 as well. Let me open a backporting PR.
[GitHub] [spark] HyukjinKwon commented on issue #25820: [SPARK-29101][SQL] Fix count API for csv file when DROPMALFORMED mode is selected
URL: https://github.com/apache/spark/pull/25820#issuecomment-540358900

This issue exists in Spark 2.4 as well. Let me open a backporting PR.
[GitHub] [spark] shivusondur commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.
URL: https://github.com/apache/spark/pull/25561#discussion_r22463

## File path: docs/sql-ref-syntax-aux-show-tables.md

```diff
@@ -38,10 +38,13 @@ SHOW TABLES [{FROM|IN} database_name] [LIKE 'regex_pattern']
 LIKE 'regex_pattern'
-  Specifies the regex pattern that is used to filter out unwanted tables.
-  The pattern is a regex except `*` and `|` characters.
-  `*` matches 0 or more characters and `|` is used to provide more than one regex with an OR condition.
-  The leading and trailing blanks are trimmed in the input pattern before processing.
+  Specifies the regular expression pattern that is used to filter out unwanted tables.
+
+  Except for the `*` and `|` characters, the pattern follows the regular expression convention.
+  `*` matches 0 or more characters, and `|` is used to provide more than one regex with an OR condition.
```

Review comment: Comments handled. Thanks for reviewing.
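To make the documented `LIKE` semantics concrete, a small hedged example; the database and table names are made up:

```sql
-- `*` matches zero or more characters; `|` separates alternative patterns.
-- Lists any table in `userdb` whose name starts with 'sam' or is exactly 'suj'.
SHOW TABLES IN userdb LIKE 'sam*|suj';
```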
[GitHub] [spark] yaooqinn commented on issue #26068: [SPARK-29405][SQL] Alter table / Insert statements should not change a table's ownership
URL: https://github.com/apache/spark/pull/26068#issuecomment-540330063

test this please
[GitHub] [spark] yaooqinn commented on issue #25648: [SPARK-28947][K8S] Status logging not happens at an interval for liveness
URL: https://github.com/apache/spark/pull/25648#issuecomment-540321431

retest this please
[GitHub] [spark] wangyum commented on a change in pull request #26014: [SPARK-29349][SQL] Support FETCH_PRIOR in Thriftserver fetch request
URL: https://github.com/apache/spark/pull/26014#discussion_r10328

## File path: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala

```diff
@@ -684,6 +685,92 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest {
       assert(e.getMessage.contains("org.apache.spark.sql.catalyst.parser.ParseException"))
     }
   }
+
+  test("ThriftCLIService FetchResults FETCH_FIRST, FETCH_NEXT, FETCH_PRIOR") {
+    def checkResult(rows: RowSet, start: Long, end: Long): Unit = {
+      assert(rows.getStartOffset() == start)
+      assert(rows.numRows() == end - start)
+      rows.iterator.asScala.zip((start until end).iterator).foreach { case (row, v) =>
+        assert(row(0).asInstanceOf[Long] === v)
+      }
+    }
+
+    withCLIServiceClient { client =>
+      val user = System.getProperty("user.name")
+      val sessionHandle = client.openSession(user, "")
+
+      val confOverlay = new java.util.HashMap[java.lang.String, java.lang.String]
+      val operationHandle = client.executeStatement(
+        sessionHandle,
+        "SELECT * FROM range(10)",
+        confOverlay) // 10 rows result with sequence 0, 1, 2, ..., 9
+      var rows: RowSet = null
+
+      // Fetch 5 rows with FETCH_NEXT
+      rows = client.fetchResults(
+        operationHandle, FetchOrientation.FETCH_NEXT, 5, FetchType.QUERY_OUTPUT)
+      checkResult(rows, 0, 5) // fetched [0, 5)
+
+      // Fetch another 2 rows with FETCH_NEXT
+      rows = client.fetchResults(
+        operationHandle, FetchOrientation.FETCH_NEXT, 2, FetchType.QUERY_OUTPUT)
+      checkResult(rows, 5, 7) // fetched [5, 7)
+
+      // FETCH_PRIOR 3 rows
```

Review comment: The `startOffset` is `5` at this time. If we do FETCH_PRIOR one row, it becomes 4, but PostgreSQL returns `5`. Is this what we expect?

```sql
postgres=# begin;
BEGIN
postgres=# declare rf_cur scroll cursor for select * from generate_series(0, 9);
DECLARE CURSOR
postgres=# FETCH 5 from rf_cur;
 generate_series
-----------------
               0
               1
               2
               3
               4
(5 rows)

postgres=# FETCH 2 from rf_cur;
 generate_series
-----------------
               5
               6
(2 rows)

postgres=# FETCH PRIOR from rf_cur;
 generate_series
-----------------
               5
(1 row)

postgres=# commit;
COMMIT
```

https://www.postgresql.org/docs/11/sql-fetch.html
[GitHub] [spark] dongjoon-hyun edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540297020

Thanks, @nrchakradhar. To me, this seems to be dumped from that repo by the author. I took a look. The current status is difficult to consider for an Apache project contribution because there is no Apache license header and the package name still carries the `radanalytics` authorship. I'd like to recommend @jkremser reopen this after cleaning up the PR by adding licenses and removing the authorship package name. I'm closing this for now. You can use the `./dev/check-license` script to check for missing licenses.
[GitHub] [spark] dongjoon-hyun commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540300821

Definitely, +1 for @holdenk's comment.
[GitHub] [spark] holdenk commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540299867

I think that before putting in work on this PR, getting buy-in from the `dev@` list might be a good first step.
[GitHub] [spark] dongjoon-hyun closed pull request #26067: [SPARK-28989][SQL][FollowUp] Update ANSI mode related config names in comments
URL: https://github.com/apache/spark/pull/26067
[GitHub] [spark] beliefer commented on a change in pull request #25416: [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
URL: https://github.com/apache/spark/pull/25416#discussion_r332897164

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

```diff
@@ -399,6 +399,7 @@ queryOrganization
     (SORT BY sort+=sortItem (',' sort+=sortItem)*)?
     windowClause?
     (LIMIT (ALL | limit=expression))?
+    (OFFSET offset=expression)?
```

Review comment: Yes, the SQL standard defines the order of the `LIMIT` and `OFFSET` clauses: `LIMIT` comes in front of `OFFSET`. This behavior looks contrary to MySQL's direction; MySQL supports `LIMIT 10,10`.
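For illustration, a hedged sketch of the two syntax families being contrasted; the table `t` and column `id` are made up, and the MySQL form is shown only for comparison:

```sql
-- Standard-style order matching the grammar change above:
-- skip the first 10 rows, then return the next 10.
SELECT * FROM t ORDER BY id LIMIT 10 OFFSET 10;

-- MySQL's shorthand reverses the operands: LIMIT offset, count.
SELECT * FROM t ORDER BY id LIMIT 10, 10;
```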
[GitHub] [spark] dongjoon-hyun edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540297020

Thanks, @nrchakradhar. To me, this seems to be dumped from that repo by the author. I took a look. The current status is difficult to consider for an Apache project contribution because there is no Apache license header and the package name still carries the `radanalytics` authorship. I'd like to recommend @jkremser reopen this after cleaning up the PR by adding licenses and removing the authorship package name. I'm closing this for now. You can use the `./dev/check-license` script to check for missing licenses.
[GitHub] [spark] dongjoon-hyun closed pull request #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075
[GitHub] [spark] dongjoon-hyun commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540297020

Thanks, @nrchakradhar. To me, this seems to be dumped from that repo by the author. I took a look. The current status is difficult to consider for an Apache project contribution because there is no Apache license header and the package name still carries the `radanalytics` authorship. I'd like to recommend @jkremser reopen this after cleaning up the PR by adding licenses and removing the authorship package name. I'm closing this for now.
[GitHub] [spark] cloud-fan commented on a change in pull request #26071: [SPARK-29412][SQL] refine the document of v2 session catalog config
URL: https://github.com/apache/spark/pull/26071#discussion_r02564

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -1976,11 +1977,19 @@ object SQLConf {
     .stringConf
     .createOptional

-  val V2_SESSION_CATALOG = buildConf("spark.sql.catalog.session")
-    .doc("A catalog implementation that will be used in place of the Spark built-in session " +
-      "catalog for v2 operations. The implementation may extend `CatalogExtension` to be " +
-      "passed the Spark built-in session catalog, so that it may delegate calls to the " +
-      "built-in session catalog.")
+  val V2_SESSION_CATALOG_IMPLEMENTATION =
+    buildConf(s"spark.sql.catalog.${CatalogManager.SESSION_CATALOG_NAME}")
+      .doc("A catalog implementation that will be used in place of the Spark Catalog for v2 " +
+        "operations (e.g. create table using a v2 source, alter a v2 table). The Spark Catalog " +
+        "is the current catalog by default, and supports all kinds of catalog operations like " +
+        "CREATE TABLE USING v1/v2 source, VIEW/FUNCTION related operations, etc. This config is " +
+        "used to extend the Spark Catalog and inject custom logic into v2 operations, while " +
+        "other operations still go through the Spark Catalog. The catalog implementation " +
+        "specified by this config should extend `CatalogExtension` to be passed the Spark " +
+        "Catalog, so that it can delegate calls to the Spark Catalog. Otherwise, the " +
+        "implementation should figure out a way to access the Spark Catalog or its underlying " +
+        "meta-store by itself. It's important to make the implementation share the underlying " +
+        "meta-store of the Spark Catalog and act as an extension, instead of a separate catalog.")
```

Review comment: Thanks for the suggestion!
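For context, a minimal sketch of how a user might register a session-catalog extension under this config. The literal config key and the extension class name are assumptions for illustration; the key resolves from `CatalogManager.SESSION_CATALOG_NAME` in the diff above, and a real implementation would extend `CatalogExtension` as the doc string requires:

```scala
// Hypothetical CatalogExtension implementation registered under the session
// catalog's config key. Spark instantiates the class and, because it extends
// CatalogExtension, hands it the built-in session catalog to delegate to.
spark.conf.set(
  "spark.sql.catalog.session",          // assumed value of SESSION_CATALOG_NAME
  "com.example.MyCatalogExtension")     // hypothetical class name
```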
[GitHub] [spark] liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#issuecomment-540284189

@srowen Thanks again, I have finished the change.
[GitHub] [spark] HyukjinKwon commented on issue #25018: [SPARK-26321][SQL] Port HIVE-15297: Hive should not split semicolon within quoted string literals
URL: https://github.com/apache/spark/pull/25018#issuecomment-540277064

retest this please
[GitHub] [spark] liucht-inspur commented on a change in pull request #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#discussion_r333292223

## File path: core/src/main/resources/org/apache/spark/ui/static/executorspage-template.html

```diff
@@ -37,8 +37,8 @@ Summary
     Disk Used
     Cores
-    Active Tasks
-    Failed Tasks
+    Active Tasks
+    Failed Tasks
```

Review comment: Yes, they are more suitable. Thank you!
[GitHub] [spark] HyukjinKwon commented on issue #26053: [SPARK-29379][SQL]SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
URL: https://github.com/apache/spark/pull/26053#issuecomment-540272697

retest this please
[GitHub] [spark] HyukjinKwon commented on issue #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#issuecomment-540272481

Seems fine.
[GitHub] [spark] nrchakradhar commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540270821

This is the same as https://github.com/radanalyticsio/spark-operator.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows
URL: https://github.com/apache/spark/pull/26013#discussion_r333285090

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

```diff
@@ -501,4 +513,88 @@ trait Row extends Serializable {
   private def getAnyValAs[T <: AnyVal](i: Int): T =
     if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
     else getAs[T](i)
+
+  /** The compact JSON representation of this row. */
+  def json: String = compact(jsonValue)
```

Review comment: There's one API case where we dropped a performance improvement in `Row`, as an example (see https://github.com/apache/spark/pull/23271).

```scala
@deprecated("This method is deprecated and will be removed in future versions.", "3.0.0")
def merge(rows: Row*): Row = {
  // TODO: Improve the performance of this if used in performance critical part.
  new GenericRow(rows.flatMap(_.toSeq).toArray)
}
```

Do you mind if I ask to add `@Unstable` or `@Private` to these new APIs instead, just in case of future improvement, with a `@since` tag? `Row` itself is marked as `@Stable`, so it might be better to explicitly note that this can be changed in the future. With this, LGTM.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows
URL: https://github.com/apache/spark/pull/26013#discussion_r333285090

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

```diff
@@ -501,4 +513,88 @@ trait Row extends Serializable {
   private def getAnyValAs[T <: AnyVal](i: Int): T =
     if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
     else getAs[T](i)
+
+  /** The compact JSON representation of this row. */
+  def json: String = compact(jsonValue)
```

Review comment: There's one API case where we dropped a performance improvement in `Row`, as an example (see https://github.com/apache/spark/pull/23271).

```scala
@deprecated("This method is deprecated and will be removed in future versions.", "3.0.0")
def merge(rows: Row*): Row = {
  // TODO: Improve the performance of this if used in performance critical part.
  new GenericRow(rows.flatMap(_.toSeq).toArray)
}
```

Do you mind if I ask to add `@Unstable` or `@Private` to these new APIs instead, just in case of future improvement, with `@since` in the Scaladoc? `Row` itself is marked as `@Stable`, so it might be better to explicitly note that this can be changed in the future. With this, LGTM.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows
URL: https://github.com/apache/spark/pull/26013#discussion_r333285090

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

```diff
@@ -501,4 +513,88 @@ trait Row extends Serializable {
   private def getAnyValAs[T <: AnyVal](i: Int): T =
     if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
     else getAs[T](i)
+
+  /** The compact JSON representation of this row. */
+  def json: String = compact(jsonValue)
```

Review comment: There's one API case where we dropped a performance improvement in `Row`, as an example.

```scala
@deprecated("This method is deprecated and will be removed in future versions.", "3.0.0")
def merge(rows: Row*): Row = {
  // TODO: Improve the performance of this if used in performance critical part.
  new GenericRow(rows.flatMap(_.toSeq).toArray)
}
```

Do you mind if I ask to add `@Unstable` or `@Private` to these new APIs instead, just in case of future improvement, with a `@since` tag? `Row` itself is marked as `@Stable`, so it might be better to explicitly note that this can be changed in the future. With this, LGTM.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows
URL: https://github.com/apache/spark/pull/26013#discussion_r333285090

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala

```diff
@@ -501,4 +513,88 @@ trait Row extends Serializable {
   private def getAnyValAs[T <: AnyVal](i: Int): T =
     if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
     else getAs[T](i)
+
+  /** The compact JSON representation of this row. */
+  def json: String = compact(jsonValue)
```

Review comment: There's one API case where we dropped a performance improvement in `Row`, as an example.

```scala
@deprecated("This method is deprecated and will be removed in future versions.", "3.0.0")
def merge(rows: Row*): Row = {
  // TODO: Improve the performance of this if used in performance critical part.
  new GenericRow(rows.flatMap(_.toSeq).toArray)
}
```

Do you mind if I ask to add `@Unstable` or `@Private` to these new APIs instead, just in case of future improvement, with a `@since` tag?
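As a concrete sketch of what the annotations requested above could look like on the new `json` method; the annotation placement and the `@since` value are assumptions, not the committed code:

```scala
import org.apache.spark.annotation.Unstable

/**
 * The compact JSON representation of this row.
 * @since 3.0.0
 */
@Unstable
def json: String = compact(jsonValue)
```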
[GitHub] [spark] cloud-fan commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540259061

> reducing the cost of .copy() by refactoring the ExpressionEncoder class

That sounds like a good idea to me. Can we do that first?
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#discussion_r333283788

## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala

```diff
@@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext {
   def executors(): String = {
     val sb = new StringBuilder
     val store = uiRoot.asInstanceOf[SparkUI].store
-    val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_")
     store.executorList(true).foreach { executor =>
-      val prefix = s"metrics_${appId}_${executor.id}_executor_"
-      sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n")
-      sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n")
-      sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n")
-      sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n")
-      sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n")
-      sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n")
-      sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n")
-      sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n")
-      sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n")
-      sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n")
-      sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n")
-      sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n")
-      sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n")
-      sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n")
-      sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n")
+      val prefix = "metrics_executor_"
+      val labels = Seq(
+        "application_id" -> store.applicationInfo.id,
+        "application_name" -> store.applicationInfo.name,
+        "executor_id" -> executor.id
+      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
+      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
```

Review comment: For that Prometheus question, different labels mean different time series in Prometheus.

> Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.

Here are the references for the details~
- https://prometheus.io/docs/concepts/data_model/
- https://prometheus.io/docs/practices/naming/#labels
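For illustration, the exposition lines produced by the new code would look roughly like this; the application and executor identifiers below are made up:

```text
metrics_executor_rddBlocks_Count{application_id="app-20191009-0001", application_name="MyApp", executor_id="1"} 5
metrics_executor_memoryUsed_Count{application_id="app-20191009-0001", application_name="MyApp", executor_id="1"} 524288
```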
[GitHub] [spark] cloud-fan commented on issue #19424: [SPARK-22197][SQL] push down operators to data source before planning
URL: https://github.com/apache/spark/pull/19424#issuecomment-540257605

@rafaelkyrdan Limit pushdown is not supported yet.
[GitHub] [spark] HyukjinKwon closed pull request #26041: [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now
URL: https://github.com/apache/spark/pull/26041
[GitHub] [spark] HyukjinKwon commented on issue #26041: [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now
URL: https://github.com/apache/spark/pull/26041#issuecomment-540256081

I am merging this to keep our AppVeyor build alive for now.
[GitHub] [spark] dongjoon-hyun closed pull request #26073: [SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
URL: https://github.com/apache/spark/pull/26073
[GitHub] [spark] HyukjinKwon commented on a change in pull request #26041: [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now
URL: https://github.com/apache/spark/pull/26041#discussion_r333282310

## File path: appveyor.yml

```diff
@@ -57,6 +57,8 @@ build_script:

 environment:
   NOT_CRAN: true
+  # See SPARK-29378.
+  ARROW_PRE_0_15_IPC_FORMAT: 1
```

Review comment: I removed it in the latest commit :-).
[GitHub] [spark] viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#discussion_r333281688

## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala

```diff
@@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext {
   def executors(): String = {
     val sb = new StringBuilder
     val store = uiRoot.asInstanceOf[SparkUI].store
-    val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_")
     store.executorList(true).foreach { executor =>
-      val prefix = s"metrics_${appId}_${executor.id}_executor_"
-      sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n")
-      sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n")
-      sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n")
-      sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n")
-      sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n")
-      sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n")
-      sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n")
-      sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n")
-      sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n")
-      sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n")
-      sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n")
-      sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n")
-      sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n")
-      sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n")
-      sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n")
+      val prefix = "metrics_executor_"
+      val labels = Seq(
+        "application_id" -> store.applicationInfo.id,
+        "application_name" -> store.applicationInfo.name,
+        "executor_id" -> executor.id
+      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
+      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
```

Review comment: I may misunderstand Prometheus's approach. If so, then this might not be a problem.
[GitHub] [spark] viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#discussion_r333281562

## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala

```diff
@@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext {
   def executors(): String = {
     val sb = new StringBuilder
     val store = uiRoot.asInstanceOf[SparkUI].store
-    val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_")
     store.executorList(true).foreach { executor =>
-      val prefix = s"metrics_${appId}_${executor.id}_executor_"
-      sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n")
-      sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n")
-      sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n")
-      sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n")
-      sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n")
-      sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n")
-      sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n")
-      sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n")
-      sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n")
-      sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n")
-      sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n")
-      sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n")
-      sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n")
-      sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n")
-      sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n")
+      val prefix = "metrics_executor_"
+      val labels = Seq(
+        "application_id" -> store.applicationInfo.id,
+        "application_name" -> store.applicationInfo.name,
+        "executor_id" -> executor.id
+      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
+      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
```

Review comment:

> No. The Prometheus query language supports handling them individually.

Yes. But what I am wondering is: now all numbers from all applications are recorded under the same metric. To retrieve the number for a specified application, doesn't Prometheus need to search for it among all applications' metric numbers?
[GitHub] [spark] dongjoon-hyun commented on issue #25984: [SPARK-29308][BUILD] Update deps in dev/deps/spark-deps-hadoop-3.2 for hadoop-3.2
URL: https://github.com/apache/spark/pull/25984#issuecomment-540251763

Unfortunately, Jenkins is down. Let's wait until Jenkins is back.
[GitHub] [spark] dongjoon-hyun commented on issue #24851: [SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#issuecomment-540247560

Thank you for updating, @s1ck. I'll take a look tonight~
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
URL: https://github.com/apache/spark/pull/26077#discussion_r333274284

## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala

```diff
@@ -274,12 +274,19 @@ private[spark] object ThreadUtils {
    * @return new collection in which each element was given from the input collection `in` by
    *         applying the lambda function `f`.
    */
-  def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = {
+  def parmap[I, O, Col[X] <: TraversableOnce[X]]
```

Review comment: Thank you for the investigation and the conclusion, @zsxwing and @srowen.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#discussion_r333270955

## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala

```diff
@@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext {
   def executors(): String = {
     val sb = new StringBuilder
     val store = uiRoot.asInstanceOf[SparkUI].store
-    val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_")
     store.executorList(true).foreach { executor =>
-      val prefix = s"metrics_${appId}_${executor.id}_executor_"
-      sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n")
-      sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n")
-      sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n")
-      sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n")
-      sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n")
-      sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n")
-      sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n")
-      sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n")
-      sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n")
-      sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n")
-      sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n")
-      sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n")
-      sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n")
-      sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n")
-      sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n")
+      val prefix = "metrics_executor_"
+      val labels = Seq(
+        "application_id" -> store.applicationInfo.id,
+        "application_name" -> store.applicationInfo.name,
+        "executor_id" -> executor.id
+      ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")
+      sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n")
```

Review comment: Right. The redundant information is moved to labels.

> Actually prefix is fixed? So they are now the same metrics. And application id, executor id now are labels on them?

No. The Prometheus query language supports handling them individually.

> Does it have a bad impact on the metrics usage later? Because now all applications are recorded under the same metrics.
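For instance, a hedged PromQL sketch of how one application's series can be selected individually even though the metric name is now shared; the application ID below is made up:

```text
# Active tasks for a single application, summed across its executors.
sum by (application_name) (
  metrics_executor_activeTasks_Count{application_id="app-20191009-0001"}
)
```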
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
URL: https://github.com/apache/spark/pull/26060#discussion_r333270410

## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala

```diff
@@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext {
   def executors(): String = {
     val sb = new StringBuilder
     val store = uiRoot.asInstanceOf[SparkUI].store
-    val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_")
     store.executorList(true).foreach { executor =>
-      val prefix = s"metrics_${appId}_${executor.id}_executor_"
-      sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n")
-      sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n")
-      sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n")
-      sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n")
-      sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n")
-      sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n")
-      sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n")
-      sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n")
-      sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n")
-      sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n")
-      sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n")
-      sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n")
-      sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n")
-      sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n")
-      sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n")
+      val prefix = "metrics_executor_"
+      val labels = Seq(
+        "application_id" -> store.applicationInfo.id,
+        "application_name" -> store.applicationInfo.name,
```

Review comment: This is the only *human-readable* field for distinguishing the jobs.
[GitHub] [spark] zsxwing commented on issue #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
URL: https://github.com/apache/spark/pull/26077#issuecomment-540236633

Closing my PR.
[GitHub] [spark] zsxwing closed pull request #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
URL: https://github.com/apache/spark/pull/26077
[GitHub] [spark] zsxwing commented on a change in pull request #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
URL: https://github.com/apache/spark/pull/26077#discussion_r333264618

## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala

```diff
@@ -274,12 +274,19 @@ private[spark] object ThreadUtils {
    * @return new collection in which each element was given from the input collection `in` by
    *         applying the lambda function `f`.
    */
-  def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = {
+  def parmap[I, O, Col[X] <: TraversableOnce[X]]
```

Review comment: Hm, `CanBuildFrom` doesn't exist in 2.13 either, so we would have to write two versions for 2.12 and 2.13. Hence, I'm okay with just using `Seq` instead. Adding `asInstanceOf` where necessary is better than two versions of the code.
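A minimal sketch of the `Seq`-based alternative under discussion, which avoids `CanBuildFrom` and so compiles unchanged on Scala 2.12 and 2.13. The thread-naming and error-handling details of the real `ThreadUtils.parmap` are elided, so treat this as an assumption-laden outline rather than the committed code:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Map `f` over `in` with at most `maxThreads` threads. Returning Seq[O]
// keeps one signature for both Scala versions; call sites that need a more
// specific collection type convert (or cast) the result themselves.
def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = {
  val pool = Executors.newFixedThreadPool(maxThreads) // real code names threads with `prefix`
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val futures = in.map(i => Future(f(i)))
    futures.map(Await.result(_, Duration.Inf))
  } finally {
    pool.shutdownNow()
  }
}
```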
[GitHub] [spark] PingHao commented on issue #24922: [SPARK-28120][SS] Rocksdb state storage implementation
URL: https://github.com/apache/spark/pull/24922#issuecomment-540235341

> > > @PingHao - Both the issues - stuck executor, core dump - might be due to the same reasons. Will debug and fix it.
> >
> > Would you have some code snippets that can help me to reproduce the problem?

My running code is difficult to isolate or share here, so here is a new test case (based on the existing "maintenance" test case) in your RocksDbStateStoreSuite.scala. It tries to simulate parallel Spark tasks operating on each partition's state store while, at the same time, the maintenance thread tries `destoryDB`. See the code here: https://gist.github.com/PingHao/c20846542adda742f27ff00459fafe29#file-rocksdbstatestoresuite-scala-L384. I can produce the core dump on my developer machine, but I'm not sure the logic is legit anyway. You can change N (the number of partitions) and LOOPS; I recommend N = number of CPU cores.
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
URL: https://github.com/apache/spark/pull/25943#discussion_r333262965

## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala

```diff
@@ -103,6 +104,81 @@ private[spark] class AppStatusListener(
     }
   }

+  // visible for tests
+  private[spark] def recoverLiveEntities(): Unit = {
+    if (!live) {
+      kvstore.view(classOf[JobDataWrapper])
+        .asScala.filter(_.info.status == JobExecutionStatus.RUNNING)
+        .map(_.toLiveJob).foreach(job => liveJobs.put(job.jobId, job))
+
+      kvstore.view(classOf[StageDataWrapper]).asScala
+        .filter { stageData =>
+          stageData.info.status == v1.StageStatus.PENDING ||
+            stageData.info.status == v1.StageStatus.ACTIVE
+        }
+        .map { stageData =>
+          val stageId = stageData.info.stageId
+          val jobs = liveJobs.values.filter(_.stageIds.contains(stageId)).toSeq
+          stageData.toLiveStage(jobs)
+        }.foreach { stage =>
+          val stageId = stage.info.stageId
+          val stageAttempt = stage.info.attemptNumber()
+          liveStages.put((stageId, stageAttempt), stage)
```

Review comment: That would be nice; it would be beyond the scope of this PR, though.
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
HeartSaVioR commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333262518 ## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ## @@ -103,6 +104,81 @@ private[spark] class AppStatusListener( } } + // visible for tests + private[spark] def recoverLiveEntities(): Unit = { +if (!live) { + kvstore.view(classOf[JobDataWrapper]) +.asScala.filter(_.info.status == JobExecutionStatus.RUNNING) +.map(_.toLiveJob).foreach(job => liveJobs.put(job.jobId, job)) + + kvstore.view(classOf[StageDataWrapper]).asScala +.filter { stageData => + stageData.info.status == v1.StageStatus.PENDING || +stageData.info.status == v1.StageStatus.ACTIVE +} +.map { stageData => Review comment: Yes, see the foreach below. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
HeartSaVioR commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333262295 ## File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ## @@ -103,6 +104,81 @@ private[spark] class AppStatusListener( } } + // visible for tests + private[spark] def recoverLiveEntities(): Unit = { +if (!live) { Review comment: > I think our current usage of a live(true) AppStatusListener guarantees the empty KVStore at the initialization step. So I don't understand well for depends on the possibility ? That's a "context" we might eventually break, and as a side effect it would break here. I'm in favor of defensive programming: if there are preconditions, they should be mentioned somewhere or asserted. But I agree we don't have an isEmpty API for KVStore - let's leave it as it is. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
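For illustration, a toy sketch of that defensive-programming suggestion: assert the precondition up front instead of relying on the calling context (the class and message here are invented, not the real listener code):

```scala
// Invented class for illustration; the real AppStatusListener has more state.
class StatusListenerSketch(live: Boolean) {
  def recoverLiveEntities(): Unit = {
    // Make the precondition explicit so a future caller cannot silently break it.
    require(!live, "recoverLiveEntities() must only be called when replaying a finished app")
    // ... recovery work would go here ...
  }
}
```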
[GitHub] [spark] viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels URL: https://github.com/apache/spark/pull/26060#discussion_r333261862 ## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala ## @@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext { def executors(): String = { val sb = new StringBuilder val store = uiRoot.asInstanceOf[SparkUI].store -val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_") store.executorList(true).foreach { executor => - val prefix = s"metrics_${appId}_${executor.id}_executor_" - sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n") - sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n") - sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n") - sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n") - sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n") - sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n") - sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n") - sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n") - sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n") - sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n") - sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n") - sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n") - sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n") - sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n") - sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n") + val prefix = "metrics_executor_" + val labels = Seq( +"application_id" -> store.applicationInfo.id, +"application_name" -> store.applicationInfo.name, +"executor_id" -> executor.id + ).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}") + sb.append(s"${prefix}rddBlocks_Count$labels ${executor.rddBlocks}\n") Review comment: Actually, `prefix` is fixed now? So they are all the same metrics, and application id and executor id are now labels on them? Does this have a bad impact on metric usage later? Because now all applications are recorded under the same metrics. I am not sure how Prometheus processes this, but naturally I'd think Prometheus would need to search for the specified application id among the metrics of all applications. Previously you had the appId and executor id in the metric name. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
viirya commented on a change in pull request #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels URL: https://github.com/apache/spark/pull/26060#discussion_r333259908 ## File path: core/src/main/scala/org/apache/spark/status/api/v1/PrometheusResource.scala ## @@ -40,30 +40,35 @@ private[v1] class PrometheusResource extends ApiRequestContext { def executors(): String = { val sb = new StringBuilder val store = uiRoot.asInstanceOf[SparkUI].store -val appId = store.applicationInfo.id.replaceAll("[^a-zA-Z0-9]", "_") store.executorList(true).foreach { executor => - val prefix = s"metrics_${appId}_${executor.id}_executor_" - sb.append(s"${prefix}rddBlocks_Count ${executor.rddBlocks}\n") - sb.append(s"${prefix}memoryUsed_Count ${executor.memoryUsed}\n") - sb.append(s"${prefix}diskUsed_Count ${executor.diskUsed}\n") - sb.append(s"${prefix}totalCores_Count ${executor.totalCores}\n") - sb.append(s"${prefix}maxTasks_Count ${executor.maxTasks}\n") - sb.append(s"${prefix}activeTasks_Count ${executor.activeTasks}\n") - sb.append(s"${prefix}failedTasks_Count ${executor.failedTasks}\n") - sb.append(s"${prefix}completedTasks_Count ${executor.completedTasks}\n") - sb.append(s"${prefix}totalTasks_Count ${executor.totalTasks}\n") - sb.append(s"${prefix}totalDuration_Value ${executor.totalDuration}\n") - sb.append(s"${prefix}totalGCTime_Value ${executor.totalGCTime}\n") - sb.append(s"${prefix}totalInputBytes_Count ${executor.totalInputBytes}\n") - sb.append(s"${prefix}totalShuffleRead_Count ${executor.totalShuffleRead}\n") - sb.append(s"${prefix}totalShuffleWrite_Count ${executor.totalShuffleWrite}\n") - sb.append(s"${prefix}maxMemory_Count ${executor.maxMemory}\n") + val prefix = "metrics_executor_" + val labels = Seq( +"application_id" -> store.applicationInfo.id, +"application_name" -> store.applicationInfo.name, Review comment: I am not sure if application name is needed, because you have application id already. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
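For context, a standalone sketch of what the label formatting in the diffs above produces, with the string-interpolator quoting written out in full (the sample values below are invented for illustration):

```scala
// Build a Prometheus-style label set: {key="value", ...}
val labels = Seq(
  "application_id" -> "app-20191010-0001",
  "application_name" -> "MyApp",
  "executor_id" -> "1"
).map { case (k, v) => s"""$k="$v"""" }.mkString("{", ", ", "}")

// One fixed metric name plus labels, rather than identity encoded in the name:
//   metrics_executor_rddBlocks_Count{application_id="app-20191010-0001", application_name="MyApp", executor_id="1"} 42
println(s"metrics_executor_rddBlocks_Count$labels 42")
```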
[GitHub] [spark] srowen commented on a change in pull request #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
srowen commented on a change in pull request #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature URL: https://github.com/apache/spark/pull/26077#discussion_r333254145 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -274,12 +274,19 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { + def parmap[I, O, Col[X] <: TraversableOnce[X]] Review comment: Oh, I should be clear: `TraversableOnce` also doesn't exist in 2.13. I don't see the use case for returning the same collection, not in this limited utility method? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on issue #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
srowen commented on issue #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels URL: https://github.com/apache/spark/pull/26060#issuecomment-540222491 (I don't have any opinion on this - don't know Prometheus) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels
dongjoon-hyun commented on issue #26060: [SPARK-29400][CORE] Improve PrometheusResource to use labels URL: https://github.com/apache/spark/pull/26060#issuecomment-540218055 Gentle ping, @gatorsmile , @srowen , @dbtsai , @viirya , @yuecong . **Brief Summary** 1. Apache Spark has been providing a REST API. The new Prometheus servlet is only exposing the same information. If there is a scalability issue inside Apache Spark, we should fix it for both the existing REST API and this. That is orthogonal to this PR. 2. Prometheus (performance or architectural) issues are completely irrelevant to this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] abellina opened a new pull request #26078: SPARK-29151: Support fractional resources for task resource scheduling
abellina opened a new pull request #26078: SPARK-29151: Support fractional resources for task resource scheduling URL: https://github.com/apache/spark/pull/26078 ### What changes were proposed in this pull request? This PR adds the ability for tasks to request fractional resources, so that more than one task can execute per resource unit. For example, if you have 1 GPU in the executor, and the task configuration is 0.5 GPU/task, the executor can schedule two tasks to run on that 1 GPU. ### Why are the changes needed? Currently there is no good way to share a resource such that multiple tasks can run on a single unit. This allows multiple tasks to share an executor resource. ### Does this PR introduce any user-facing change? Yes: There is a configuration change where `spark.task.resource.[resource type].amount` can now be fractional. ### How was this patch tested? Unit tests, and manually in standalone mode and on YARN. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
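For illustration, a hypothetical configuration sketch of the GPU example above (the task-side key is taken from the PR description; the executor-side key and the resource name "gpu" are assumptions for this sketch):

```scala
import org.apache.spark.SparkConf

// With one GPU per executor and 0.5 GPU per task, the scheduler could place
// two concurrent tasks on each GPU.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "0.5")
```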
[GitHub] [spark] zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333249219 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: This one works for 2.12: https://github.com/apache/spark/pull/26077 Could you try it with Scala 2.13? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on issue #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature
zsxwing commented on issue #26077: [SPARK-29413][CORE][FOLLOWUP] A better parmap method signature URL: https://github.com/apache/spark/pull/26077#issuecomment-540216275 @srowen could you help test whether this compiles with Scala 2.13? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing opened a new pull request #26077: [SPARK-29413][CORE] A better parmap method signature
zsxwing opened a new pull request #26077: [SPARK-29413][CORE] A better parmap method signature URL: https://github.com/apache/spark/pull/26077 ### What changes were proposed in this pull request? This PR updates the `parmap` method signature to return the same type as the input, making the method more flexible. ### Why are the changes needed? `parmap` will be more flexible, similar to `Future.sequence`. ### Does this PR introduce any user-facing change? No. This PR updates an internal method. ### How was this patch tested? The new test added in this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
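For illustration, a minimal Scala 2.12-only sketch of a signature that returns the same collection type as its input. This is not the PR's exact code: the real parmap also takes a thread-pool prefix and max-threads and maps in parallel, both omitted here; `CanBuildFrom` is exactly what disappears in 2.13, which is the portability problem discussed in this thread.

```scala
import scala.collection.generic.CanBuildFrom

object ParmapSignatureSketch {
  def parmap[I, O, Col[X] <: TraversableOnce[X]]
      (in: Col[I])
      (f: I => O)
      (implicit cbf: CanBuildFrom[Col[I], O, Col[O]]): Col[O] = {
    val builder = cbf(in)               // builds the same collection type as `in`
    in.foreach(i => builder += f(i))    // sequential stand-in for the parallel mapping
    builder.result()
  }

  // e.g. parmap(List(1, 2, 3))(_ * 2) infers List[Int], not a generic Seq[Int]
}
```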
[GitHub] [spark] JoshRosen edited a comment on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
JoshRosen edited a comment on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540211354 > But if we would like to fix all these problems, all public APIs accepting `Encoder` will need the copy. I think that most existing uses of Encoders are de-facto thread-safe because either (a) the use occurs inside of a Spark task and the task gets its own fresh copy of the Encoder when the `Task` is deserialized or (b) the use occurs on the driver but the code calls `resolveAndBind` (which internally performs a `copy`) prior to using the Encoder. Given this, I suspect that this might be the only non-thread-safe Encoder usage in Spark (excluding code which is only used in Spark's unit tests). I don't think that we need to introduce similar copying in other public APIs. > I did some research about this and found some noticeable performance regression in our internal benchmark. What do you think about improving the performance / reducing the cost of `.copy()` by refactoring the `ExpressionEncoder` class such that (a) all of the immutable `vals` become fields of the case class, (b) the current constructor becomes a `.apply()` on the companion object and the case class constructor becomes `private`, and (c) `resolveAndBind` calls the companion object constructor instead of `copy()`? Given this, I think `copy()` could be _really_ cheap, effectively giving us a fresh copy of the internal mutable state but copying all other immutable attributes without performing any re-resolution, analysis, attribute binding, etc. If we do that, we'd be able to defensively copy at _very_ low cost (e.g. one object allocation) and then could copy-by-default and free users from having to worry about thread-safety. I think that's a potentially huge win from a developer productivity point-of-view: the cost / toil of having to worry about thread-unsafe code is a tax placed on end users and creates a developer education / training burden, so I think it's worth attempting to eliminate this entire class of pitfall. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
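For illustration, a toy sketch of the refactoring pattern proposed above, on a made-up class rather than Spark's ExpressionEncoder: the expensive derivation happens once in the companion's apply(), while the case-class copy() only re-allocates the per-instance mutable state, so a defensive copy costs roughly one object allocation.

```scala
// Invented class for illustration of the apply()/private-constructor/cheap-copy()
// pattern; the field names are stand-ins, not ExpressionEncoder's real fields.
case class CheapCopyCodec private (schemaHash: Int) {
  // Fresh, non-shared mutable buffer per instance; copy() allocates a new one.
  private val buffer = new java.lang.StringBuilder

  def encode(value: String): String = {
    buffer.setLength(0)
    buffer.append(schemaHash).append(':').append(value).toString
  }
}

object CheapCopyCodec {
  def apply(fields: Seq[String]): CheapCopyCodec = {
    val schemaHash = fields.hashCode() // stand-in for resolution/analysis work
    new CheapCopyCodec(schemaHash)     // done once; copy() never repeats it
  }
}
```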
[GitHub] [spark] BryanCutler commented on a change in pull request #26041: [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now
BryanCutler commented on a change in pull request #26041: [SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now URL: https://github.com/apache/spark/pull/26041#discussion_r333241522 ## File path: appveyor.yml ## @@ -57,6 +57,8 @@ build_script: environment: NOT_CRAN: true + # See SPARK-29378. + ARROW_PRE_0_15_IPC_FORMAT: 1 Review comment: Does setting this here carry it through to the worker? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BryanCutler commented on issue #24896: [WIP][SPARK-28006] User-defined grouped transform pandas_udf for window operations
BryanCutler commented on issue #24896: [WIP][SPARK-28006] User-defined grouped transform pandas_udf for window operations URL: https://github.com/apache/spark/pull/24896#issuecomment-540198984 @icexelloss I'm a little confused about a couple of things: 1) is this specific to operations over a window only? 2) can the same functionality be achieved with the `GROUPED_MAP` pandas_udf, only slower and more awkward because it requires the whole dataframe even though only 1 column is used? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
srowen commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333234146 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: Heh, no, that's the hard part here. If you run `./dev/change-scala-version.sh 2.13` after modifying it to accept that value, tweak a few more things, and use `-Pscala-2.13`, you can at least get to the point where there are still tons of compile errors, including this one. There are a few things we can't fix now (e.g. deps that don't support 2.13) and a few that are hard (e.g. things that will require multiple source trees). I'm fixing everything short of that that I can. If you can write anything else in 2.12 that doesn't use TraversableLike, I can try it in my hacked-up local build. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#issuecomment-540197120 Sorry, I am getting caught up on a lot of stuff here -- how is this related to https://github.com/apache/spark/pull/25577 ? They seem to be two different approaches to the same problem (though https://github.com/apache/spark/pull/25577 does a little bit more than just what is here). Just for storing live entities, this approach seems more promising to me -- using Java serialization is really bad for compatibility; any changes to classes will break deserialization. If we want the snapshot / checkpoint files to be able to entirely replace the event log files, we'll need the compatibility. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333232018 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: @srowen I can give it a try. Does our build support 2.13 now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #26057: [SPARK-29377][PYTHON][ML] Parity between Scala ML tuning and Python ML tuning
huaxingao commented on a change in pull request #26057: [SPARK-29377][PYTHON][ML] Parity between Scala ML tuning and Python ML tuning URL: https://github.com/apache/spark/pull/26057#discussion_r333230868 ## File path: python/pyspark/ml/tuning.py ## @@ -199,7 +181,25 @@ def _to_java_impl(self): return java_estimator, java_epms, java_evaluator -class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels, +class _CrossValidatorParams(_ValidatorParams): Review comment: I don't think adding a leading underscore will prevent users from extending these classes. A single leading underscore before a class name is only a weak indicator of internal usage; it doesn't enforce privacy. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#discussion_r333229971 ## File path: docs/sql-pyspark-pandas-with-arrow.md ## @@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim different than a Pandas timestamp. It is recommended to use Pandas time series functionality when working with timestamps in `pandas_udf`s to get the best performance, see [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details. + +### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x Review comment: @HyukjinKwon so you don't need this note for R because Arrow was not used in 2.4.x? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#issuecomment-540191957 Updated to add some clarification, links to the JIRA showing the error message if the env var is not set, and a link to the Arrow blog. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
srowen commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333228861 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: Sure, but, do I necessarily have to get a List from a List? Seq -> Seq seems fine for all current usages. That said, if there's any other way to write this that works in 2.12 and 2.13, definitely, open to that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
zsxwing commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540191322 > the caller-supplied `Encoder` is used in multiple threads I would say this is a user error, but I agree that it's pretty easy to overlook; I made this mistake myself when I first used `Encoder`. But if we would like to fix all these problems, all public APIs accepting `Encoder` will need the copy. I did some research about this and found some noticeable performance regression in our internal benchmark. That's why I finally just submitted https://github.com/apache/spark/pull/25209 to make it easy for users to copy an `Encoder` instead. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen edited a comment on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
JoshRosen edited a comment on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540186939 ~~I'll submit a separate patch for 2.4.x.~~ Actually, this is a clean merge with 2.4.x, so we can merge this PR to both branches. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety issue in createDataset(Seq)
JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety issue in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540186939 I'll submit a separate patch for 2.4.x. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen opened a new pull request #26076: [SPARK-29419][SQL] Fix Encoder thread-safety issue in createDataset(Seq)
JoshRosen opened a new pull request #26076: [SPARK-29419][SQL] Fix Encoder thread-safety issue in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076 ### What changes were proposed in this pull request? This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads then createDataset's usage of the encoder may lead to incorrect / corrupt results because the Encoder's internal mutable state will be updated from multiple threads. Here is an example demonstrating the problem:

```scala
import org.apache.spark.sql._

val enc = implicitly[Encoder[(Int, Int)]]

val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach { pair =>
  require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```

Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled. This bug is similar to SPARK-22355 / #19577, a similar problem in `Dataset.collect()`. The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same [code comment](https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3414) / explanation as that PR. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested manually using the example listed above. --- Thanks to @smcnamara-stripe for identifying this bug. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
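For illustration, the same failure mode and defensive-copy fix on a toy encoder with internal mutable state (this is not Spark's Encoder API; the class and method names are invented):

```scala
class ToyEncoder {
  private val buf = new java.lang.StringBuilder // shared buffer => not thread-safe
  def encode(x: Int): String = { buf.setLength(0); buf.append(x); buf.toString }
  def copy(): ToyEncoder = new ToyEncoder       // cheap: just fresh mutable state
}

val shared = new ToyEncoder
val encoded = (1 to 100).par.map { i =>
  shared.copy().encode(i) // a per-use copy keeps threads from co-mingling state
}
require(encoded.zipWithIndex.forall { case (s, i) => s == (i + 1).toString })
```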
[GitHub] [spark] viirya commented on a change in pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait process exiting before output iterator finish
viirya commented on a change in pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait process exiting before output iterator finish URL: https://github.com/apache/spark/pull/21164#discussion_r333223463 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala ## @@ -137,13 +137,12 @@ case class ScriptTransformationExec( throw writerThread.exception.get } - if (!proc.isAlive) { -val exitCode = proc.exitValue() -if (exitCode != 0) { - logError(stderrBuffer.toString) // log the stderr circular buffer - throw new SparkException(s"Subprocess exited with status $exitCode. " + -s"Error: ${stderrBuffer.toString}", cause) -} + proc.waitFor() Review comment: I tested with the provided Python script. Actually, no exception is thrown in ScriptTransformationWriterThread. And checkFailureAndPropagate is not called due to an EOFException; it is called at: https://github.com/apache/spark/blob/8a17d26784c53fb50b6373b566aab71135c8956f/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala#L163-L167 In this case, checkFailureAndPropagate can be called before proc has terminated due to possible lag, so the !proc.isAlive check fails. I think the fix is to add proc.waitFor() before the other checkFailureAndPropagate calls. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
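For illustration, a minimal standalone demonstration of why waiting first makes the exit-code check deterministic: exitValue() on a still-running process throws, and isAlive can remain true briefly after the child has done its work. This uses plain java.lang.Process, not the Spark code under review, and assumes `java` is on the PATH.

```scala
val proc = new ProcessBuilder("java", "-version").start()
proc.waitFor()                  // block until the subprocess has actually exited
val exitCode = proc.exitValue() // safe now: cannot throw IllegalThreadStateException
assert(exitCode == 0, s"Subprocess exited with status $exitCode")
```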
[GitHub] [spark] BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#discussion_r333215694 ## File path: docs/sql-pyspark-pandas-with-arrow.md ## @@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim different than a Pandas timestamp. It is recommended to use Pandas time series functionality when working with timestamps in `pandas_udf`s to get the best performance, see [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details. + +### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x + +Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in Review comment: Sure, that would be good thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
BryanCutler commented on a change in pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#discussion_r333214127 ## File path: docs/sql-pyspark-pandas-with-arrow.md ## @@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim different than a Pandas timestamp. It is recommended to use Pandas time series functionality when working with timestamps in `pandas_udf`s to get the best performance, see [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details. + +### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x + +Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in +Spark so that PySpark maintain compatibility with versions on PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format: + +``` +ARROW_PRE_0_15_IPC_FORMAT=1 Review comment: Yes I agree, I'll clarify this This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333206520 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: It seems that Apache Spark doesn't have those use cases yet. Was it for future-proofing? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333194793 ## File path: core/src/main/scala/org/apache/spark/status/api/v1/api.scala ## @@ -181,6 +182,7 @@ class RDDStorageInfo private[spark]( val partitions: Option[Seq[RDDPartitionInfo]]) class RDDDataDistribution private[spark]( +val executorId: String, Review comment: The API versioning requirements are listed here: https://spark.apache.org/docs/latest/monitoring.html#api-versioning-policy > New fields may be added to existing endpoints So it's OK to add new fields. But if we're adding things that really don't make sense as part of the API, then maybe we should just store a different object instead. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333195040 ## File path: core/src/main/scala/org/apache/spark/status/api/v1/api.scala ## @@ -192,14 +194,34 @@ class RDDDataDistribution private[spark]( @JsonDeserialize(contentAs = classOf[JLong]) val onHeapMemoryRemaining: Option[Long], @JsonDeserialize(contentAs = classOf[JLong]) -val offHeapMemoryRemaining: Option[Long]) +val offHeapMemoryRemaining: Option[Long]) { + + def toLiveRDDDistribution(executors: scala.collection.Map[String, LiveExecutor]) + : LiveRDDDistribution = { Review comment: these added methods should probably be `private[spark]` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yifeih commented on a change in pull request #25823: [SPARK-28211][Core][Shuffle] Propose Shuffle Driver Components API
yifeih commented on a change in pull request #25823: [SPARK-28211][Core][Shuffle] Propose Shuffle Driver Components API URL: https://github.com/apache/spark/pull/25823#discussion_r333181721 ## File path: core/src/main/scala/org/apache/spark/SparkContext.scala ## @@ -524,6 +529,12 @@ class SparkContext(config: SparkConf) extends Logging { executorEnvs ++= _conf.getExecutorEnv executorEnvs("SPARK_USER") = sparkUser +val redactionRegex = _conf.get(SECRET_REDACTION_PATTERN) Review comment: ah right, will fix that now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service URL: https://github.com/apache/spark/pull/26000#discussion_r333181357 ## File path: docs/running-on-yarn.md ## @@ -492,6 +492,13 @@ To use a custom metrics.properties for the application master and executors, upd If it is not set then the YARN application ID is used. + + spark.yarn.shuffle.service.name + spark_shuffle + +Name of the external shuffle service. Review comment: I think we need more description here. This isn't setting what the service runs as; you have to configure that via YARN. This is the external shuffle service name that executors use when launching containers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
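For illustration, a hypothetical executor-side configuration sketch of that point ("spark_shuffle_alt" is an invented service name; it must match whatever name the NodeManager registered under yarn.nodemanager.aux-services, which is configured on the YARN side, not by Spark):

```scala
import org.apache.spark.SparkConf

// The Spark setting only tells executors which aux-service name to request;
// the NodeManager must already run a shuffle service under that same name.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.yarn.shuffle.service.name", "spark_shuffle_alt")
```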
[GitHub] [spark] squito commented on a change in pull request #25823: [SPARK-28211][Core][Shuffle] Propose Shuffle Driver Components API
squito commented on a change in pull request #25823: [SPARK-28211][Core][Shuffle] Propose Shuffle Driver Components API URL: https://github.com/apache/spark/pull/25823#discussion_r333178354 ## File path: core/src/main/scala/org/apache/spark/SparkContext.scala ## @@ -524,6 +529,12 @@ class SparkContext(config: SparkConf) extends Logging { executorEnvs ++= _conf.getExecutorEnv executorEnvs("SPARK_USER") = sparkUser +val redactionRegex = _conf.get(SECRET_REDACTION_PATTERN) Review comment: oops, this isn't needed anymore This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
zsxwing commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333177093 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] - (in: Col[I], prefix: String, maxThreads: Int) - (f: I => O) - (implicit -cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map -cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]] // for Future.sequence - ): Col[O] = { + def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = { Review comment: This signature is much less flexible. It always returns `Seq`. It forces us to use `asInstanceOf` like this: ``` val l = List(1, 2, 3) val l2 = parmap(l, "prefix", 2)(_ * 2).asInstanceOf[List[Int]] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
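To make the trade-off concrete, here is a self-contained sketch (a simplified stand-in for the reworked method, not Spark's actual implementation; the `prefix` argument is kept only to mirror the signature) showing that callers now get back a static `Seq` whatever collection they pass in:
```
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParmapSketch {
  // Stand-in for the new signature: the concrete collection type of `in`
  // is forgotten, and the result is statically just a Seq[O].
  def parmap[I, O](in: Seq[I], prefix: String, maxThreads: Int)(f: I => O): Seq[O] = {
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.sequence(in.map(i => Future(f(i)))), Duration.Inf)
    finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    val l: List[Int] = List(1, 2, 3)
    // Recovering List[Int] requires a cast, which is only safe if the
    // implementation happens to build a List at runtime.
    val l2 = parmap(l, "demo", 2)(_ * 2).asInstanceOf[List[Int]]
    println(l2) // List(2, 4, 6)
  }
}
```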
[GitHub] [spark] holdenk commented on issue #26075: [WIP][K8S] Spark operator
holdenk commented on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540134447 Yeah, I don't know if we want this in Spark itself. The dev@ mailing list is probably the right place to raise this discussion. There are a lot of hard-coded things in here that we would need to change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun removed a comment on issue #26075: [WIP] Spark operator
dongjoon-hyun removed a comment on issue #26075: [WIP] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540130660 Hi, @jkremser . Thank you for making a PR. 1. Please follow the PR template. The template is for the community, but this PR intentionally ignores it. The screencast should be an additional attachment to it. 2. Is this the following project? Please give a clear explanation of whether this is your own contribution. - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator 3. Could you send an email to `d...@spark.apache.org` about this and about why this should be inside the `Apache Spark` project? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26075: [WIP] Spark operator
dongjoon-hyun commented on issue #26075: [WIP] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540130738 cc @holdenk This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26075: [WIP] Spark operator
dongjoon-hyun commented on issue #26075: [WIP] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540130660 Hi, @jkremser . Thank you for making a PR. 1. Please follow the PR template. The template is for the community, but this PR intentionally ignores it. The screencast should be an additional attachment to it. 2. Is this the following project? Please give a clear explanation of whether this is your own contribution. - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator 3. Could you send an email to `d...@spark.apache.org` about this and about why this should be inside the `Apache Spark` project? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #26073: [SPARK-29416][CORE][ML][SQL][MESOS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
srowen commented on a change in pull request #26073: [SPARK-29416][CORE][ML][SQL][MESOS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13) URL: https://github.com/apache/spark/pull/26073#discussion_r333165360 ## File path: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ## @@ -189,6 +189,6 @@ object ChiSqSelectorSuite extends SparkFunSuite { } def checkEqual(a: ChiSqSelectorModel, b: ChiSqSelectorModel): Unit = { -assert(a.selectedFeatures.deep == b.selectedFeatures.deep) +assert(a.selectedFeatures.sameElements(b.selectedFeatures) Review comment: Ugh I messed up something translating between my two builds. Will fix This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
dongjoon-hyun commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 URL: https://github.com/apache/spark/pull/26070#issuecomment-540128065 The above comment on `GitHub Action` is an explanation of the current `GitHub Action` outage. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun edited a comment on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
dongjoon-hyun edited a comment on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 URL: https://github.com/apache/spark/pull/26070#issuecomment-540127530 @srowen . `GitHub Action` is just optional infra. I ran UTs locally (core/ml/stream/catalyst and some `sql`). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
dongjoon-hyun commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 URL: https://github.com/apache/spark/pull/26070#issuecomment-540127530 @srowen . `GitHub Action` is just optional infra. I ran UTs locally (core/ml/stream/catalyst and some `sql`). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jkremser opened a new pull request #26075: [WIP] Spark operator
jkremser opened a new pull request #26075: [WIP] Spark operator URL: https://github.com/apache/spark/pull/26075 ![ascii](https://user-images.githubusercontent.com/535866/66508719-f3271500-ead1-11e9-94e1-e3100e5f1d39.gif) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26073: [SPARK-29416][CORE][ML][SQL][MESOS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
dongjoon-hyun commented on a change in pull request #26073: [SPARK-29416][CORE][ML][SQL][MESOS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13) URL: https://github.com/apache/spark/pull/26073#discussion_r333160247 ## File path: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ## @@ -189,6 +189,6 @@ object ChiSqSelectorSuite extends SparkFunSuite { } def checkEqual(a: ChiSqSelectorModel, b: ChiSqSelectorModel): Unit = { -assert(a.selectedFeatures.deep == b.selectedFeatures.deep) +assert(a.selectedFeatures.sameElements(b.selectedFeatures) Review comment: Hi, @srowen . This is a compile error~ ``` [error] /Users/dongjoon/PRS/PR-26073/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala:193: ')' expected but '}' found. [error] } [error] ^ [error] one error found ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
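For reference, the compile error is just a missing closing parenthesis; the corrected assertion reads:
```
assert(a.selectedFeatures.sameElements(b.selectedFeatures))
```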
[GitHub] [spark] viirya commented on a change in pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait process exiting before output iterator finish
viirya commented on a change in pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait process exiting before output iterator finish URL: https://github.com/apache/spark/pull/21164#discussion_r333158224 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala ## @@ -174,7 +173,6 @@ case class ScriptTransformationExec( // Ideally the proc should *not* be alive at this point but // there can be a lag between EOF being written out and the process // being terminated. So explicitly waiting for the process to be done. -proc.waitFor() Review comment: I do not think we should remove proc.waitFor() here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
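A minimal sketch of the race that the in-code comment describes, using a Unix `cat` child process as a stand-in for the transformation script: there can be a lag between stdin EOF and the child actually exiting, and `proc.waitFor()` is what bridges it.
```
object WaitForSketch {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("cat").start()
    proc.getOutputStream.close()   // send EOF; the child may keep running briefly
    val exitCode = proc.waitFor()  // block until the child has really terminated
    assert(exitCode == 0, s"child exited with $exitCode")
  }
}
```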
[GitHub] [spark] dongjoon-hyun closed pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
dongjoon-hyun closed pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
dongjoon-hyun commented on issue #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#issuecomment-540114875 Merged to master. Thank you, @srowen and @MaxGekk . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333141589 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] Review comment: I agree with @srowen . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13
srowen commented on issue #26070: [SPARK-29411][CORE][ML][SQL][DSTREAM] Replace use of Unit object with () for Scala 2.13 URL: https://github.com/apache/spark/pull/26070#issuecomment-540114299 Jenkins tests didn't run. While I'm quite sure this isn't even a change except at the source level, do you think it's OK to go on the compile result from GitHub Actions? Could be reasonable in a case like this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13
dongjoon-hyun commented on a change in pull request #26072: [SPARK-29413][CORE] Rewrite ThreadUtils.parmap to avoid TraversableLike, gone in 2.13 URL: https://github.com/apache/spark/pull/26072#discussion_r333141589 ## File path: core/src/main/scala/org/apache/spark/util/ThreadUtils.scala ## @@ -275,13 +274,7 @@ private[spark] object ThreadUtils { * @return new collection in which each element was given from the input collection `in` by * applying the lambda function `f`. */ - def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]] Review comment: I agree with @srowen, but shall we ask @zsxwing to make sure? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org