[GitHub] [spark] ueshin opened a new pull request #34685: [SPARK-37443][PYTHON] Provide a profiler for Python/Pandas UDFs
ueshin opened a new pull request #34685: URL: https://github.com/apache/spark/pull/34685 ### What changes were proposed in this pull request? ### Why are the changes needed? Currently a profiler is provided for only `RDD` operations, but providing a profiler for Python/Pandas UDFs would be great. ### Does this PR introduce _any_ user-facing change? Yes, the profiler will work and show the results when a Spark conf `spark.python.profile` is set to `true`. ### How was this patch tested? Added some tests and manually checked the behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
SparkQA removed a comment on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975825876 **[Test build #145517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)** for PR 34611 at commit [`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
SparkQA commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-976006138 **[Test build #145517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)** for PR 34611 at commit [`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
SparkQA commented on pull request #34596: URL: https://github.com/apache/spark/pull/34596#issuecomment-975988587 **[Test build #145521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145521/testReport)** for PR 34596 at commit [`309643b`](https://github.com/apache/spark/commit/309643bea16446d58c9e86e7e0e60988c33934fb). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
sathiyapk commented on a change in pull request #34593: URL: https://github.com/apache/spark/pull/34593#discussion_r754677290 ## File path: sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala ## @@ -319,6 +370,18 @@ class MathFunctionsSuite extends QueryTest with SharedSparkSession { df.withColumn("value_brounded", bround('value)), Seq(Row(BigDecimal("5.9"), BigDecimal("6"))) ) +checkAnswer( + df.withColumn("value_rounded", round('value, 0, "up")), + Seq(Row(BigDecimal("5.9"), BigDecimal("6"))) +) +checkAnswer( + df.withColumn("value_brounded", round('value, 0, "down")), Review comment: yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
sathiyapk commented on a change in pull request #34593: URL: https://github.com/apache/spark/pull/34593#discussion_r754675110 ## File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ## @@ -2197,13 +2197,23 @@ object functions { def round(e: Column): Column = round(e, 0) /** - * Round the value of `e` to `scale` decimal places with HALF_UP round mode - * if `scale` is greater than or equal to 0 or at integral part when `scale` is less than 0. + * Returns the value of the column `e` rounded to 0 decimal places with HALF_UP round mode. * * @group math_funcs * @since 1.5.0 */ - def round(e: Column, scale: Int): Column = withExpr { Round(e.expr, Literal(scale)) } + def round(e: Column, scale: Int): Column = round(e, scale, "half_up") + + /** + * Round the value of `e` to `scale` decimal places with given round mode, default: HALF_UP + * if `scale` is greater than or equal to 0 or at integral part when `scale` is less than 0. + * + * @group math_funcs + * @since 3.3.0 + */ + def round(e: Column, scale: Int, mode: String): Column = withExpr { Review comment: I see this function mostly as a companion to the previously existing function than a completely new function. Otherwise it would be a bit weird the round function with 2 arguments takes scale parameter in `Int` type but the round function with 3 arguments takes the scale parameter in `Column` type. What do you think ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on pull request #34596: URL: https://github.com/apache/spark/pull/34596#issuecomment-975965380 jenkins retest this please -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
sathiyapk commented on a change in pull request #34593: URL: https://github.com/apache/spark/pull/34593#discussion_r754671246 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala ## @@ -1500,26 +1514,37 @@ abstract class RoundBase(child: Expression, scale: Expression, } /** - * Round an expression to d decimal places using HALF_UP rounding mode. - * round(2.5) == 3.0, round(3.5) == 4.0. + * Round an expression to d decimal places using given rounding mode, default rounding mode HALF_UP. + * round(2.5) == 3.0, round(3.5) == 4.0, + * round(3.6, 0, 'half_down') == 4.0, round(3.6, 0, 'down') == 3.0. */ // scalastyle:off line.size.limit @ExpressionDescription( - usage = "_FUNC_(expr, d) - Returns `expr` rounded to `d` decimal places using HALF_UP rounding mode.", + usage = "_FUNC_(expr, d) - Returns `expr` rounded to `d` decimal places using given rounding mode.", examples = """ Examples: > SELECT _FUNC_(2.5, 0); 3 + > SELECT _FUNC_(2.6, 0, 'down'); Review comment: yes sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
sathiyapk commented on a change in pull request #34593: URL: https://github.com/apache/spark/pull/34593#discussion_r754670983 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ## @@ -413,6 +413,18 @@ final class Decimal extends Ordered[Decimal] with Serializable { if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 0) { longVal += (if (droppedDigits < 0) -1L else 1L) } + case ROUND_UP => Review comment: Hi @sarutak thanks for your review. Out of curiosity, may i know why it will be a bug if `changePrecision` takes `ROUND_UP`, `ROUND_DOWN` or `ROUND_HALF_DOWN` ? So we could do something like this : ``` case ROUND_UP | ROUND_DOWN | ROUND_HALF_DOWN | _ => throw QueryExecutionErrors.unsupportedRoundingMode(roundMode) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975950959 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49991/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
AmplabJenkins removed a comment on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975950961 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49992/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975950959 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49991/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
AmplabJenkins commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975950961 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49992/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754647909 ## File path: python/pyspark/ml/common.py ## @@ -15,11 +15,15 @@ # limitations under the License. # +from typing import Any, Callable +from pyspark.ml._typing import C, JavaObjectOrPickleDump Review comment: In general, whatever goes into `TYPE_CHECKING` is not imported during normal execution. So as is, all the names we use would be undefined when these scripts are imported. [PEP 563](https://www.python.org/dev/peps/pep-0563/) introduces a concept of postponed evaluation, but were not ready to go there yet (for starters, we still didn't formally drop 3.6 support ‒ I am working on cleaning the code, then we have some components that might require code changes). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975932293 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
SparkQA commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975931222 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49992/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754642606 ## File path: python/pyspark/ml/common.py ## @@ -15,11 +15,15 @@ # limitations under the License. # +from typing import Any, Callable +from pyspark.ml._typing import C, JavaObjectOrPickleDump Review comment: Ah OK. The `TYPE_CHECKING` block addresses the namespace pollution issue, I suppose. But can you elaborate on why the type needs to be quoted? I understand that's for when the type is not known at that point in time (e.g. a self-referential type), but that isn't the case here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754637469 ## File path: python/pyspark/ml/common.py ## @@ -15,11 +15,15 @@ # limitations under the License. # +from typing import Any, Callable +from pyspark.ml._typing import C, JavaObjectOrPickleDump Review comment: This import should happen in TYPE_CHECKING block ```python from typing import TYPE_CHECKING if TYPE_CHECKING: from pyspark.ml._typing import C, JavaObjectOrPickleDump ``` Consequently, `JavaObjectOrPickleDump` and `C` have to be quoted when used (`"JavaObjectOrPickleDump"`), i.e. ```python def _java2py(sc: SparkContext, r: "JavaObjectOrPickleDump", encoding: str = "bytes") -> Any: ... ``` That's because objects in stubs have no runtime equivalents. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754637469 ## File path: python/pyspark/ml/common.py ## @@ -15,11 +15,15 @@ # limitations under the License. # +from typing import Any, Callable +from pyspark.ml._typing import C, JavaObjectOrPickleDump Review comment: This import should happen in TYPE_CHECKING block ```python from typing import TYPE_CHECKING if TYPE_CHECKING: from pyspark.ml._typing import C, JavaObjectOrPickleDump ``` Consequently, `JavaObjectOrPickleDump` and `C` have to be quoted when used (`"JavaObjectOrPickleDump"`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins removed a comment on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975908416 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49989/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins removed a comment on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975908414 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145515/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975908413 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975908414 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145515/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975908416 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49989/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975908413 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975902438 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
SparkQA commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975900558 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49992/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA removed a comment on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975667590 **[Test build #145515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145515/testReport)** for PR 34668 at commit [`b703492`](https://github.com/apache/spark/commit/b703492482c67cd6616e617cbdd0155c2a9aff27). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975894629 **[Test build #145515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145515/testReport)** for PR 34668 at commit [`b703492`](https://github.com/apache/spark/commit/b703492482c67cd6616e617cbdd0155c2a9aff27). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup
zero323 commented on pull request #34655: URL: https://github.com/apache/spark/pull/34655#issuecomment-975894324 > @zero323 - Is the cleanup of pycodestyle configs along the lines of what you were expecting? Looks sensible, but I wouldn't mind more eyes on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
SparkQA commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975893441 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49989/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975893327 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975886499 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
AmplabJenkins removed a comment on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975877377 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145520/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
SparkQA removed a comment on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975866493 **[Test build #145520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)** for PR 34636 at commit [`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
AmplabJenkins commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975877377 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145520/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
SparkQA commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975877062 **[Test build #145520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)** for PR 34636 at commit [`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975875789 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145519/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975866358 **[Test build #145519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)** for PR 34671 at commit [`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975875789 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145519/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975875557 **[Test build #145519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)** for PR 34671 at commit [`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests
SparkQA commented on pull request #34636: URL: https://github.com/apache/spark/pull/34636#issuecomment-975866493 **[Test build #145520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)** for PR 34636 at commit [`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975866358 **[Test build #145519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)** for PR 34671 at commit [`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975865837 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145518/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34684: [SPARK-37442] - InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
AmplabJenkins commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-975866093 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975865837 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145518/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
SparkQA commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975862121 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49989/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975861755 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975857224 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael commented on pull request #34684: [SPARK-37442] - InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-975851968 @cloud-fan can you take a look when you have some time? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup
nchammas commented on pull request #34655: URL: https://github.com/apache/spark/pull/34655#issuecomment-975848602 @zero323 - Is the cleanup of pycodestyle configs along the lines of what you were expecting? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…
ChenMichael edited a comment on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896 I'm not sure if this is the best way to solve this bug so I will detail the other solution I could come up with and then detail the possible problems I could see. 1. Materialize cached rdd immediately (.count) - `buildBuffers` becomes a blocking call. Moves cost of materialization from execution time to planning. - Maybe there is some code that assumes the cached rdd is not materialized until execution. - If someone obtained the cached rdd, but never executes it, then with the new changes there is wasted effort materializing the rdd. From the code, it seems like obtaining the rdd is always followed up by submitting the job to DAGScheduler so I don't know why this would happen. 2. Never use accumulator stats for InMemoryRelation with AQE on - These accumulator stats should be more accurate than the estimated stats, so there can be missed opportunities for optimization -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael commented on pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…
ChenMichael commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896 I'm not sure if this is the best way to solve this bug so I will detail the other solution I could come up with and then compare the possible problems with them. 1. Materialize cached rdd immediately (.count) - `buildBuffers` becomes a blocking call. Moves cost of materialization from execution time to planning. - Maybe there is some code that assumes the cached rdd is not materialized until execution. - If someone obtained the cached rdd, but never executes it, then with the new changes there is wasted effort materializing the rdd. From the code, it seems like obtaining the rdd is always followed up by submitting the job to DAGScheduler so I don't know why this would happen. 2. Never use accumulator stats for InMemoryRelation with AQE on - These accumulator stats should be more accurate than the estimated stats, so there can be missed opportunities for optimization -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754570168 ## File path: python/pyspark/mllib/common.py ## @@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"): return r -def callJavaFunc(sc, func, *args): +def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) -> JavaObjectType: Review comment: Never mind, just saw your other comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754568896 ## File path: python/pyspark/mllib/common.py ## @@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"): return r -def callJavaFunc(sc, func, *args): +def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) -> JavaObjectType: Review comment: Not sure of a better alternative to `JavaObjectType` (which I agree is not a great name). Maybe `JavaSerializedType`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975828766 **[Test build #145518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)** for PR 34671 at commit [`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975839386 **[Test build #145518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)** for PR 34671 at commit [`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754565478 ## File path: python/pyspark/ml/_typing.pyi ## @@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[ Literal["microRecall"], Literal["microF1Measure"], ] -ClusteringEvaluatorMetricType = Union[Literal["silhouette"]] +ClusteringEvaluatorMetricType = Literal["silhouette"] Review comment: I am OK with this either way, just giving the context. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael opened a new pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…
ChenMichael opened a new pull request #34684: URL: https://github.com/apache/spark/pull/34684 ### What changes were proposed in this pull request? Immediately materialize underlying rdd cache (using .count) for an InMemoryRelation when `buildBuffers` is called. ### Why are the changes needed? Currently, when `CachedRDDBuilder.buildBuffers` is called, `InMemoryRelation.computeStats` will try to read the accumulators to determine what the relation size is. However, the accumulators are not actually accurate until the cachedRDD is executed and finishes. While this has not happened, the accumulators will report a range from 0 bytes to the accumulator value when the cachedRDD finishes. In AQE, join planning can happen during this time and, if it reads the size as 0 bytes, will likely plan a broadcast join mistakenly believing the build side is very small. If the InMemoryRelation is actually very large in size, then this will cause many issues during execution such as job failure due to broadcasting over 8GB. ### Does this PR introduce _any_ user-facing change? Yes. Before, cache materialization doesn't happen until the job starts to run. Now, it happens when trying to get the rdd representing an InMemoryRelation. ### How was this patch tested? Unit tests Tests added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754564756 ## File path: python/pyspark/mllib/_typing.pyi ## @@ -16,8 +16,11 @@ # specific language governing permissions and limitations # under the License. -from typing import List, Tuple, Union +from typing import List, Tuple, TypeVar, Union from pyspark.mllib.linalg import Vector from numpy import ndarray # noqa: F401 +from py4j.java_gateway import JavaObject VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]] +C = TypeVar("C", bound=type) +JavaObjectType = Union[JavaObject, bytearray, bytes] Review comment: On a second thought, let's just use something more descriptive ‒ `JavaObjectOrPickleDump` or something around these lines. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754563130 ## File path: python/pyspark/mllib/common.py ## @@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"): return r -def callJavaFunc(sc, func, *args): +def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) -> JavaObjectType: Review comment: If you want to keep the alias (https://github.com/apache/spark/pull/34671/files#r754559725, maybe with more descriptive name) we could use it here as well `func: Callable[..., JavaObjectType]`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754563130 ## File path: python/pyspark/mllib/common.py ## @@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"): return r -def callJavaFunc(sc, func, *args): +def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) -> JavaObjectType: Review comment: If you want to keep the alias (https://github.com/apache/spark/pull/34671/files#r754559725, maybe with more descriptive name) we could have t used here `func: Callable[..., JavaObjectType]`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754560349 ## File path: python/pyspark/mllib/_typing.pyi ## @@ -16,8 +16,11 @@ # specific language governing permissions and limitations # under the License. -from typing import List, Tuple, Union +from typing import List, Tuple, TypeVar, Union from pyspark.mllib.linalg import Vector from numpy import ndarray # noqa: F401 +from py4j.java_gateway import JavaObject VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]] +C = TypeVar("C", bound=type) +JavaObjectType = Union[JavaObject, bytearray, bytes] Review comment: OK, will do. I originally had it inlined but created the alias since it's used in 4 different places in `mllib/common.py`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754559725 ## File path: python/pyspark/mllib/_typing.pyi ## @@ -16,8 +16,11 @@ # specific language governing permissions and limitations # under the License. -from typing import List, Tuple, Union +from typing import List, Tuple, TypeVar, Union from pyspark.mllib.linalg import Vector from numpy import ndarray # noqa: F401 +from py4j.java_gateway import JavaObject VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]] +C = TypeVar("C", bound=type) +JavaObjectType = Union[JavaObject, bytearray, bytes] Review comment: I think it makes more sense to inline this `Union` directly where it is used ‒ it is not shared, and alias doesn't really help to understand meaning of the unioned types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754558752 ## File path: python/pyspark/ml/_typing.pyi ## @@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[ Literal["microRecall"], Literal["microF1Measure"], ] -ClusteringEvaluatorMetricType = Union[Literal["silhouette"]] +ClusteringEvaluatorMetricType = Literal["silhouette"] Review comment: Shall I revert this change, then? My editor was complaining about the `Union` since it only had one member (and I agree with my editor in this case, but it's not a big deal to me). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975828766 **[Test build #145518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)** for PR 34671 at commit [`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
xinrong-databricks commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754556211 ## File path: python/pyspark/ml/common.py ## @@ -53,24 +57,24 @@ def _new_smart_decode(obj): # this will call the ML version of pythonToJava() -def _to_java_object_rdd(rdd): +def _to_java_object_rdd(rdd: RDD) -> JavaObject: """Return an JavaRDD of Object by unpickling It will convert each Python object into Java object by Pickle, whenever the RDD is serialized in batch or not. """ -rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) -return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) +rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) # type: ignore[attr-defined] +return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) # type: ignore[attr-defined] Review comment: FYI https://github.com/apache/spark/pull/34466 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
SparkQA commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-975825876 **[Test build #145517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)** for PR 34611 at commit [`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975825755 > The number of `ignore[attr-defined]` hints required seems a little wrong. But I suppose addressing that would require changes to `SparkContext`, which is out of scope for this PR. That's not optimal, but expected (see https://github.com/apache/spark/pull/34680#issuecomment-975397090). If you encounter case where there is no ongoing migration work and you can avoid ignores, it should be OK to extend the stub. Otherwise, we'll do another pass (ignores on `_jvm` are likely to stay, even if different error code, because we cannot lurk into JVM at this stage). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
xinrong-databricks commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754555187 ## File path: python/pyspark/ml/common.py ## @@ -53,24 +57,24 @@ def _new_smart_decode(obj): # this will call the ML version of pythonToJava() -def _to_java_object_rdd(rdd): +def _to_java_object_rdd(rdd: RDD) -> JavaObject: """Return an JavaRDD of Object by unpickling It will convert each Python object into Java object by Pickle, whenever the RDD is serialized in batch or not. """ -rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) -return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) +rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) # type: ignore[attr-defined] +return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) # type: ignore[attr-defined] Review comment: I think it is fine to leave `type: ignore[attr-defined]` here. We may come back to these after inlining type hints for SparkContext for example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
xinrong-databricks commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754554096 ## File path: python/pyspark/ml/common.py ## @@ -53,24 +57,24 @@ def _new_smart_decode(obj): # this will call the ML version of pythonToJava() -def _to_java_object_rdd(rdd): +def _to_java_object_rdd(rdd: RDD) -> JavaObject: """Return an JavaRDD of Object by unpickling It will convert each Python object into Java object by Pickle, whenever the RDD is serialized in batch or not. """ -rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) -return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) +rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) # type: ignore[attr-defined] +return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) # type: ignore[attr-defined] Review comment: We didn't annotate types of `_reserialize` or`_jvm` in https://github.com/apache/spark/blob/master/python/pyspark/rdd.pyi or https://github.com/apache/spark/blob/master/python/pyspark/context.pyi. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975823020 **[Test build #145516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)** for PR 34671 at commit [`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975823967 **[Test build #145516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)** for PR 34671 at commit [`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975823990 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145516/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975823990 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145516/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
kazuyukitanimura commented on a change in pull request #34611: URL: https://github.com/apache/spark/pull/34611#discussion_r754552871 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ## @@ -53,19 +53,47 @@ public void skip() { throw new UnsupportedOperationException(); } + private void updateCurrentByte() { +try { + currentByte = (byte) in.read(); +} catch (IOException e) { + throw new ParquetDecodingException("Failed to read a byte", e); +} + } + @Override public final void readBooleans(int total, WritableColumnVector c, int rowId) { -// TODO: properly vectorize this -for (int i = 0; i < total; i++) { - c.putBoolean(rowId + i, readBoolean()); +int i = 0; +if (bitOffset > 0) { + i = Math.min(8 - bitOffset, total); Review comment: I don't think we need to update the `total` here. Only the `i` gets updated. Any particular examples that you are worried? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975823020 **[Test build #145516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)** for PR 34671 at commit [`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
kazuyukitanimura commented on a change in pull request #34611: URL: https://github.com/apache/spark/pull/34611#discussion_r754552760 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ## @@ -53,19 +53,47 @@ public void skip() { throw new UnsupportedOperationException(); } + private void updateCurrentByte() { Review comment: Thanks, I would keep `updateCurrentByte` for now. Because public methods are using `read***` prefix, I chose `updateCurrentByte` to distinguish this private method. ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ## @@ -470,6 +493,18 @@ public final int appendBooleans(int count, boolean v) { return result; } + /** + * Append bits from [src[offset], src[offset + count]) + * src must contain bit-packed 8 Booleans in the byte. Review comment: Updated ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala ## @@ -130,6 +133,89 @@ class ColumnarBatchSuite extends SparkFunSuite { } } + testVector("Boolean APIs", 1024, BooleanType) { +column => + val reference = mutable.ArrayBuffer.empty[Boolean] + + var values = Array(true, false, true, false, false) + var bits = values.foldRight(0)((b, i) => i << 1 | (if (b) 1 else 0)).toByte + column.appendBooleans(2, bits, 0) + reference ++= values.slice(0, 2) + + column.appendBooleans(3, bits, 2) + reference ++= values.slice(2, 5) + + column.appendBooleans(6, true) + reference ++= Array.fill(6)(true) + + column.appendBoolean(false) + reference += false + + var idx = column.elementsAppended + + values = Array(true, true, false, true, false, true, false, true) + bits = values.foldRight(0)((b, i) => i << 1 | (if (b) 1 else 0)).toByte + column.putBooleans(idx, 2, bits, 0) + reference ++= values.slice(0, 2) + idx += 2 + + column.putBooleans(idx, 3, bits, 2) + reference ++= values.slice(2, 5) + idx += 3 + + column.putBooleans(idx, bits) + reference ++= values + idx += 8 + + column.putBoolean(idx, false) + reference += false + idx += 1 + + column.putBooleans(idx, 3, true) + reference ++= Array.fill(3)(true) + idx += 3 + + implicit def intToByte(i: Int): Byte = i.toByte + val buf = ByteBuffer.wrap(Array(0x33, 0x5A, 0xA5, 0xCC, 0x0F, 0xF0, 0xEE)) + val reader = new VectorizedPlainValuesReader() + reader.initFromPage(0, ByteBufferInputStream.wrap(buf)) + column.putBoolean(idx, reader.readBoolean) // bit index 0 + reference += true + idx += 1 + + reader.skipBooleans(1) + + column.putBoolean(idx, reader.readBoolean) // bit index 2 + reference += false + idx += 1 + + reader.skipBooleans(5) + + column.putBoolean(idx, reader.readBoolean) // bit index 8 + reference += false + idx += 1 + + reader.skipBooleans(8) Review comment: Updated ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ## @@ -53,19 +53,47 @@ public void skip() { throw new UnsupportedOperationException(); } + private void updateCurrentByte() { +try { + currentByte = (byte) in.read(); +} catch (IOException e) { + throw new ParquetDecodingException("Failed to read a byte", e); +} + } + @Override public final void readBooleans(int total, WritableColumnVector c, int rowId) { -// TODO: properly vectorize this -for (int i = 0; i < total; i++) { - c.putBoolean(rowId + i, readBoolean()); +int i = 0; +if (bitOffset > 0) { + i = Math.min(8 - bitOffset, total); Review comment: I don't think we need to update the `total` here. Only the `I` gets updated. Any particular examples that you are worried? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
kazuyukitanimura commented on a change in pull request #34611: URL: https://github.com/apache/spark/pull/34611#discussion_r754552682 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ## @@ -53,19 +53,47 @@ public void skip() { throw new UnsupportedOperationException(); } + private void updateCurrentByte() { +try { + currentByte = (byte) in.read(); +} catch (IOException e) { + throw new ParquetDecodingException("Failed to read a byte", e); +} + } + @Override public final void readBooleans(int total, WritableColumnVector c, int rowId) { -// TODO: properly vectorize this -for (int i = 0; i < total; i++) { - c.putBoolean(rowId + i, readBoolean()); +int i = 0; +if (bitOffset > 0) { + i = Math.min(8 - bitOffset, total); + c.putBooleans(rowId, i, currentByte, bitOffset); + bitOffset = (bitOffset + i) & 7; +} +for (; i + 7 < total; i += 8) { + updateCurrentByte(); + c.putBooleans(rowId + i, currentByte); +} +if (i < total) { + updateCurrentByte(); + bitOffset = total - i; + c.putBooleans(rowId + i, bitOffset, currentByte, 0); } } @Override public final void skipBooleans(int total) { -// TODO: properly vectorize this -for (int i = 0; i < total; i++) { - readBoolean(); +// using >>3 instead of /8 below since Java division rounds towards zero i.e. (-1)/8=0 Review comment: `(total - (8 - bitOffset))` can be negative. E.g. `total=1`, `bitOffset=1`, for this combination, `(total - (8 - bitOffset))=(1-(8-1))=-6`. And `(-7)>>3=(-1)` is what we need. So the comment still holds. This example is covered in `ColumnarBatchSuite.scala` Added `skipBooleans(0)` test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754552177 ## File path: python/pyspark/ml/_typing.pyi ## @@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[ Literal["microRecall"], Literal["microF1Measure"], ] -ClusteringEvaluatorMetricType = Union[Literal["silhouette"]] +ClusteringEvaluatorMetricType = Literal["silhouette"] Review comment: This was intentional to make it consistent with the remaining `*EvaluatorMetricTypes` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754551374 ## File path: python/pyspark/ml/common.py ## @@ -53,24 +57,24 @@ def _new_smart_decode(obj): # this will call the ML version of pythonToJava() -def _to_java_object_rdd(rdd): +def _to_java_object_rdd(rdd: RDD) -> JavaObject: """Return an JavaRDD of Object by unpickling It will convert each Python object into Java object by Pickle, whenever the RDD is serialized in batch or not. """ -rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) -return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) +rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) # type: ignore[attr-defined] +return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) # type: ignore[attr-defined] Review comment: At the moment, both `SparkContext` and `RDD` don't have inline annotations and stubs, intentionally, cover only public API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 edited a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
zero323 edited a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-974525917 > Just to be clear, are you saying I should split this PR into ml/common.py vs. mllib/common.py? > > And then have an umbrella ticket for adding type annotations to all of ml/, and another one for mllib/? For the context ‒ we're in the middle of the process of inlining hints from stubs to inline hints. At the moment we have two umbrella tickets ‒ SPARK-36845 and SPARK-37094 for SQL and core respectively. We should follow this convention for ml and mllib as well. It should be OK to have two tickets (`ml.common` and `mllib.common`) and resolve both in this PR, since you've already started working on that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975814940 The number of `ignore[attr-defined]` hints required seems a little wrong. But I suppose addressing that would require changes to `SparkContext`, which is out of scope for this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yliou commented on pull request #34622: SPARK-37340 Display StageIds in Operators for SQL UI
yliou commented on pull request #34622: URL: https://github.com/apache/spark/pull/34622#issuecomment-975808167 cc @tgravescs would this feature be of interest? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yliou commented on pull request #34637: Spark-37349 add SQL Rest API parsing logic
yliou commented on pull request #34637: URL: https://github.com/apache/spark/pull/34637#issuecomment-975806584 cc @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
nchammas commented on a change in pull request #34671: URL: https://github.com/apache/spark/pull/34671#discussion_r754533879 ## File path: python/pyspark/ml/common.py ## @@ -53,24 +57,24 @@ def _new_smart_decode(obj): # this will call the ML version of pythonToJava() -def _to_java_object_rdd(rdd): +def _to_java_object_rdd(rdd: RDD) -> JavaObject: """Return an JavaRDD of Object by unpickling It will convert each Python object into Java object by Pickle, whenever the RDD is serialized in batch or not. """ -rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) -return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) +rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer())) # type: ignore[attr-defined] +return rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True) # type: ignore[attr-defined] Review comment: I don't get exactly why mypy doesn't recognize `_reserialize` and `_jvm` as attributes. In the case of `_reserialize`, is it the leading underscore making it a "private" method? In the case of `_jvm`, is it that it's a class attribute? ## File path: python/pyspark/ml/_typing.pyi ## @@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[ Literal["microRecall"], Literal["microF1Measure"], ] -ClusteringEvaluatorMetricType = Union[Literal["silhouette"]] +ClusteringEvaluatorMetricType = Literal["silhouette"] Review comment: Not sure what the `Union` here was for. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
AmplabJenkins removed a comment on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975786325 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145513/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
SparkQA removed a comment on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975667360 **[Test build #145513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145513/testReport)** for PR 34681 at commit [`65a8758`](https://github.com/apache/spark/commit/65a875846d4f3e5a8fb4978c3d82a2fd1ba3a720). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
AmplabJenkins commented on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975786325 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145513/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
SparkQA commented on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975786004 **[Test build #145513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145513/testReport)** for PR 34681 at commit [`65a8758`](https://github.com/apache/spark/commit/65a875846d4f3e5a8fb4978c3d82a2fd1ba3a720). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975775073 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49985/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins removed a comment on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975775068 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49986/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
AmplabJenkins commented on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975775070 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49987/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
AmplabJenkins removed a comment on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975775070 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49987/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975775068 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49986/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975775073 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49985/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation
SparkQA commented on pull request #34681: URL: https://github.com/apache/spark/pull/34681#issuecomment-975766915 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49987/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-975760231 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49985/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-975753614 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49986/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API
gaborgsomogyi commented on a change in pull request #29024: URL: https://github.com/apache/spark/pull/29024#discussion_r754487136 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala ## @@ -18,60 +18,45 @@ package org.apache.spark.sql.execution.datasources.jdbc.connection import java.sql.{Connection, Driver} -import java.util.Properties +import java.util.ServiceLoader -import org.apache.spark.internal.Logging -import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions - -/** - * Connection provider which opens connection toward various databases (database specific instance - * needed). If kerberos authentication required then it's the provider's responsibility to set all - * the parameters. - */ -private[jdbc] trait ConnectionProvider { - /** - * Additional properties for data connection (Data source property takes precedence). - */ - def getAdditionalProperties(): Properties = new Properties() +import scala.collection.mutable - /** - * Opens connection toward the database. - */ - def getConnection(): Connection -} +import org.apache.spark.internal.Logging +import org.apache.spark.security.SecurityConfigurationLock +import org.apache.spark.sql.jdbc.JdbcConnectionProvider +import org.apache.spark.util.Utils private[jdbc] object ConnectionProvider extends Logging { - def create(driver: Driver, options: JDBCOptions): ConnectionProvider = { -if (options.keytab == null || options.principal == null) { - logDebug("No authentication configuration found, using basic connection provider") - new BasicConnectionProvider(driver, options) -} else { - logDebug("Authentication configuration found, using database specific connection provider") - options.driverClass match { -case PostgresConnectionProvider.driverClass => - logDebug("Postgres connection provider found") - new PostgresConnectionProvider(driver, options) - -case MariaDBConnectionProvider.driverClass => - logDebug("MariaDB connection provider found") - new MariaDBConnectionProvider(driver, options) - -case DB2ConnectionProvider.driverClass => - logDebug("DB2 connection provider found") - new DB2ConnectionProvider(driver, options) - -case MSSQLConnectionProvider.driverClass => - logDebug("MS SQL connection provider found") - new MSSQLConnectionProvider(driver, options) - -case OracleConnectionProvider.driverClass => - logDebug("Oracle connection provider found") - new OracleConnectionProvider(driver, options) - -case _ => - throw new IllegalArgumentException(s"Driver ${options.driverClass} does not support " + -"Kerberos authentication") + private val providers = loadProviders() + + def loadProviders(): Seq[JdbcConnectionProvider] = { +val loader = ServiceLoader.load(classOf[JdbcConnectionProvider], + Utils.getContextOrSparkClassLoader) +val providers = mutable.ArrayBuffer[JdbcConnectionProvider]() + +val iterator = loader.iterator +while (iterator.hasNext) { + try { +val provider = iterator.next +logDebug(s"Loaded built in provider: $provider") +providers += provider + } catch { +case t: Throwable => + logError(s"Failed to load built in provider.", t) } } +// Seems duplicate but it's needed for Scala 2.13 +providers.toSeq + } + + def create(driver: Driver, options: Map[String, String]): Connection = { +val filteredProviders = providers.filter(_.canHandle(driver, options)) +require(filteredProviders.size == 1, + "JDBC connection initiated but not exactly one connection provider found which can handle " + +s"it. Found active providers: ${filteredProviders.mkString(", ")}") +SecurityConfigurationLock.synchronized { Review comment: What I can imagine is to pre-set the JVM JAAS config for databases from file like this: `java -Djava.security.auth.login.config=jaas.conf`. And then a new flag can be introduced like: `spark.jdbc.doNotSyncronize` which is default `true`. That case security credentials MUSTN'T be provided by Spark's JDBC properties but only from JAAS file. However this would be super risky and for advanced users only. That said I've spent at least a month to debug the JVM security context what's going on... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: