[GitHub] [spark] ueshin opened a new pull request #34685: [SPARK-37443][PYTHON] Provide a profiler for Python/Pandas UDFs

2021-11-22 Thread GitBox


ueshin opened a new pull request #34685:
URL: https://github.com/apache/spark/pull/34685


   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   Currently a profiler is provided for only `RDD` operations, but providing a 
profiler for Python/Pandas UDFs would be great.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the profiler will work and show the results when a Spark conf 
`spark.python.profile` is set to `true`.
   
   ### How was this patch tested?
   
   Added some tests and manually checked the behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975825876


   **[Test build #145517 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)**
 for PR 34611 at commit 
[`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


SparkQA commented on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-976006138


   **[Test build #145517 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)**
 for PR 34611 at commit 
[`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source

2021-11-22 Thread GitBox


SparkQA commented on pull request #34596:
URL: https://github.com/apache/spark/pull/34596#issuecomment-975988587


   **[Test build #145521 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145521/testReport)**
 for PR 34596 at commit 
[`309643b`](https://github.com/apache/spark/commit/309643bea16446d58c9e86e7e0e60988c33934fb).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down

2021-11-22 Thread GitBox


sathiyapk commented on a change in pull request #34593:
URL: https://github.com/apache/spark/pull/34593#discussion_r754677290



##
File path: sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala
##
@@ -319,6 +370,18 @@ class MathFunctionsSuite extends QueryTest with 
SharedSparkSession {
   df.withColumn("value_brounded", bround('value)),
   Seq(Row(BigDecimal("5.9"), BigDecimal("6")))
 )
+checkAnswer(
+  df.withColumn("value_rounded", round('value, 0, "up")),
+  Seq(Row(BigDecimal("5.9"), BigDecimal("6")))
+)
+checkAnswer(
+  df.withColumn("value_brounded", round('value, 0, "down")),

Review comment:
   yes




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down

2021-11-22 Thread GitBox


sathiyapk commented on a change in pull request #34593:
URL: https://github.com/apache/spark/pull/34593#discussion_r754675110



##
File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
##
@@ -2197,13 +2197,23 @@ object functions {
   def round(e: Column): Column = round(e, 0)
 
   /**
-   * Round the value of `e` to `scale` decimal places with HALF_UP round mode
-   * if `scale` is greater than or equal to 0 or at integral part when `scale` 
is less than 0.
+   * Returns the value of the column `e` rounded to 0 decimal places with 
HALF_UP round mode.
*
* @group math_funcs
* @since 1.5.0
*/
-  def round(e: Column, scale: Int): Column = withExpr { Round(e.expr, 
Literal(scale)) }
+  def round(e: Column, scale: Int): Column = round(e, scale, "half_up")
+
+  /**
+   * Round the value of `e` to `scale` decimal places with given round mode, 
default: HALF_UP
+   * if `scale` is greater than or equal to 0 or at integral part when `scale` 
is less than 0.
+   *
+   * @group math_funcs
+   * @since 3.3.0
+   */
+  def round(e: Column, scale: Int, mode: String): Column = withExpr {

Review comment:
   I see this function mostly as a companion to the previously existing 
function than a completely new function. Otherwise it would be a bit weird the 
round function with 2 arguments takes scale parameter in `Int` type but the 
round function with 3 arguments takes the scale parameter in `Column` type. 
What do you think ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source

2021-11-22 Thread GitBox


sadikovi commented on pull request #34596:
URL: https://github.com/apache/spark/pull/34596#issuecomment-975965380


   jenkins retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down

2021-11-22 Thread GitBox


sathiyapk commented on a change in pull request #34593:
URL: https://github.com/apache/spark/pull/34593#discussion_r754671246



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
##
@@ -1500,26 +1514,37 @@ abstract class RoundBase(child: Expression, scale: 
Expression,
 }
 
 /**
- * Round an expression to d decimal places using HALF_UP rounding mode.
- * round(2.5) == 3.0, round(3.5) == 4.0.
+ * Round an expression to d decimal places using given rounding mode, default 
rounding mode HALF_UP.
+ * round(2.5) == 3.0, round(3.5) == 4.0,
+ * round(3.6, 0, 'half_down') == 4.0, round(3.6, 0, 'down') == 3.0.
  */
 // scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(expr, d) - Returns `expr` rounded to `d` decimal places 
using HALF_UP rounding mode.",
+  usage = "_FUNC_(expr, d) - Returns `expr` rounded to `d` decimal places 
using given rounding mode.",
   examples = """
 Examples:
   > SELECT _FUNC_(2.5, 0);
3
+  > SELECT _FUNC_(2.6, 0, 'down');

Review comment:
   yes sure.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sathiyapk commented on a change in pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down

2021-11-22 Thread GitBox


sathiyapk commented on a change in pull request #34593:
URL: https://github.com/apache/spark/pull/34593#discussion_r754670983



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
##
@@ -413,6 +413,18 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
 if (doubled > pow10diff || doubled == pow10diff && longVal % 2 != 
0) {
   longVal += (if (droppedDigits < 0) -1L else 1L)
 }
+  case ROUND_UP =>

Review comment:
   Hi @sarutak thanks for your review. Out of curiosity, may i know why it 
will be a bug if `changePrecision` takes `ROUND_UP`, `ROUND_DOWN` or 
`ROUND_HALF_DOWN` ?
   
   So we could do something like this :
   ```
   case ROUND_UP | ROUND_DOWN | ROUND_HALF_DOWN | _ =>
   throw QueryExecutionErrors.unsupportedRoundingMode(roundMode)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975950959


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49991/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975950961


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49992/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975950959


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49991/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975950961


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49992/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754647909



##
File path: python/pyspark/ml/common.py
##
@@ -15,11 +15,15 @@
 # limitations under the License.
 #
 
+from typing import Any, Callable
+from pyspark.ml._typing import C, JavaObjectOrPickleDump

Review comment:
   In general, whatever goes into `TYPE_CHECKING` is not imported during 
normal execution. So as is, all the names we use would be undefined when these 
scripts are imported.
   
   [PEP 563](https://www.python.org/dev/peps/pep-0563/) introduces a concept of 
postponed evaluation, but were not ready to go there yet (for starters, we 
still didn't formally drop 3.6 support ‒ I am working on cleaning the code, 
then we have some components that might require code changes).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975932293


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


SparkQA commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975931222


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49992/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754642606



##
File path: python/pyspark/ml/common.py
##
@@ -15,11 +15,15 @@
 # limitations under the License.
 #
 
+from typing import Any, Callable
+from pyspark.ml._typing import C, JavaObjectOrPickleDump

Review comment:
   Ah OK. The `TYPE_CHECKING` block addresses the namespace pollution 
issue, I suppose.
   
   But can you elaborate on why the type needs to be quoted? I understand 
that's for when the type is not known at that point in time (e.g. a 
self-referential type), but that isn't the case here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754637469



##
File path: python/pyspark/ml/common.py
##
@@ -15,11 +15,15 @@
 # limitations under the License.
 #
 
+from typing import Any, Callable
+from pyspark.ml._typing import C, JavaObjectOrPickleDump

Review comment:
   This import should happen in TYPE_CHECKING block
   
   ```python
   from typing import TYPE_CHECKING
   
   if TYPE_CHECKING:
   from pyspark.ml._typing import C, JavaObjectOrPickleDump
   ```
   
   Consequently, `JavaObjectOrPickleDump` and `C` have to be quoted when used 
(`"JavaObjectOrPickleDump"`), i.e.
   
   ```python
   def _java2py(sc: SparkContext, r: "JavaObjectOrPickleDump", encoding: str = 
"bytes") -> Any: ...
   ```
   
   That's because objects in stubs have no runtime equivalents.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754637469



##
File path: python/pyspark/ml/common.py
##
@@ -15,11 +15,15 @@
 # limitations under the License.
 #
 
+from typing import Any, Callable
+from pyspark.ml._typing import C, JavaObjectOrPickleDump

Review comment:
   This import should happen in TYPE_CHECKING block
   
   ```python
   from typing import TYPE_CHECKING
   
   if TYPE_CHECKING:
   from pyspark.ml._typing import C, JavaObjectOrPickleDump
   ```
   
   Consequently, `JavaObjectOrPickleDump` and `C` have to be quoted when used 
(`"JavaObjectOrPickleDump"`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975908416


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49989/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975908414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145515/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975908413






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975908414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145515/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975908416


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49989/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975908413






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975902438


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


SparkQA commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975900558


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49992/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975667590


   **[Test build #145515 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145515/testReport)**
 for PR 34668 at commit 
[`b703492`](https://github.com/apache/spark/commit/b703492482c67cd6616e617cbdd0155c2a9aff27).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


SparkQA commented on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975894629


   **[Test build #145515 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145515/testReport)**
 for PR 34668 at commit 
[`b703492`](https://github.com/apache/spark/commit/b703492482c67cd6616e617cbdd0155c2a9aff27).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup

2021-11-22 Thread GitBox


zero323 commented on pull request #34655:
URL: https://github.com/apache/spark/pull/34655#issuecomment-975894324


   > @zero323 - Is the cleanup of pycodestyle configs along the lines of what 
you were expecting?
   
   Looks sensible, but I wouldn't mind more eyes on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


SparkQA commented on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975893441


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49989/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975893327


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975886499


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975877377


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145520/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975866493


   **[Test build #145520 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)**
 for PR 34636 at commit 
[`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975877377


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145520/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


SparkQA commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975877062


   **[Test build #145520 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)**
 for PR 34636 at commit 
[`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975875789


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145519/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975866358


   **[Test build #145519 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)**
 for PR 34671 at commit 
[`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975875789


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145519/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975875557


   **[Test build #145519 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)**
 for PR 34671 at commit 
[`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34636: [WIP][SPARK-37359][K8S] Cleanup the Spark Kubernetes Integration tests

2021-11-22 Thread GitBox


SparkQA commented on pull request #34636:
URL: https://github.com/apache/spark/pull/34636#issuecomment-975866493


   **[Test build #145520 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145520/testReport)**
 for PR 34636 at commit 
[`14022c0`](https://github.com/apache/spark/commit/14022c0b2e3cdb85ac11511e11a5827a920b8070).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975866358


   **[Test build #145519 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145519/testReport)**
 for PR 34671 at commit 
[`15b5fc4`](https://github.com/apache/spark/commit/15b5fc4497ab7d0f3f13a22aae7b133a73af6e12).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975865837


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145518/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34684: [SPARK-37442] - InMemoryRelation statistics bug causing broadcast join failures with AQE enabled

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975866093


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975865837


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145518/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


SparkQA commented on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975862121


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49989/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975861755


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975857224


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ChenMichael commented on pull request #34684: [SPARK-37442] - InMemoryRelation statistics bug causing broadcast join failures with AQE enabled

2021-11-22 Thread GitBox


ChenMichael commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975851968


   @cloud-fan can you take a look when you have some time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup

2021-11-22 Thread GitBox


nchammas commented on pull request #34655:
URL: https://github.com/apache/spark/pull/34655#issuecomment-975848602


   @zero323 - Is the cleanup of pycodestyle configs along the lines of what you 
were expecting?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…

2021-11-22 Thread GitBox


ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896


   I'm not sure if this is the best way to solve this bug so I will detail the 
other solution I could come up with and then detail the possible problems I 
could see.
   
   1. Materialize cached rdd immediately (.count)
   - `buildBuffers` becomes a blocking call. Moves cost of materialization 
from execution time to planning.
   - Maybe there is some code that assumes the cached rdd is not 
materialized until execution.
   - If someone obtained the cached rdd, but never executes it, then with 
the new changes there is wasted effort materializing the rdd. From the code, it 
seems like obtaining the rdd is always followed up by submitting the job to 
DAGScheduler so I don't know why this would happen.
   
   2. Never use accumulator stats for InMemoryRelation with AQE on
   - These accumulator stats should be more accurate than the estimated 
stats, so there can be missed opportunities for optimization


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ChenMichael commented on pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…

2021-11-22 Thread GitBox


ChenMichael commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896


   I'm not sure if this is the best way to solve this bug so I will detail the 
other solution I could come up with and then compare the possible problems with 
them.
   
   1. Materialize cached rdd immediately (.count)
   - `buildBuffers` becomes a blocking call. Moves cost of materialization 
from execution time to planning.
   - Maybe there is some code that assumes the cached rdd is not 
materialized until execution.
   - If someone obtained the cached rdd, but never executes it, then with 
the new changes there is wasted effort materializing the rdd. From the code, it 
seems like obtaining the rdd is always followed up by submitting the job to 
DAGScheduler so I don't know why this would happen.
   
   2. Never use accumulator stats for InMemoryRelation with AQE on
   - These accumulator stats should be more accurate than the estimated 
stats, so there can be missed opportunities for optimization


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754570168



##
File path: python/pyspark/mllib/common.py
##
@@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"):
 return r
 
 
-def callJavaFunc(sc, func, *args):
+def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) 
-> JavaObjectType:

Review comment:
   Never mind, just saw your other comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754568896



##
File path: python/pyspark/mllib/common.py
##
@@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"):
 return r
 
 
-def callJavaFunc(sc, func, *args):
+def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) 
-> JavaObjectType:

Review comment:
   Not sure of a better alternative to `JavaObjectType` (which I agree is 
not a great name).
   
   Maybe `JavaSerializedType`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975828766


   **[Test build #145518 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)**
 for PR 34671 at commit 
[`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975839386


   **[Test build #145518 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)**
 for PR 34671 at commit 
[`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754565478



##
File path: python/pyspark/ml/_typing.pyi
##
@@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[
 Literal["microRecall"],
 Literal["microF1Measure"],
 ]
-ClusteringEvaluatorMetricType = Union[Literal["silhouette"]]
+ClusteringEvaluatorMetricType = Literal["silhouette"]

Review comment:
   I am OK with this either way, just giving the context.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ChenMichael opened a new pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…

2021-11-22 Thread GitBox


ChenMichael opened a new pull request #34684:
URL: https://github.com/apache/spark/pull/34684


   
   
   
   
   ### What changes were proposed in this pull request?
   Immediately materialize underlying rdd cache (using .count) for an 
InMemoryRelation when `buildBuffers` is called.
   
   
   
   ### Why are the changes needed?
   
   Currently, when `CachedRDDBuilder.buildBuffers` is called, 
`InMemoryRelation.computeStats` will try to read the accumulators to determine 
what the relation size is. However, the accumulators are not actually accurate 
until the cachedRDD is executed and finishes. While this has not happened, the 
accumulators will report a range from 0 bytes to the accumulator value when the 
cachedRDD finishes. In AQE, join planning can happen during this time and, if 
it reads the size as 0 bytes, will likely plan a broadcast join mistakenly 
believing the build side is very small. If the InMemoryRelation is actually 
very large in size, then this will cause many issues during execution such as 
job failure due to broadcasting over 8GB.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Before, cache materialization doesn't happen until the job starts to 
run. Now, it happens when trying to get the rdd representing an 
InMemoryRelation.
   
   ### How was this patch tested?
   
   Unit tests
   Tests added


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754564756



##
File path: python/pyspark/mllib/_typing.pyi
##
@@ -16,8 +16,11 @@
 # specific language governing permissions and limitations
 # under the License.
 
-from typing import List, Tuple, Union
+from typing import List, Tuple, TypeVar, Union
 from pyspark.mllib.linalg import Vector
 from numpy import ndarray  # noqa: F401
+from py4j.java_gateway import JavaObject
 
 VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]]
+C = TypeVar("C", bound=type)
+JavaObjectType = Union[JavaObject, bytearray, bytes]

Review comment:
   On a second thought, let's just use something more descriptive ‒ 
`JavaObjectOrPickleDump` or something around these lines.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754563130



##
File path: python/pyspark/mllib/common.py
##
@@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"):
 return r
 
 
-def callJavaFunc(sc, func, *args):
+def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) 
-> JavaObjectType:

Review comment:
   If you want to keep the alias 
(https://github.com/apache/spark/pull/34671/files#r754559725, maybe with more 
descriptive name)  we could use it here as well
   
   `func: Callable[..., JavaObjectType]`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754563130



##
File path: python/pyspark/mllib/common.py
##
@@ -113,16 +117,16 @@ def _java2py(sc, r, encoding="bytes"):
 return r
 
 
-def callJavaFunc(sc, func, *args):
+def callJavaFunc(sc: pyspark.context.SparkContext, func: Callable, *args: Any) 
-> JavaObjectType:

Review comment:
   If you want to keep the alias 
(https://github.com/apache/spark/pull/34671/files#r754559725, maybe with more 
descriptive name)  we could have t used here
   
   `func: Callable[..., JavaObjectType]`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754560349



##
File path: python/pyspark/mllib/_typing.pyi
##
@@ -16,8 +16,11 @@
 # specific language governing permissions and limitations
 # under the License.
 
-from typing import List, Tuple, Union
+from typing import List, Tuple, TypeVar, Union
 from pyspark.mllib.linalg import Vector
 from numpy import ndarray  # noqa: F401
+from py4j.java_gateway import JavaObject
 
 VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]]
+C = TypeVar("C", bound=type)
+JavaObjectType = Union[JavaObject, bytearray, bytes]

Review comment:
   OK, will do. I originally had it inlined but created the alias since 
it's used in 4 different places in `mllib/common.py`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754559725



##
File path: python/pyspark/mllib/_typing.pyi
##
@@ -16,8 +16,11 @@
 # specific language governing permissions and limitations
 # under the License.
 
-from typing import List, Tuple, Union
+from typing import List, Tuple, TypeVar, Union
 from pyspark.mllib.linalg import Vector
 from numpy import ndarray  # noqa: F401
+from py4j.java_gateway import JavaObject
 
 VectorLike = Union[ndarray, Vector, List[float], Tuple[float, ...]]
+C = TypeVar("C", bound=type)
+JavaObjectType = Union[JavaObject, bytearray, bytes]

Review comment:
   I think it makes more sense to inline this `Union` directly where it is 
used ‒ it is not shared, and alias doesn't really help to understand meaning of 
the unioned types.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754558752



##
File path: python/pyspark/ml/_typing.pyi
##
@@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[
 Literal["microRecall"],
 Literal["microF1Measure"],
 ]
-ClusteringEvaluatorMetricType = Union[Literal["silhouette"]]
+ClusteringEvaluatorMetricType = Literal["silhouette"]

Review comment:
   Shall I revert this change, then? My editor was complaining about the 
`Union` since it only had one member (and I agree with my editor in this case, 
but it's not a big deal to me).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975828766


   **[Test build #145518 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145518/testReport)**
 for PR 34671 at commit 
[`9b08ad0`](https://github.com/apache/spark/commit/9b08ad02dfd1c2aa7609f29ac353d091704cc25e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


xinrong-databricks commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754556211



##
File path: python/pyspark/ml/common.py
##
@@ -53,24 +57,24 @@ def _new_smart_decode(obj):
 
 
 # this will call the ML version of pythonToJava()
-def _to_java_object_rdd(rdd):
+def _to_java_object_rdd(rdd: RDD) -> JavaObject:
 """Return an JavaRDD of Object by unpickling
 
 It will convert each Python object into Java object by Pickle, whenever the
 RDD is serialized in batch or not.
 """
-rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
-return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)
+rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))  # type: 
ignore[attr-defined]
+return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)  
# type: ignore[attr-defined]

Review comment:
   FYI https://github.com/apache/spark/pull/34466




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


SparkQA commented on pull request #34611:
URL: https://github.com/apache/spark/pull/34611#issuecomment-975825876


   **[Test build #145517 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145517/testReport)**
 for PR 34611 at commit 
[`86d28cc`](https://github.com/apache/spark/commit/86d28cc6bbe8db8ba3ab79b49f41833df399d535).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975825755


   > The number of `ignore[attr-defined]` hints required seems a little wrong. 
But I suppose addressing that would require changes to `SparkContext`, which is 
out of scope for this PR.
   
   That's not optimal, but expected (see 
https://github.com/apache/spark/pull/34680#issuecomment-975397090). If you 
encounter case where there is no ongoing migration work and you can avoid 
ignores, it should be OK to extend the stub. Otherwise, we'll do another pass 
(ignores on `_jvm` are likely to stay, even if different error code, because we 
cannot lurk into JVM at this stage).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


xinrong-databricks commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754555187



##
File path: python/pyspark/ml/common.py
##
@@ -53,24 +57,24 @@ def _new_smart_decode(obj):
 
 
 # this will call the ML version of pythonToJava()
-def _to_java_object_rdd(rdd):
+def _to_java_object_rdd(rdd: RDD) -> JavaObject:
 """Return an JavaRDD of Object by unpickling
 
 It will convert each Python object into Java object by Pickle, whenever the
 RDD is serialized in batch or not.
 """
-rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
-return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)
+rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))  # type: 
ignore[attr-defined]
+return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)  
# type: ignore[attr-defined]

Review comment:
   I think it is fine to leave `type: ignore[attr-defined]` here. We may 
come back to these after inlining type hints for SparkContext for example.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-databricks commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


xinrong-databricks commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754554096



##
File path: python/pyspark/ml/common.py
##
@@ -53,24 +57,24 @@ def _new_smart_decode(obj):
 
 
 # this will call the ML version of pythonToJava()
-def _to_java_object_rdd(rdd):
+def _to_java_object_rdd(rdd: RDD) -> JavaObject:
 """Return an JavaRDD of Object by unpickling
 
 It will convert each Python object into Java object by Pickle, whenever the
 RDD is serialized in batch or not.
 """
-rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
-return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)
+rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))  # type: 
ignore[attr-defined]
+return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)  
# type: ignore[attr-defined]

Review comment:
   We didn't annotate types of `_reserialize` or`_jvm`  in 
https://github.com/apache/spark/blob/master/python/pyspark/rdd.pyi or 
https://github.com/apache/spark/blob/master/python/pyspark/context.pyi.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975823020


   **[Test build #145516 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)**
 for PR 34671 at commit 
[`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975823967


   **[Test build #145516 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)**
 for PR 34671 at commit 
[`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975823990


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145516/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975823990


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145516/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


kazuyukitanimura commented on a change in pull request #34611:
URL: https://github.com/apache/spark/pull/34611#discussion_r754552871



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##
@@ -53,19 +53,47 @@ public void skip() {
 throw new UnsupportedOperationException();
   }
 
+  private void updateCurrentByte() {
+try {
+  currentByte = (byte) in.read();
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read a byte", e);
+}
+  }
+
   @Override
   public final void readBooleans(int total, WritableColumnVector c, int rowId) 
{
-// TODO: properly vectorize this
-for (int i = 0; i < total; i++) {
-  c.putBoolean(rowId + i, readBoolean());
+int i = 0;
+if (bitOffset > 0) {
+  i = Math.min(8 - bitOffset, total);

Review comment:
   I don't think we need to update the `total` here. Only the `i` gets 
updated.
   Any particular examples that you are worried?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975823020


   **[Test build #145516 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145516/testReport)**
 for PR 34671 at commit 
[`ed56686`](https://github.com/apache/spark/commit/ed56686a519462f53050aaba0b0766e1f3e8ddf5).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


kazuyukitanimura commented on a change in pull request #34611:
URL: https://github.com/apache/spark/pull/34611#discussion_r754552760



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##
@@ -53,19 +53,47 @@ public void skip() {
 throw new UnsupportedOperationException();
   }
 
+  private void updateCurrentByte() {

Review comment:
   Thanks, I would keep `updateCurrentByte` for now. Because public methods 
are using `read***` prefix, I chose `updateCurrentByte` to distinguish this 
private method.

##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
##
@@ -470,6 +493,18 @@ public final int appendBooleans(int count, boolean v) {
 return result;
   }
 
+  /**
+   * Append bits from [src[offset], src[offset + count])
+   * src must contain bit-packed 8 Booleans in the byte.

Review comment:
   Updated

##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
##
@@ -130,6 +133,89 @@ class ColumnarBatchSuite extends SparkFunSuite {
   }
   }
 
+  testVector("Boolean APIs", 1024, BooleanType) {
+column =>
+  val reference = mutable.ArrayBuffer.empty[Boolean]
+
+  var values = Array(true, false, true, false, false)
+  var bits = values.foldRight(0)((b, i) => i << 1 | (if (b) 1 else 
0)).toByte
+  column.appendBooleans(2, bits, 0)
+  reference ++= values.slice(0, 2)
+
+  column.appendBooleans(3, bits, 2)
+  reference ++= values.slice(2, 5)
+
+  column.appendBooleans(6, true)
+  reference ++= Array.fill(6)(true)
+
+  column.appendBoolean(false)
+  reference += false
+
+  var idx = column.elementsAppended
+
+  values = Array(true, true, false, true, false, true, false, true)
+  bits = values.foldRight(0)((b, i) => i << 1 | (if (b) 1 else 0)).toByte
+  column.putBooleans(idx, 2, bits, 0)
+  reference ++= values.slice(0, 2)
+  idx += 2
+
+  column.putBooleans(idx, 3, bits, 2)
+  reference ++= values.slice(2, 5)
+  idx += 3
+
+  column.putBooleans(idx, bits)
+  reference ++= values
+  idx += 8
+
+  column.putBoolean(idx, false)
+  reference += false
+  idx += 1
+
+  column.putBooleans(idx, 3, true)
+  reference ++= Array.fill(3)(true)
+  idx += 3
+
+  implicit def intToByte(i: Int): Byte = i.toByte
+  val buf = ByteBuffer.wrap(Array(0x33, 0x5A, 0xA5, 0xCC, 0x0F, 0xF0, 
0xEE))
+  val reader = new VectorizedPlainValuesReader()
+  reader.initFromPage(0, ByteBufferInputStream.wrap(buf))
+  column.putBoolean(idx, reader.readBoolean) // bit index 0
+  reference += true
+  idx += 1
+
+  reader.skipBooleans(1)
+
+  column.putBoolean(idx, reader.readBoolean) // bit index 2
+  reference += false
+  idx += 1
+
+  reader.skipBooleans(5)
+
+  column.putBoolean(idx, reader.readBoolean) // bit index 8
+  reference += false
+  idx += 1
+
+  reader.skipBooleans(8)

Review comment:
   Updated

##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##
@@ -53,19 +53,47 @@ public void skip() {
 throw new UnsupportedOperationException();
   }
 
+  private void updateCurrentByte() {
+try {
+  currentByte = (byte) in.read();
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read a byte", e);
+}
+  }
+
   @Override
   public final void readBooleans(int total, WritableColumnVector c, int rowId) 
{
-// TODO: properly vectorize this
-for (int i = 0; i < total; i++) {
-  c.putBoolean(rowId + i, readBoolean());
+int i = 0;
+if (bitOffset > 0) {
+  i = Math.min(8 - bitOffset, total);

Review comment:
   I don't think we need to update the `total` here. Only the `I` gets 
updated.
   Any particular examples that you are worried?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kazuyukitanimura commented on a change in pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-22 Thread GitBox


kazuyukitanimura commented on a change in pull request #34611:
URL: https://github.com/apache/spark/pull/34611#discussion_r754552682



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##
@@ -53,19 +53,47 @@ public void skip() {
 throw new UnsupportedOperationException();
   }
 
+  private void updateCurrentByte() {
+try {
+  currentByte = (byte) in.read();
+} catch (IOException e) {
+  throw new ParquetDecodingException("Failed to read a byte", e);
+}
+  }
+
   @Override
   public final void readBooleans(int total, WritableColumnVector c, int rowId) 
{
-// TODO: properly vectorize this
-for (int i = 0; i < total; i++) {
-  c.putBoolean(rowId + i, readBoolean());
+int i = 0;
+if (bitOffset > 0) {
+  i = Math.min(8 - bitOffset, total);
+  c.putBooleans(rowId, i, currentByte, bitOffset);
+  bitOffset = (bitOffset + i) & 7;
+}
+for (; i + 7 < total; i += 8) {
+  updateCurrentByte();
+  c.putBooleans(rowId + i, currentByte);
+}
+if (i < total) {
+  updateCurrentByte();
+  bitOffset = total - i;
+  c.putBooleans(rowId + i, bitOffset, currentByte, 0);
 }
   }
 
   @Override
   public final void skipBooleans(int total) {
-// TODO: properly vectorize this
-for (int i = 0; i < total; i++) {
-  readBoolean();
+// using >>3 instead of /8 below since Java division rounds towards zero 
i.e. (-1)/8=0

Review comment:
   `(total - (8 - bitOffset))` can be negative.
   E.g. `total=1`, `bitOffset=1`, for this combination, `(total - (8 - 
bitOffset))=(1-(8-1))=-6`. And `(-7)>>3=(-1)` is what we need. So the comment 
still holds. This example is covered in `ColumnarBatchSuite.scala`
   
   Added `skipBooleans(0)` test.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754552177



##
File path: python/pyspark/ml/_typing.pyi
##
@@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[
 Literal["microRecall"],
 Literal["microF1Measure"],
 ]
-ClusteringEvaluatorMetricType = Union[Literal["silhouette"]]
+ClusteringEvaluatorMetricType = Literal["silhouette"]

Review comment:
   This was intentional to make it consistent with the remaining 
`*EvaluatorMetricTypes`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754551374



##
File path: python/pyspark/ml/common.py
##
@@ -53,24 +57,24 @@ def _new_smart_decode(obj):
 
 
 # this will call the ML version of pythonToJava()
-def _to_java_object_rdd(rdd):
+def _to_java_object_rdd(rdd: RDD) -> JavaObject:
 """Return an JavaRDD of Object by unpickling
 
 It will convert each Python object into Java object by Pickle, whenever the
 RDD is serialized in batch or not.
 """
-rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
-return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)
+rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))  # type: 
ignore[attr-defined]
+return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)  
# type: ignore[attr-defined]

Review comment:
   At the moment, both `SparkContext` and `RDD` don't have inline 
annotations and stubs, intentionally, cover only public API. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 edited a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


zero323 edited a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-974525917


   > Just to be clear, are you saying I should split this PR into ml/common.py 
vs. mllib/common.py?
   >
   > And then have an umbrella ticket for adding type annotations to all of 
ml/, and another one for mllib/?
   
   For the context ‒ we're in the middle of the process of inlining hints from 
stubs to inline hints. At the moment we have two umbrella tickets ‒ SPARK-36845 
and SPARK-37094 for SQL and core respectively. We should follow this convention 
for ml and mllib as well.  
   
   It should be OK to have two tickets (`ml.common` and `mllib.common`) and 
resolve both in this PR, since you've already started working on that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975814940


   The number of `ignore[attr-defined]` hints required seems a little wrong. 
But I suppose addressing that would require changes to `SparkContext`, which is 
out of scope for this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yliou commented on pull request #34622: SPARK-37340 Display StageIds in Operators for SQL UI

2021-11-22 Thread GitBox


yliou commented on pull request #34622:
URL: https://github.com/apache/spark/pull/34622#issuecomment-975808167


   cc @tgravescs would this feature be of interest?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yliou commented on pull request #34637: Spark-37349 add SQL Rest API parsing logic

2021-11-22 Thread GitBox


yliou commented on pull request #34637:
URL: https://github.com/apache/spark/pull/34637#issuecomment-975806584


   cc @gengliangwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] nchammas commented on a change in pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


nchammas commented on a change in pull request #34671:
URL: https://github.com/apache/spark/pull/34671#discussion_r754533879



##
File path: python/pyspark/ml/common.py
##
@@ -53,24 +57,24 @@ def _new_smart_decode(obj):
 
 
 # this will call the ML version of pythonToJava()
-def _to_java_object_rdd(rdd):
+def _to_java_object_rdd(rdd: RDD) -> JavaObject:
 """Return an JavaRDD of Object by unpickling
 
 It will convert each Python object into Java object by Pickle, whenever the
 RDD is serialized in batch or not.
 """
-rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
-return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)
+rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))  # type: 
ignore[attr-defined]
+return 
rdd.ctx._jvm.org.apache.spark.ml.python.MLSerDe.pythonToJava(rdd._jrdd, True)  
# type: ignore[attr-defined]

Review comment:
   I don't get exactly why mypy doesn't recognize `_reserialize` and `_jvm` 
as attributes. In the case of `_reserialize`, is it the leading underscore 
making it a "private" method? In the case of `_jvm`, is it that it's a class 
attribute?

##
File path: python/pyspark/ml/_typing.pyi
##
@@ -64,7 +65,7 @@ MultilabelClassificationEvaluatorMetricType = Union[
 Literal["microRecall"],
 Literal["microF1Measure"],
 ]
-ClusteringEvaluatorMetricType = Union[Literal["silhouette"]]
+ClusteringEvaluatorMetricType = Literal["silhouette"]

Review comment:
   Not sure what the `Union` here was for.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975786325


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145513/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


SparkQA removed a comment on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975667360


   **[Test build #145513 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145513/testReport)**
 for PR 34681 at commit 
[`65a8758`](https://github.com/apache/spark/commit/65a875846d4f3e5a8fb4978c3d82a2fd1ba3a720).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975786325


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145513/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


SparkQA commented on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975786004


   **[Test build #145513 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145513/testReport)**
 for PR 34681 at commit 
[`65a8758`](https://github.com/apache/spark/commit/65a875846d4f3e5a8fb4978c3d82a2fd1ba3a720).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975775073


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975775068


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49986/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975775070


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49987/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975775070


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49987/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975775068


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49986/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


AmplabJenkins commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975775073


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34681: [SPARK-37438][SQL] ANSI mode: Use store assignment rules for resolving function invocation

2021-11-22 Thread GitBox


SparkQA commented on pull request #34681:
URL: https://github.com/apache/spark/pull/34681#issuecomment-975766915


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49987/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py

2021-11-22 Thread GitBox


SparkQA commented on pull request #34671:
URL: https://github.com/apache/spark/pull/34671#issuecomment-975760231


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49985/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments

2021-11-22 Thread GitBox


SparkQA commented on pull request #34668:
URL: https://github.com/apache/spark/pull/34668#issuecomment-975753614


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49986/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API

2021-11-22 Thread GitBox


gaborgsomogyi commented on a change in pull request #29024:
URL: https://github.com/apache/spark/pull/29024#discussion_r754487136



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala
##
@@ -18,60 +18,45 @@
 package org.apache.spark.sql.execution.datasources.jdbc.connection
 
 import java.sql.{Connection, Driver}
-import java.util.Properties
+import java.util.ServiceLoader
 
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
-
-/**
- * Connection provider which opens connection toward various databases 
(database specific instance
- * needed). If kerberos authentication required then it's the provider's 
responsibility to set all
- * the parameters.
- */
-private[jdbc] trait ConnectionProvider {
-  /**
-   * Additional properties for data connection (Data source property takes 
precedence).
-   */
-  def getAdditionalProperties(): Properties = new Properties()
+import scala.collection.mutable
 
-  /**
-   * Opens connection toward the database.
-   */
-  def getConnection(): Connection
-}
+import org.apache.spark.internal.Logging
+import org.apache.spark.security.SecurityConfigurationLock
+import org.apache.spark.sql.jdbc.JdbcConnectionProvider
+import org.apache.spark.util.Utils
 
 private[jdbc] object ConnectionProvider extends Logging {
-  def create(driver: Driver, options: JDBCOptions): ConnectionProvider = {
-if (options.keytab == null || options.principal == null) {
-  logDebug("No authentication configuration found, using basic connection 
provider")
-  new BasicConnectionProvider(driver, options)
-} else {
-  logDebug("Authentication configuration found, using database specific 
connection provider")
-  options.driverClass match {
-case PostgresConnectionProvider.driverClass =>
-  logDebug("Postgres connection provider found")
-  new PostgresConnectionProvider(driver, options)
-
-case MariaDBConnectionProvider.driverClass =>
-  logDebug("MariaDB connection provider found")
-  new MariaDBConnectionProvider(driver, options)
-
-case DB2ConnectionProvider.driverClass =>
-  logDebug("DB2 connection provider found")
-  new DB2ConnectionProvider(driver, options)
-
-case MSSQLConnectionProvider.driverClass =>
-  logDebug("MS SQL connection provider found")
-  new MSSQLConnectionProvider(driver, options)
-
-case OracleConnectionProvider.driverClass =>
-  logDebug("Oracle connection provider found")
-  new OracleConnectionProvider(driver, options)
-
-case _ =>
-  throw new IllegalArgumentException(s"Driver ${options.driverClass} 
does not support " +
-"Kerberos authentication")
+  private val providers = loadProviders()
+
+  def loadProviders(): Seq[JdbcConnectionProvider] = {
+val loader = ServiceLoader.load(classOf[JdbcConnectionProvider],
+  Utils.getContextOrSparkClassLoader)
+val providers = mutable.ArrayBuffer[JdbcConnectionProvider]()
+
+val iterator = loader.iterator
+while (iterator.hasNext) {
+  try {
+val provider = iterator.next
+logDebug(s"Loaded built in provider: $provider")
+providers += provider
+  } catch {
+case t: Throwable =>
+  logError(s"Failed to load built in provider.", t)
   }
 }
+// Seems duplicate but it's needed for Scala 2.13
+providers.toSeq
+  }
+
+  def create(driver: Driver, options: Map[String, String]): Connection = {
+val filteredProviders = providers.filter(_.canHandle(driver, options))
+require(filteredProviders.size == 1,
+  "JDBC connection initiated but not exactly one connection provider found 
which can handle " +
+s"it. Found active providers: ${filteredProviders.mkString(", ")}")
+SecurityConfigurationLock.synchronized {

Review comment:
   What I can imagine is to pre-set the JVM JAAS config for databases from 
file like this: `java -Djava.security.auth.login.config=jaas.conf`.
   And then a new flag can be introduced like: `spark.jdbc.doNotSyncronize` 
which is default `true`.
   That case security credentials MUSTN'T be provided by Spark's JDBC 
properties but only from JAAS file.
   However this would be super risky and for advanced users only. That said 
I've spent at least a month to debug the JVM security context what's going on...
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: 

<    1   2   3   4   5   6   >