[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version
LuciferYang commented on PR #40283: URL: https://github.com/apache/spark/pull/40283#issuecomment-1455648868

> https://issues.apache.org/jira/browse/MNG-7697

OK, let me test 3.9.1-SNAPSHOT later. @pan3793 Do you have any other issues besides those in the GA task?
[GitHub] [spark] hboutemy commented on pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version
hboutemy commented on PR #40283: URL: https://github.com/apache/spark/pull/40283#issuecomment-1455637895

[@cstamas](https://github.com/cstamas) do you know if the lax parsing covers that `org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen
[GitHub] [spark] hboutemy commented on pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version
hboutemy commented on PR #40283: URL: https://github.com/apache/spark/pull/40283#issuecomment-1455633233

There is a known issue in Maven 3.9.0 (related to stricter XML reading in plexus-utils, https://github.com/codehaus-plexus/plexus-utils/issues/238) that is fixed in 3.9.1-SNAPSHOT: https://issues.apache.org/jira/browse/MNG-7697

3.9.1 will be released soon: could you check with 3.9.1-SNAPSHOT whether you're hitting a different case of this "too strict" XML parsing?
[GitHub] [spark] EnricoMi commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering
EnricoMi commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1455620898

Yes, it looks like it removes the **empty** table location after **overwriting** the table failed due to the `ArithmeticException`. @cloud-fan do you consider the removal of an empty table location after a failed table overwrite to be a regression?
[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema
itholic commented on PR #40280: URL: https://github.com/apache/spark/pull/40280#issuecomment-1455567210

Thanks @panbingkun for the nice fix! Btw, I think I found another `createDataFrame` bug: it does not work properly with a non-nullable schema, as below:

```python
>>> from pyspark.sql.types import *
>>> schema_false = StructType([StructField("id", IntegerType(), False)])
>>> spark.createDataFrame([[1]], schema=schema_false)
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable.
```

whereas it works fine with a nullable schema, as below:

```python
>>> schema_true = StructType([StructField("id", IntegerType(), True)])
>>> spark.createDataFrame([[1]], schema=schema_true)
DataFrame[id: int]
```

Do you have any idea what might be causing this? Could you take a look at it if you're interested? I have filed an issue at SPARK-42679. Also cc @hvanhovell as the original author of `createDataFrame`.
[GitHub] [spark] HeartSaVioR closed pull request #40292: [SPARK-42676][SS] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently
HeartSaVioR closed pull request #40292: [SPARK-42676][SS] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently URL: https://github.com/apache/spark/pull/40292
[GitHub] [spark] HeartSaVioR commented on pull request #40292: [SPARK-42676][SS] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently
HeartSaVioR commented on PR #40292: URL: https://github.com/apache/spark/pull/40292#issuecomment-1455549225

Thanks! Merging to master.
[GitHub] [spark] wangyum opened a new pull request, #40294: [SPARK-40610][SQL] Support unwrap date type to string type
wangyum opened a new pull request, #40294: URL: https://github.com/apache/spark/pull/40294

### What changes were proposed in this pull request?

This PR enhances `UnwrapCastInBinaryComparison` to support unwrapping date type to string type.

### Why are the changes needed?

Avoid always fetching all partitions because the partition filters cannot be pushed down to the Hive metastore. For example:

```sql
CREATE TABLE t1(id int, dt string) using parquet PARTITIONED BY (dt);
EXPLAIN SELECT * FROM t1 WHERE dt > date_add(current_date(), -7);
```

Before SPARK-27638, it pushes partition filters to the Hive metastore:

```
== Physical Plan ==
*(1) FileScan parquet default.t1[id#2,dt#3] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dt#3), (dt#3 > 2023-02-27)], PushedFilters: [], ReadSchema: struct
```

After SPARK-27638, because it will not [convert partition filters](https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L794-L798) to Hive metastore filters, it will not push partition filters to the Hive metastore. As a result, it always fetches all the partitions:

```
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.t1[id#5,dt#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#6), (cast(dt#6 as date) > 2023-02-27)], PushedFilters: [], ReadSchema: struct
```

After this PR, it unwraps date type to string type and then pushes partition filters to the Hive metastore:

```
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), (dt#1 > 2023-02-26)], PushedFilters: [], ReadSchema: struct
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.
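[Editor's sketch] To illustrate what the unwrap does, here is a minimal sketch with hypothetical case classes, not the actual Catalyst rule; it assumes the string column always holds canonical `yyyy-MM-dd` values, for which lexicographic string order agrees with date order:

```scala
import java.time.LocalDate

// Hypothetical mini-AST standing in for Catalyst expressions.
sealed trait Expr
case class Attr(name: String) extends Expr          // string-typed partition column
case class CastToDate(child: Expr) extends Expr
case class DateLit(value: LocalDate) extends Expr
case class StrLit(value: String) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr

// cast(dt as date) > DATE'2023-02-26'  ==>  dt > '2023-02-26'
// Valid when dt is always a canonical yyyy-MM-dd string, because then
// comparing the raw strings is equivalent to comparing the cast dates.
def unwrapDateToString(e: Expr): Expr = e match {
  case GreaterThan(CastToDate(a: Attr), DateLit(d)) => GreaterThan(a, StrLit(d.toString))
  case other => other
}
```

Once the comparison is on the raw string column, it is expressible as a Hive metastore partition filter and partitions can be pruned server-side.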
[GitHub] [spark] LuciferYang commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter
LuciferYang commented on code in PR #40291: URL: https://github.com/apache/spark/pull/40291#discussion_r1125957292

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala: ##

```
@@ -345,6 +345,37 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
     })
   }

+  /**
+   * Saves the content of the `DataFrame` to an external database table via JDBC. In the case the
+   * table already exists in the external database, behavior of this function depends on the save
+   * mode, specified by the `mode` function (default to throwing an exception).
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * JDBC-specific option and parameter documentation for storing tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "batchsize" can be used to control the
+   *   number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED",
+   *   "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard
+   *   transaction isolation levels defined by JDBC's Connection object, with default of
+   *   "READ_UNCOMMITTED".
+   * @since 3.4.0
+   */
+  def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
+    // connectionProperties should override settings in extraOptions.
```

Review Comment:
I have a question @hvanhovell @beliefer. For the connect-client API, should we verify the parameters on the client side or on the server side?
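[Editor's sketch] For context, a hedged sketch of how such a `jdbc` method can be completed in terms of the generic format/options/save path, mirroring the non-connect `DataFrameWriter.jdbc`; `WriterSketch`, `extraOptions`, and `save` are illustrative stand-ins, not the PR's code:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import scala.collection.mutable

// Illustrative stand-in for the connect-client writer.
class WriterSketch {
  private val extraOptions = mutable.Map.empty[String, String]

  def option(key: String, value: String): WriterSketch = {
    extraOptions += key -> value
    this
  }

  def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
    // connectionProperties should override settings already in extraOptions.
    connectionProperties.asScala.foreach { case (k, v) => extraOptions += k -> v }
    extraOptions += "url" -> url
    extraOptions += "dbtable" -> table
    save() // would build a WriteOperation proto with source "jdbc" and these options
  }

  private def save(): Unit =
    println(s"""format=jdbc, options=${extraOptions.mkString(", ")}""")
}
```

Under this shape, client-side validation would live in `jdbc` itself, while a server-side check would happen when the WriteOperation proto is planned, which is exactly the question raised above.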
[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version
LuciferYang commented on PR #40283: URL: https://github.com/apache/spark/pull/40283#issuecomment-1455497586

also cc @HyukjinKwon
[GitHub] [spark] LuciferYang commented on a diff in pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version
LuciferYang commented on code in PR #40283: URL: https://github.com/apache/spark/pull/40283#discussion_r1125949503

## build/mvn: ##

```
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
```

Review Comment:
done
[GitHub] [spark] xinrong-meng commented on a diff in pull request #40244: [WIP][SPARK-42643][CONNECT][PYTHON] Implement `spark.udf.registerJavaFunction`
xinrong-meng commented on code in PR #40244: URL: https://github.com/apache/spark/pull/40244#discussion_r1125939747

## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ##

```
@@ -303,14 +303,15 @@ message Expression {
   message CommonInlineUserDefinedFunction {
     // (Required) Name of the user-defined function.
     string function_name = 1;
-    // (Required) Indicate if the user-defined function is deterministic.
+    // (Optional) Indicate if the user-defined function is deterministic.
     bool deterministic = 2;
```

Review Comment:
JavaUDF has no `deterministic` field, but the server doesn't have logic that relies on that field. So only the comment is changed.
[GitHub] [spark] LuciferYang commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build
LuciferYang commented on code in PR #40283: URL: https://github.com/apache/spark/pull/40283#discussion_r1125934336

## build/mvn: ##

```
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
```

Review Comment:
Good idea
[GitHub] [spark] pan3793 commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build
pan3793 commented on code in PR #40283: URL: https://github.com/apache/spark/pull/40283#discussion_r1125930947

## build/mvn: ##

```
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
```

Review Comment:
I mean change `-lt` to `-ne`, and always respect the `maven.version` defined in `pom.xml`.
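[Editor's sketch] To make the proposal concrete, here is a sketch of the comparison semantics in Scala; the real logic is the `version()` awk helper in `build/mvn`, and this transcription plus all names are illustrative only:

```scala
// build/mvn's version() zero-pads each dot-separated component so versions
// compare numerically; a hypothetical Scala transcription of that idea.
def versionKey(v: String): Long =
  v.split('.').take(3).map(_.toLong).foldLeft(0L)((acc, c) => acc * 1000 + c)

// `-lt` semantics: accept any Maven at least as new as the pinned version.
def acceptAtLeast(detected: String, pinned: String): Boolean =
  versionKey(detected) >= versionKey(pinned)

// `-ne` semantics (the proposal): accept only the exact version from pom.xml,
// which would have kept a system-installed 3.9.0 from being used.
def acceptExact(detected: String, pinned: String): Boolean =
  versionKey(detected) == versionKey(pinned)

// e.g. acceptAtLeast("3.9.0", "3.8.7") == true, acceptExact("3.9.0", "3.8.7") == false
```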
[GitHub] [spark] LuciferYang commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build
LuciferYang commented on code in PR #40283: URL: https://github.com/apache/spark/pull/40283#discussion_r1125929900

## build/mvn: ##

```
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
```

Review Comment:
I am not sure when 3.9.x will be supported, because there is pre-work that needs to be completed first; if the pre-work cannot be completed, versions after 3.9.0 also need to be temporarily disabled.
[GitHub] [spark] pan3793 commented on a diff in pull request #40283: [SPARK-42673][BUILD] Ban Maven 3.9.x for Spark build
pan3793 commented on code in PR #40283: URL: https://github.com/apache/spark/pull/40283#discussion_r1125927078

## build/mvn: ##

```
@@ -119,7 +119,8 @@ install_mvn() {
   if [ "$MVN_BIN" ]; then
     local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
   fi
-  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
```

Review Comment:
why not use exact match here?
[GitHub] [spark] zhengruifeng commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect
zhengruifeng commented on PR #40228: URL: https://github.com/apache/spark/pull/40228#issuecomment-1455466444

merged into master/branch-3.4
[GitHub] [spark] zhengruifeng closed pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect
zhengruifeng closed pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect URL: https://github.com/apache/spark/pull/40228
[GitHub] [spark] hvanhovell commented on pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter
hvanhovell commented on PR #40291: URL: https://github.com/apache/spark/pull/40291#issuecomment-1455425240

hmmm - let me think about it.
[GitHub] [spark] beliefer opened a new pull request, #40293: [SPARK-42677][SQL] Fix the invalid tests for broadcast hint
beliefer opened a new pull request, #40293: URL: https://github.com/apache/spark/pull/40293

### What changes were proposed in this pull request?

Currently, many of the test cases for the broadcast hint are invalid, because the data size is smaller than the broadcast threshold.

### Why are the changes needed?

Fix the invalid tests for the broadcast hint.

### Does this PR introduce _any_ user-facing change?

'No'. Just modify the test cases.

### How was this patch tested?

Corrected test cases.
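[Editor's sketch] Why such tests are vacuous: with small inputs, size-based auto-broadcast fires whether or not the hint is present, so a test can only prove the hint works by disabling the size-based path first. A hedged sketch, not the PR's code; table sizes, names, and config choices are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastHintCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("hint-check").getOrCreate()

    // Disable AQE and size-based broadcast so that only the hint can
    // introduce a BroadcastHashJoin into the physical plan.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val left = spark.range(1000).toDF("id")
    val right = spark.range(1000).toDF("id")

    val joined = left.join(right.hint("broadcast"), "id")
    // If the hint is honored, the plan contains a BroadcastHashJoin.
    assert(joined.queryExecution.executedPlan.toString.contains("BroadcastHashJoin"))

    spark.stop()
  }
}
```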
[GitHub] [spark] anishshri-db commented on pull request #40292: [SPARK-42676] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently
anishshri-db commented on PR #40292: URL: https://github.com/apache/spark/pull/40292#issuecomment-1455397903

@HeartSaVioR - please take a look. Thx
[GitHub] [spark] anishshri-db opened a new pull request, #40292: [SPARK-42676] Write temp checkpoints for streaming queries to local filesystem even if default FS is set differently
anishshri-db opened a new pull request, #40292: URL: https://github.com/apache/spark/pull/40292

### What changes were proposed in this pull request?

Write temp checkpoints for streaming queries to the local filesystem even if the default FS is set differently.

### Why are the changes needed?

We have seen cases where the default FS could be a remote file system, and since the path for streaming checkpoints is not specified explicitly, this could cause a pileup in two cases:
- the query exits with an exception and the flag to force checkpoint removal is not set
- the driver/cluster terminates without the query being terminated gracefully

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Verified that the checkpoint is resolved and written to the local FS:

```
23/03/04 01:42:49 INFO ResolveWriteToStream: Checkpoint root file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9 resolved to file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9.
...
23/03/04 01:46:37 INFO MicroBatchExecution: [queryId = 66c4c] Deleting checkpoint file:/local_disk0/tmp/temporary-c97ab8bd-6b03-4c28-93ea-751d30a2d3f9.
```
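[Editor's sketch] A hedged sketch of the underlying idea, not the PR's actual diff: qualify the temporary checkpoint path against the local filesystem explicitly, so it never falls back to a remote `fs.defaultFS`. All names here are illustrative:

```scala
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LocalTempCheckpoint {
  // Resolve a temp checkpoint dir against the *local* FS, so the resulting
  // path carries an explicit file:/ scheme even when fs.defaultFS is remote.
  def resolve(hadoopConf: Configuration): Path = {
    val raw = new Path(System.getProperty("java.io.tmpdir"), s"temporary-${UUID.randomUUID()}")
    val localFs = FileSystem.getLocal(hadoopConf)
    raw.makeQualified(localFs.getUri, localFs.getWorkingDirectory)
  }

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical remote default FS
    println(resolve(conf)) // prints a file:/... path regardless of the default FS
  }
}
```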
[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
beliefer commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1455392063

> I guess we will need to rewrite the lambda function in the Spark Connect planner.

Yeah.
[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
beliefer commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1455390728

![image](https://user-images.githubusercontent.com/8486025/223014232-bf9b26ee-d0e8-4de4-a8fe-2d252813ac4d.png)
[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API
beliefer commented on code in PR #40277: URL: https://github.com/apache/spark/pull/40277#discussion_r1125854126

## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ##

```
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;
```

Review Comment:
But the transform path is very different from DataSource.
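[Editor's sketch] For context, the API this message models is the predicate-partitioned `DataFrameReader.jdbc` overload that already exists in Spark; a hedged usage sketch, where the URL, table, predicates, and credentials are illustrative:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object PredicateJdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "test")       // illustrative credentials
    props.setProperty("password", "secret")

    // One input partition per predicate; each string becomes a WHERE clause.
    val predicates = Array("dt < '2023-01-01'", "dt >= '2023-01-01'")
    val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", "events", predicates, props)
    df.printSchema()

    spark.stop()
  }
}
```

The `repeated string predicates` field carries exactly this per-partition WHERE-clause list, which is why it needs a dedicated proto message rather than the plain DataSource options map.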
[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType
LuciferYang commented on code in PR #40218: URL: https://github.com/apache/spark/pull/40218#discussion_r1125854357

## connector/connect/common/src/main/protobuf/spark/connect/expressions.proto: ##

```
@@ -189,6 +190,11 @@ message Expression {
       int32 days = 2;
       int64 microseconds = 3;
     }
+
+    message Array {
+      DataType elementType = 1;
+      repeated Literal element = 2;
```

Review Comment:
Thanks for your confirmation
[GitHub] [spark] zhengruifeng commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
zhengruifeng commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1455388960

I guess we will need to rewrite the lambda function in the Spark Connect planner.

cc @ueshin as well, since the existing implementation follows the fix in https://github.com/apache/spark/pull/32523
[GitHub] [spark] huangxiaopingRD closed pull request #40196: [SPARK-42603][SQL] Set spark.sql.legacy.createHiveTableByDefault to false.
huangxiaopingRD closed pull request #40196: [SPARK-42603][SQL] Set spark.sql.legacy.createHiveTableByDefault to false. URL: https://github.com/apache/spark/pull/40196
[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType
LuciferYang commented on code in PR #40218: URL: https://github.com/apache/spark/pull/40218#discussion_r1125852404

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala: ##

```
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+    toLiteralProtoBuilder(litera
```
[GitHub] [spark] beliefer commented on pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter
beliefer commented on PR #40291: URL: https://github.com/apache/spark/pull/40291#issuecomment-1455384866

@hvanhovell It seems there is no way to add test cases.
[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
beliefer commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1455384317

@hvanhovell After my test, `python/run-tests --testnames 'pyspark.sql.connect.dataframe'` does not pass.
[GitHub] [spark] beliefer opened a new pull request, #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter
beliefer opened a new pull request, #40291: URL: https://github.com/apache/spark/pull/40291

### What changes were proposed in this pull request?

Currently, the connect project has a new `DataFrameWriter` API corresponding to the Spark `DataFrameWriter` API, but the connect `DataFrameWriter` is missing the jdbc API.

### Why are the changes needed?

This PR tries to add JDBC to `DataFrameWriter`.

### Does this PR introduce _any_ user-facing change?

'No'. New feature.

### How was this patch tested?

@hvanhovell It seems there is no way to add test cases.
[GitHub] [spark] Yikf commented on pull request #40290: [SPARK-42478][SQL][3.3] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on PR #40290: URL: https://github.com/apache/spark/pull/40290#issuecomment-1455380079

cc @cloud-fan @dongjoon-hyun
[GitHub] [spark] Yikf commented on pull request #40289: [SPARK-42478][SQL][3.2] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on PR #40289: URL: https://github.com/apache/spark/pull/40289#issuecomment-1455379959

cc @cloud-fan @dongjoon-hyun
[GitHub] [spark] Yikf opened a new pull request, #40290: [SPARK-42478][SQL][3.3] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf opened a new pull request, #40290: URL: https://github.com/apache/spark/pull/40290

This is a backport of https://github.com/apache/spark/pull/40064 for branch-3.3

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory.

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made the MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization issue: JobID is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA
[GitHub] [spark] Yikf opened a new pull request, #40289: [SPARK-42478][SQL][3.2] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf opened a new pull request, #40289: URL: https://github.com/apache/spark/pull/40289

This is a backport of https://github.com/apache/spark/pull/40064

### What changes were proposed in this pull request?

Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory.

### Why are the changes needed?

[SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made the MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization issue: JobID is non-serializable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA
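[Editor's sketch] The pattern behind both backports, in a hedged sketch where the class and method names are illustrative stand-ins, not the actual FileWriterFactory code: ship the plain trackerId string to executors and rebuild the non-serializable `JobID` there.

```scala
import org.apache.hadoop.mapreduce.JobID

// Only the String jobTrackerId is captured by the serialized factory;
// the non-serializable Hadoop JobID is rebuilt per partition on the executor.
case class WriterFactorySketch(jobTrackerId: String) extends Serializable {
  def createJobId(partitionId: Int): JobID = new JobID(jobTrackerId, partitionId)
}

// Driver:   val factory = WriterFactorySketch("20230306120000") // serialized into tasks
// Executor: factory.createJobId(partitionId)
```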
[GitHub] [spark] wangyum commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering
wangyum commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1455371977

@EnricoMi It seems it will remove the table location if a `java.lang.ArithmeticException` is thrown after this change. How to reproduce:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.catalyst.TableIdentifier

sql("CREATE TABLE IF NOT EXISTS spark32_overwrite(amt1 int) STORED AS ORC")
sql("CREATE TABLE IF NOT EXISTS spark32_overwrite2(amt1 long) STORED AS ORC")
sql("INSERT OVERWRITE TABLE spark32_overwrite2 select 644164")
sql("set spark.sql.ansi.enabled=true")

val loc = spark.sessionState.catalog.getTableMetadata(TableIdentifier("spark32_overwrite")).location
val fs = FileSystem.get(loc, spark.sparkContext.hadoopConfiguration)
println("Location exists: " + fs.exists(new Path(loc)))
try {
  sql("INSERT OVERWRITE TABLE spark32_overwrite select amt1 from " +
    "(select cast(amt1 as int) as amt1 from spark32_overwrite2 distribute by amt1)")
} finally {
  println("Location exists: " + fs.exists(new Path(loc)))
}
```
[GitHub] [spark] LuciferYang commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType
LuciferYang commented on code in PR #40218: URL: https://github.com/apache/spark/pull/40218#discussion_r1125837371

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala: ##

```
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+    toLiteralProtoBuilder(litera
```
[GitHub] [spark] anishshri-db commented on pull request #40273: [SPARK-42668][SS] Catch exception while trying to close compressed stream in HDFSStateStoreProvider abort
anishshri-db commented on PR #40273: URL: https://github.com/apache/spark/pull/40273#issuecomment-1455371384

> Mind retriggering the build, please? Probably the simplest way to do it is pushing an empty commit. You can retrigger the build in your fork but it won't be reflected here.

Sure, done
[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API
hvanhovell commented on code in PR #40277: URL: https://github.com/apache/spark/pull/40277#discussion_r1125835789

## connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
## @@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   Can we just put the predicates into the DataSource message?
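The message under review mirrors the `DataFrameReader.jdbc` overload that takes an explicit predicate per partition. For context, a short usage sketch of that Scala API; the URL, table name, and credentials here are hypothetical:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val props = new Properties()
props.setProperty("user", "sa") // hypothetical credentials
props.setProperty("password", "")

// Each predicate becomes the WHERE clause of one partition's query,
// so this read produces exactly two partitions.
val people = spark.read.jdbc(
  "jdbc:h2:mem:testdb",           // hypothetical JDBC URL
  "people",                       // hypothetical table name
  Array("age < 30", "age >= 30"), // one partition per predicate
  props)
```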
[GitHub] [spark] hvanhovell commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
hvanhovell commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1455366786

@HyukjinKwon @zhengruifeng the rationale for this change is that the analyzer takes care of making lambda variables unique.
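For context on where these variables show up: lambda variables are introduced by higher-order functions such as `transform`, and nothing stops two lambdas from binding the same client-side name. A minimal sketch, assuming an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.functions._

// Both lambdas bind a parameter conventionally named `x` on the client side;
// the analyzer renames each binding internally, so even nesting is safe.
val df = spark.range(1).select(
  transform(array(lit(1), lit(2)), x => x + 1).as("outer"),
  transform(array(array(lit(1))), xs => transform(xs, x => x * 2)).as("nested"))
```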
[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on PR #40064: URL: https://github.com/apache/spark/pull/40064#issuecomment-1455364691

> @Yikf can you help to open a backport PR for 3.2/3.3? Thanks!

Sure
[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
beliefer commented on PR #39091: URL: https://github.com/apache/spark/pull/39091#issuecomment-1455360592 @hvanhovell @grundprinzip @HyukjinKwon @zhengruifeng @amaliujia Thank you.
[GitHub] [spark] amaliujia commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect
amaliujia commented on PR #40228: URL: https://github.com/apache/spark/pull/40228#issuecomment-1455359011 @hvanhovell waiting for CI
[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions
beliefer commented on PR #40275: URL: https://github.com/apache/spark/pull/40275#issuecomment-1455357573 @hvanhovell @LuciferYang Thank you.
[GitHub] [spark] hvanhovell commented on pull request #40228: [SPARK-41874][CONNECT][PYTHON] Support SameSemantics in Spark Connect
hvanhovell commented on PR #40228: URL: https://github.com/apache/spark/pull/40228#issuecomment-1455352755 @amaliujia can you update the PR?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions
hvanhovell commented on code in PR #40217: URL: https://github.com/apache/spark/pull/40217#discussion_r1125825287

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionSuite.scala:
## @@ -0,0 +1,377 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.connect.client.util.QueryTest
+
+class DataFrameNaFunctionSuite extends QueryTest {

Review Comment:
   Is this a line for line copy of the original test?
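For context, a minimal usage sketch of the `DataFrameNaFunctions` API this suite exercises, assuming an active `SparkSession` named `spark`:

```scala
import spark.implicits._

val df = Seq(("1", null), (null, "b")).toDF("id", "name")

df.na.drop()                           // drop rows containing any null
df.na.fill("0", Seq("id"))             // replace nulls in "id" with "0"
df.na.replace("name", Map("b" -> "B")) // value-level replacement
```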
[GitHub] [spark] hvanhovell commented on pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions
hvanhovell commented on PR #40217: URL: https://github.com/apache/spark/pull/40217#issuecomment-1455351159 @panbingkun can you update the CompatibilitySuite?
[GitHub] [spark] LuciferYang commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17
LuciferYang commented on PR #40254: URL: https://github.com/apache/spark/pull/40254#issuecomment-1455349598 Thanks @srowen
[GitHub] [spark] itholic commented on pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.
itholic commented on PR #40288: URL: https://github.com/apache/spark/pull/40288#issuecomment-1455348864 cc @tgravescs since this is a Spark Connect introduction, including a note about the built-in authentication you mentioned in the JIRA ticket before.
[GitHub] [spark] hvanhovell commented on pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions
hvanhovell commented on PR #40217: URL: https://github.com/apache/spark/pull/40217#issuecomment-1455348717 @panbingkun can you update your PR?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType
hvanhovell commented on code in PR #40218: URL: https://github.com/apache/spark/pull/40218#discussion_r1125820525

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/LiteralProtoConverter.scala:
## @@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.expressions
+
+import java.lang.{Boolean => JBoolean, Byte => JByte, Character => JChar, Double => JDouble, Float => JFloat, Integer => JInteger, Long => JLong, Short => JShort}
+import java.math.{BigDecimal => JBigDecimal}
+import java.sql.{Date, Timestamp}
+import java.time._
+
+import com.google.protobuf.ByteString
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.util.{DateTimeUtils, IntervalUtils}
+import org.apache.spark.sql.connect.client.unsupported
+import org.apache.spark.sql.types.{DayTimeIntervalType, Decimal, DecimalType, YearMonthIntervalType}
+import org.apache.spark.unsafe.types.CalendarInterval
+
+object LiteralProtoConverter {
+
+  private lazy val nullType =
+    proto.DataType.newBuilder().setNull(proto.DataType.NULL.getDefaultInstance).build()
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal.Builder`.
+   *
+   * @return
+   *   proto.Expression.Literal.Builder
+   */
+  @scala.annotation.tailrec
+  def toLiteralProtoBuilder(literal: Any): proto.Expression.Literal.Builder = {
+    val builder = proto.Expression.Literal.newBuilder()
+
+    def decimalBuilder(precision: Int, scale: Int, value: String) = {
+      builder.getDecimalBuilder.setPrecision(precision).setScale(scale).setValue(value)
+    }
+
+    def calendarIntervalBuilder(months: Int, days: Int, microseconds: Long) = {
+      builder.getCalendarIntervalBuilder
+        .setMonths(months)
+        .setDays(days)
+        .setMicroseconds(microseconds)
+    }
+
+    def arrayBuilder(array: Array[_]) = {
+      val ab = builder.getArrayBuilder
+        .setElementType(componentTypeToProto(array.getClass.getComponentType))
+      array.foreach(x => ab.addElement(toLiteralProto(x)))
+      ab
+    }
+
+    literal match {
+      case v: Boolean => builder.setBoolean(v)
+      case v: Byte => builder.setByte(v)
+      case v: Short => builder.setShort(v)
+      case v: Int => builder.setInteger(v)
+      case v: Long => builder.setLong(v)
+      case v: Float => builder.setFloat(v)
+      case v: Double => builder.setDouble(v)
+      case v: BigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: JBigDecimal =>
+        builder.setDecimal(decimalBuilder(v.precision, v.scale, v.toString))
+      case v: String => builder.setString(v)
+      case v: Char => builder.setString(v.toString)
+      case v: Array[Char] => builder.setString(String.valueOf(v))
+      case v: Array[Byte] => builder.setBinary(ByteString.copyFrom(v))
+      case v: collection.mutable.WrappedArray[_] => toLiteralProtoBuilder(v.array)
+      case v: LocalDate => builder.setDate(v.toEpochDay.toInt)
+      case v: Decimal =>
+        builder.setDecimal(decimalBuilder(Math.max(v.precision, v.scale), v.scale, v.toString))
+      case v: Instant => builder.setTimestamp(DateTimeUtils.instantToMicros(v))
+      case v: Timestamp => builder.setTimestamp(DateTimeUtils.fromJavaTimestamp(v))
+      case v: LocalDateTime => builder.setTimestampNtz(DateTimeUtils.localDateTimeToMicros(v))
+      case v: Date => builder.setDate(DateTimeUtils.fromJavaDate(v))
+      case v: Duration => builder.setDayTimeInterval(IntervalUtils.durationToMicros(v))
+      case v: Period => builder.setYearMonthInterval(IntervalUtils.periodToMonths(v))
+      case v: Array[_] => builder.setArray(arrayBuilder(v))
+      case v: CalendarInterval =>
+        builder.setCalendarInterval(calendarIntervalBuilder(v.months, v.days, v.microseconds))
+      case null => builder.setNull(nullType)
+      case _ => unsupported(s"literal $literal not supported (yet).")
+    }
+  }
+
+  /**
+   * Transforms literal value to the `proto.Expression.Literal`.
+   *
+   * @return
+   *   proto.Expression.Literal
+   */
+  def toLiteralProto(literal: Any): proto.Expression.Literal =
+    toLiteralProtoBuilder(literal
[GitHub] [spark] hvanhovell commented on a diff in pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType
hvanhovell commented on code in PR #40218: URL: https://github.com/apache/spark/pull/40218#discussion_r1125817796

## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
## @@ -130,4 +138,61 @@ object LiteralValueProtoConverter {
       case o => throw new Exception(s"Unsupported value type: $o")
     }
   }
+
+  private def toArrayData(array: proto.Expression.Literal.Array): Any = {
+    def makeArrayData[T](converter: proto.Expression.Literal => T)(implicit
+        tag: ClassTag[T]): Array[T] = {

Review Comment:
   yes
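The implicit `ClassTag` in the quoted `makeArrayData` is what lets generic code allocate a properly typed (possibly primitive) array at runtime; after erasure, code without one could only build `Array[AnyRef]`. A self-contained sketch of the same idea (the names here are stand-ins, not the PR's code):

```scala
import scala.reflect.ClassTag

// The ClassTag carries the runtime element class, so Array.tabulate can
// allocate an int[] for Int, a String[] for String, and so on.
def makeArrayData[T](n: Int)(converter: Int => T)(implicit tag: ClassTag[T]): Array[T] =
  Array.tabulate(n)(converter)

val ints: Array[Int] = makeArrayData(3)(_ * 2)         // Array(0, 2, 4), an int[] at runtime
val strs: Array[String] = makeArrayData(2)(_.toString) // Array("0", "1")
```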
[GitHub] [spark] itholic opened a new pull request, #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.
itholic opened a new pull request, #40288: URL: https://github.com/apache/spark/pull/40288

### What changes were proposed in this pull request?

This PR proposes to add a brief description of Spark Connect to the PySpark main page.

https://user-images.githubusercontent.com/44108233/223006571-42fccf6f-cb7b-479c-9f11-5c246b442fac.png

### Why are the changes needed?

Spark Connect is a new and experimental feature of PySpark that enables Spark to run anywhere and work with different data stores and services. Adding a brief description of Spark Connect to the main page will inform users about this feature and how it can benefit their use cases. Additionally, the note about the experimental nature of this feature will help users understand the potential risks of using Spark Connect in production environments.

### Does this PR introduce _any_ user-facing change?

No for API usage, but this PR introduces user-facing documentation by adding a new section about Spark Connect to the PySpark main page.

### How was this patch tested?

To ensure that the documentation builds correctly, the `make clean html` command was executed to test the build process.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40270: [SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.
hvanhovell commented on code in PR #40270: URL: https://github.com/apache/spark/pull/40270#discussion_r1125815690

## connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
## @@ -781,3 +782,10 @@ message FrameMap {
   CommonInlineUserDefinedFunction func = 2;
 }
+
+message WithSequenceColumn {

Review Comment:
   Well, my argument against this is that it is just a project with a specific type of expression attached to it. There is no need to complicate the protocol. As for what this does (and this comment is more aimed at the original PR), three things:
   - The IDs are not stable at all. They are order based, and we basically do not guarantee that the order is stable during processing.
   - The IDs can contain gaps or duplicates if any of the shuffles of the input contains a non-deterministic column, and a retry of one of the input tasks/stages occurs. This is a result of the double scanning that RDD.zipWithIndex requires.
   - Finally, two scans can be slow.
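The double-scan point refers to `RDD.zipWithIndex`, which (per this discussion) underlies the sequence-column implementation: it first runs a job to compute per-partition sizes, then assigns order-based ids on a second pass. A minimal sketch, assuming an active `SparkSession` named `spark`:

```scala
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"), numSlices = 2)

// With more than one partition, zipWithIndex triggers an extra job up front
// to size each partition; the ids then follow the current partition order,
// which is exactly why they are neither stable nor retry-safe.
val withIds = rdd.zipWithIndex() // ("a", 0), ("b", 1), ("c", 2)
```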
[GitHub] [spark] srowen commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17
srowen commented on PR #40254: URL: https://github.com/apache/spark/pull/40254#issuecomment-1455341403 Merged to master
[GitHub] [spark] srowen closed pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17
srowen closed pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17 URL: https://github.com/apache/spark/pull/40254
[GitHub] [spark] cloud-fan commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
cloud-fan commented on PR #40064: URL: https://github.com/apache/spark/pull/40064#issuecomment-1455335925 @Yikf can you help to open a backport PR for 3.2/3.3? Thanks!
[GitHub] [spark] LuciferYang commented on pull request #40254: [SPARK-42654][BUILD] Upgrade dropwizard metrics 4.2.17
LuciferYang commented on PR #40254: URL: https://github.com/apache/spark/pull/40254#issuecomment-1455328473 friendly ping @srowen
[GitHub] [spark] hvanhovell closed pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
hvanhovell closed pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe` URL: https://github.com/apache/spark/pull/39091
[GitHub] [spark] hvanhovell commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
hvanhovell commented on PR #39091: URL: https://github.com/apache/spark/pull/39091#issuecomment-1455327845 Merging to master/3.4
[GitHub] [spark] LuciferYang commented on pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`
LuciferYang commented on PR #40285: URL: https://github.com/apache/spark/pull/40285#issuecomment-1455325164 Thanks @wangyum
[GitHub] [spark] LuciferYang commented on pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions
LuciferYang commented on PR #40255: URL: https://github.com/apache/spark/pull/40255#issuecomment-1455324716 Thanks @hvanhovell @HyukjinKwon @zhengruifeng @amaliujia
[GitHub] [spark] hvanhovell closed pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions
hvanhovell closed pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions URL: https://github.com/apache/spark/pull/40255
[GitHub] [spark] hvanhovell commented on pull request #40255: [SPARK-42558][CONNECT] Implement `DataFrameStatFunctions` except `bloomFilter` functions
hvanhovell commented on PR #40255: URL: https://github.com/apache/spark/pull/40255#issuecomment-1455323028 Merging.
[GitHub] [spark] hvanhovell closed pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions
hvanhovell closed pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions URL: https://github.com/apache/spark/pull/40275
[GitHub] [spark] hvanhovell commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions
hvanhovell commented on PR #40275: URL: https://github.com/apache/spark/pull/40275#issuecomment-1455321694 Merging.
[GitHub] [spark] hvanhovell closed pull request #40279: [MINOR][CONNECT] Remove unused protobuf imports to eliminate build warnings
hvanhovell closed pull request #40279: [MINOR][CONNECT] Remove unused protobuf imports to eliminate build warnings URL: https://github.com/apache/spark/pull/40279
[GitHub] [spark] hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema
hvanhovell closed pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema URL: https://github.com/apache/spark/pull/40280
[GitHub] [spark] hvanhovell commented on a diff in pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema
hvanhovell commented on code in PR #40280: URL: https://github.com/apache/spark/pull/40280#discussion_r1125800378

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
## @@ -115,7 +115,7 @@ class SparkSession private[sql] (
   private def createDataset[T](encoder: AgnosticEncoder[T], data: Iterator[T]): Dataset[T] = {
     newDataset(encoder) { builder =>
       val localRelationBuilder = builder.getLocalRelationBuilder
-        .setSchema(encoder.schema.catalogString)

Review Comment:
   json is ok
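For context on why "json is ok": `catalogString` is a display form that drops information such as nullability, while `json` round-trips the schema exactly. A small sketch:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("m", MapType(StringType, ArrayType(IntegerType)), nullable = false)))

schema.catalogString           // "struct<m:map<string,array<int>>>", nullability is lost
DataType.fromJson(schema.json) // the JSON form reconstructs the schema exactly
```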
[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1125790750

## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
## @@ -232,6 +232,13 @@ private[spark] class DAGScheduler(
     sc.getConf.getInt("spark.stage.maxConsecutiveAttempts",
       DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS)
+
+  /**
+   * Max stage attempts allowed before a stage is aborted.
+   */
+  private[scheduler] val maxStageAttempts: Int = {
+    Math.max(maxConsecutiveStageAttempts, sc.getConf.get(config.STAGE_MAX_ATTEMPTS))

Review Comment:
   Modify this suitably to return an `Option` if using the opt-in config.
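One way to read the suggestion: when the new config is absent, the limit should be `None` and today's behavior (no count-based abort) is preserved. A self-contained sketch of those semantics with stand-in names, not Spark's actual fields:

```scala
// None means "not configured": never abort on the attempt count alone.
def maxStageAttempts(configured: Option[Int], maxConsecutive: Int): Option[Int] =
  configured.map(limit => math.max(maxConsecutive, limit))

def shouldAbort(nextAttemptId: Int, limit: Option[Int]): Boolean =
  limit.exists(nextAttemptId >= _)

assert(!shouldAbort(100, maxStageAttempts(None, 8)))   // unconfigured: unlimited, as today
assert(shouldAbort(16, maxStageAttempts(Some(16), 8))) // configured: abort at the limit
```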
[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1125790378

## core/src/main/scala/org/apache/spark/internal/config/package.scala:
## @@ -2479,4 +2479,14 @@ package object config {
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)
+
+  private[spark] val STAGE_MAX_ATTEMPTS =
+    ConfigBuilder("spark.stage.maxAttempts")
+      .doc("The max attempts for a stage, the spark job will be aborted if any of its stages is " +
+        "resubmitted multiple times beyond the limitation. The value should be no less " +
+        "than `spark.stage.maxConsecutiveAttempts` which defines the max attempts for " +
+        "fetch failures.")
+      .version("3.5.0")
+      .intConf
+      .createWithDefault(16)

Review Comment:
   Since this is a behavior change, let us make this an optional parameter - and preserve current behavior when not configured (or make the default the int max value). We can change this to a more restrictive default value in a future release. Given cascading stage retries and deployments where decommissioning is not applicable (YARN, for example), particularly due to `INDETERMINATE` stages, this minimizes application failures.
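Making the entry opt-in would mean declaring it without a default; Spark's internal `ConfigBuilder` supports this via `createOptional`, which yields an `Option[Int]` on read. A sketch of that variant (a fragment of the quoted declaration, not the PR's actual code):

```scala
// Hypothetical opt-in declaration: reading an unset config yields None,
// which preserves the current unlimited-retry behavior.
private[spark] val STAGE_MAX_ATTEMPTS =
  ConfigBuilder("spark.stage.maxAttempts")
    .doc("Max attempts for a stage before aborting the job; unset preserves " +
      "the current behavior of no count-based limit.")
    .version("3.5.0")
    .intConf
    .createOptional
```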
[GitHub] [spark] beliefer opened a new pull request, #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names
beliefer opened a new pull request, #40287: URL: https://github.com/apache/spark/pull/40287

### What changes were proposed in this pull request?

UnresolvedNamedLambdaVariable does not need unique names in Python. We already did this for the Scala client, and it is good to have parity between the two implementations.

### Why are the changes needed?

Try to avoid unique names for UnresolvedNamedLambdaVariable.

### Does this PR introduce _any_ user-facing change?

'No'. New feature.

### How was this patch tested?

N/A
[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1125790378

## core/src/main/scala/org/apache/spark/internal/config/package.scala:
## @@ -2479,4 +2479,14 @@ package object config {
       .version("3.4.0")
       .booleanConf
       .createWithDefault(false)
+
+  private[spark] val STAGE_MAX_ATTEMPTS =
+    ConfigBuilder("spark.stage.maxAttempts")
+      .doc("The max attempts for a stage, the spark job will be aborted if any of its stages is " +
+        "resubmitted multiple times beyond the limitation. The value should be no less " +
+        "than `spark.stage.maxConsecutiveAttempts` which defines the max attempts for " +
+        "fetch failures.")
+      .version("3.5.0")
+      .intConf
+      .createWithDefault(16)

Review Comment:
   Since this is a behavior change, let us make this an optional parameter - and preserve current behavior when not configured. We can change this to a default value in a future release. Given cascading stage retries, particularly due to `INDETERMINATE` stages, this minimizes application failures.

## core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
## @@ -232,6 +232,13 @@ private[spark] class DAGScheduler(
     sc.getConf.getInt("spark.stage.maxConsecutiveAttempts",
       DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS)
+
+  /**
+   * Max stage attempts allowed before a stage is aborted.
+   */
+  private[scheduler] val maxStageAttempts: Int = {
+    Math.max(maxConsecutiveStageAttempts, sc.getConf.get(config.STAGE_MAX_ATTEMPTS))

Review Comment:
   Modify this suitably to return an `Option`.

## core/src/main/scala/org/apache/spark/scheduler/Stage.scala:
## @@ -70,6 +70,7 @@ private[scheduler] abstract class Stage(
   /** The ID to use for the next new attempt for this stage. */
   private var nextAttemptId: Int = 0
+  private[scheduler] def getNextAttemptId(): Int = nextAttemptId

Review Comment:
   `getNextAttemptId()` -> `getNextAttemptId`

## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
## @@ -406,12 +406,13 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     blockManagerMaster = spy(new MyBlockManagerMaster(sc.getConf))
     doNothing().when(blockManagerMaster).updateRDDBlockVisibility(any(), any())
     scheduler = new MyDAGScheduler(
+    scheduler = spy(new MyDAGScheduler(

Review Comment:
   This should cause a compilation failure - remove the prev line?
[GitHub] [spark] ulysses-you commented on pull request #40262: [SPARK-42651][SQL] Optimize global sort to driver sort
ulysses-you commented on PR #40262: URL: https://github.com/apache/spark/pull/40262#issuecomment-1455303198 cc @cloud-fan @viirya thank you
[GitHub] [spark] beliefer commented on a diff in pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
beliefer commented on code in PR #39091: URL: https://github.com/apache/spark/pull/39091#discussion_r1125777299

## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
## @@ -338,6 +340,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
+
+  private def transformCollectMetrics(rel: proto.CollectMetrics): LogicalPlan = {
+    val metrics = rel.getMetricsList.asScala.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getIsObservation) {

Review Comment:
   @hvanhovell is_observation has been removed.
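The `CollectMetrics` relation being planned here backs the `Dataset.observe` API. For context, a minimal usage sketch on the Scala side, assuming an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.Observation
import org.apache.spark.sql.functions._

val observation = Observation("stats")
val observed = spark.range(10).observe(
  observation, count(lit(1)).as("rows"), max(col("id")).as("max_id"))

observed.collect() // metrics are gathered as a side effect of the action
observation.get    // Map("rows" -> 10, "max_id" -> 9)
```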
[GitHub] [spark] beliefer commented on pull request #40275: [SPARK-42557][CONNECT] Add Broadcast to functions
beliefer commented on PR #40275: URL: https://github.com/apache/spark/pull/40275#issuecomment-1455280706 ping @HyukjinKwon @zhengruifeng @dongjoon-hyun
[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API
beliefer commented on PR #40277: URL: https://github.com/apache/spark/pull/40277#issuecomment-1455280396 ping @hvanhovell @HyukjinKwon @dongjoon-hyun cc @LuciferYang
[GitHub] [spark] beliefer commented on pull request #39091: [SPARK-41527][CONNECT][PYTHON] Implement `DataFrame.observe`
beliefer commented on PR #39091: URL: https://github.com/apache/spark/pull/39091#issuecomment-1455279364

> @beliefer can you please remove the is_observation code path? And take another look at the protocol. Otherwise I think it looks good.

The is_observation code path has been removed.
[GitHub] [spark] itholic commented on pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast
itholic commented on PR #40271: URL: https://github.com/apache/spark/pull/40271#issuecomment-1455275958 Looks good otherwise.
[GitHub] [spark] itholic commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast
itholic commented on code in PR #40271: URL: https://github.com/apache/spark/pull/40271#discussion_r1125771590

## python/pyspark/sql/tests/test_functions.py:
## @@ -1268,6 +1268,12 @@ def test_bucket(self):
             message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
         )
+
+    def test_no_cast(self):

Review Comment:
   How about adding a `test_cast` with practical cases? It seems like the test for `functions.cast` is missing (and that's also the root reason why we haven't noticed this is wrong until now).
[GitHub] [spark] HyukjinKwon closed pull request #40281: [SPARK-41497][CORE][Follow UP]Modify config `spark.rdd.cache.visibilityTracking.enabled` support version to 3.5.0
HyukjinKwon closed pull request #40281: [SPARK-41497][CORE][Follow UP]Modify config `spark.rdd.cache.visibilityTracking.enabled` support version to 3.5.0 URL: https://github.com/apache/spark/pull/40281
[GitHub] [spark] HyukjinKwon commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list
HyukjinKwon commented on PR #40282: URL: https://github.com/apache/spark/pull/40282#issuecomment-1455270795 cc @MaxGekk and @srielau
[GitHub] [spark] HyukjinKwon closed pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2
HyukjinKwon closed pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2 URL: https://github.com/apache/spark/pull/40284
[GitHub] [spark] HyukjinKwon commented on pull request #40284: [SPARK-42674][BUILD] Upgrade scalafmt from 3.7.1 to 3.7.2
HyukjinKwon commented on PR #40284: URL: https://github.com/apache/spark/pull/40284#issuecomment-1455270404 Merged to master.
[GitHub] [spark] github-actions[bot] closed pull request #36265: [SPARK-38951][SQL] Aggregate aliases override field names in ResolveAggregateFunctions
github-actions[bot] closed pull request #36265: [SPARK-38951][SQL] Aggregate aliases override field names in ResolveAggregateFunctions URL: https://github.com/apache/spark/pull/36265
[GitHub] [spark] github-actions[bot] commented on pull request #38736: [SPARK-41214][SQL] - SQL Metrics are missing from Spark UI when AQE for Cached DataFrame is enabled
github-actions[bot] commented on PR #38736: URL: https://github.com/apache/spark/pull/38736#issuecomment-1455262719 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] wangyum commented on pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`
wangyum commented on PR #40285: URL: https://github.com/apache/spark/pull/40285#issuecomment-1455258629 Merged to master and branch-3.4.
[GitHub] [spark] wangyum closed pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view`
wangyum closed pull request #40285: [SPARK-42675][CONNECT][TESTS] Drop temp view after test `test temp view` URL: https://github.com/apache/spark/pull/40285
[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast
FurcyPin commented on code in PR #40271: URL: https://github.com/apache/spark/pull/40271#discussion_r1125698656

python/pyspark/sql/functions.py:
```diff
@@ -22,20 +22,10 @@
 import sys
 import functools
 import warnings
-from typing import (
-    Any,
-    cast,
```

Review Comment: In the end, I went for `from typing import cast as _cast`, which makes the intent even more explicit, I think.
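For context, the pattern mentioned in the comment looks roughly like the minimal sketch below. The module and helper names are hypothetical, not the actual `pyspark.sql.functions` source:

```python
# hypothetical module illustrating the private-import pattern:
# the underscored alias keeps typing.cast out of the module's public namespace
from typing import cast as _cast


def first_as_int(values: list) -> int:
    # _cast only narrows the static type for checkers; no runtime conversion happens
    return _cast(int, values[0])
```

Because the alias starts with an underscore, users who type `module.cast` (or IDEs that autocomplete the module's names) no longer see `typing.cast` by accident.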
[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast
FurcyPin commented on code in PR #40271: URL: https://github.com/apache/spark/pull/40271#discussion_r1125695676

python/pyspark/sql/functions.py:
```diff
@@ -22,20 +22,10 @@
 import sys
 import functools
 import warnings
-from typing import (
-    Any,
-    cast,
```

Review Comment: I agree that the change seemed quite cumbersome. It made me wish Python had some kind of "private import" keyword to handle such cases more easily. I agree with you that `cast` is the only name that might be confusing (perhaps `overload` too, but all the other names start with an uppercase letter). The `functions` module feels a little special to me because it is the module I use the most as a Spark user; it's definitely a public API. The 201 other modules don't require such a change. It's out of the scope of this PR, but perhaps for the long term you could consider reorganizing this module [the same way as I did in one of my projects](https://github.com/FurcyPin/spark-frame/blob/main/spark_frame/transformations.py), which had two advantages:
- the code of each method was isolated in a separate file (that would prevent having a 10,000-line file)
- there was no import pollution

For now, I'll do as you suggest: only handle `typing.cast` as a special case and add a unit test to make sure it does not get imported again.
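The guard test mentioned at the end could look roughly like the following. This is a sketch under the assumption that the requirement is simply that `pyspark.sql.functions` never re-exposes `typing.cast`; the test class and method names are made up:

```python
import typing
import unittest

from pyspark.sql import functions as F


class TypingCastNotExposedTest(unittest.TestCase):
    def test_cast_is_not_typing_cast(self):
        # `functions.cast` must not resolve to typing.cast; it may be absent entirely,
        # in which case getattr returns None and the assertion still holds
        self.assertIsNot(getattr(F, "cast", None), typing.cast)


if __name__ == "__main__":
    unittest.main()
```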
[GitHub] [spark] itholic commented on a diff in pull request #40236: [SPARK-38735][SQL][TESTS] Add tests for the error class: INTERNAL_ERROR
itholic commented on code in PR #40236: URL: https://github.com/apache/spark/pull/40236#discussion_r1125682909

sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:
```diff
@@ -765,6 +770,58 @@ class QueryExecutionErrorsSuite
       )
     }
   }
+
+  test("INTERNAL_ERROR: Calling eval on Unevaluable expression") {
+    val e = intercept[SparkException] {
+      Parameter("foo").eval()
+    }
+    checkError(
+      exception = e,
+      errorClass = "INTERNAL_ERROR",
+      parameters = Map("message" -> "Cannot evaluate expression: parameter(foo)"),
+      sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling doGenCode on unresolved") {
+    val e = intercept[SparkException] {
+      val ctx = new CodegenContext
+      Grouping(Parameter("foo")).genCode(ctx)
+    }
+    checkError(
+      exception = e,
+      errorClass = "INTERNAL_ERROR",
+      parameters = Map(
+        "message" -> ("Cannot generate code for expression: " +
+          "grouping(parameter(foo))")),
+      sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Calling terminate on UnresolvedGenerator") {
+    val e = intercept[SparkException] {
+      UnresolvedGenerator(FunctionIdentifier("foo"), Seq.empty).terminate()
+    }
+    checkError(
+      exception = e,
+      errorClass = "INTERNAL_ERROR",
+      parameters = Map("message" -> "Cannot terminate expression: 'foo()"),
+      sqlState = "XX000")
+  }
+
+  test("INTERNAL_ERROR: Initializing JavaBean with non existing method") {
+    val e = intercept[SparkException] {
+      val initializeWithNonexistingMethod = InitializeJavaBean(
+        Literal.fromObject(new java.util.LinkedList[Int]),
+        Map("nonexistent" -> Literal(1)))
+      initializeWithNonexistingMethod.eval()
+    }
+    checkError(
+      exception = e,
+      errorClass = "INTERNAL_ERROR",
+      parameters = Map(
+        "message" -> ("""A method named "nonexistent" is not declared in """ +
+          "any enclosing class nor any supertype")),
+      sqlState = "XX000")
```

Review Comment: It's fine if it's already tested :-)
[GitHub] [spark] LuciferYang commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests
LuciferYang commented on PR #40274: URL: https://github.com/apache/spark/pull/40274#issuecomment-1455105130 There is another problem that needs to be confirmed, which may not be related to the current PR: if other Suites inherit `RemoteSparkSession`, they will share the same connect server, right? (`SparkConnectServerUtils` is an object, so `SparkConnect` will only be submitted once.)
[GitHub] [spark] ivoson opened a new pull request, #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
ivoson opened a new pull request, #40286: URL: https://github.com/apache/spark/pull/40286

### What changes were proposed in this pull request?
Currently a stage will be resubmitted in a few scenarios:
1. A task fails with `FetchFailed`, which triggers a stage re-submit;
2. A barrier task fails;
3. Shuffle data is lost because an executor/host was decommissioned.

For the first two scenarios, the config `spark.stage.maxConsecutiveAttempts` limits the number of retries. For the third scenario, there is a risk of infinite retry: if executors hosting the shuffle data from successful tasks keep getting killed or lost, the stage will be re-run again and again. To avoid this risk, this PR proposes a new config, `spark.stage.maxAttempts`, to limit the overall maximum number of attempts for each stage; the stage will be aborted once its retry count exceeds the limit.

### Why are the changes needed?
To avoid the risk of infinite stage retries.

### Does this PR introduce _any_ user-facing change?
Yes: a limit on stage retries is added, so jobs may fail if a stage needs to retry more times than the limit allows.

### How was this patch tested?
Added a new unit test.
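For illustration only, setting both limits from PySpark might look like the sketch below. `spark.stage.maxConsecutiveAttempts` is an existing Spark config; the key `spark.stage.maxAttempts` is taken from this PR's description and should be treated as an assumption until the change is merged and documented:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stage-retry-limits")
    # existing limit on consecutive failed attempts (FetchFailed / barrier failures)
    .config("spark.stage.maxConsecutiveAttempts", "4")
    # proposed overall cap on attempts per stage, covering decommission-driven reruns
    .config("spark.stage.maxAttempts", "8")
    .getOrCreate()
)
```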