[GitHub] [spark] itholic commented on pull request #42953: [SPARK-45185][BUILD][PYTHON] Ignore type check for preventing unexpected linter failure

2023-09-15 Thread via GitHub


itholic commented on PR #42953:
URL: https://github.com/apache/spark/pull/42953#issuecomment-1722146432

   cc @zhenglaizhang @HyukjinKwon @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #42953: [SPARK-45185][BUILD][PYTHON] Ignore type check for preventing unexpected linter failure

2023-09-15 Thread via GitHub


itholic opened a new pull request, #42953:
URL: https://github.com/apache/spark/pull/42953

   
   
   ### What changes were proposed in this pull request?
   
   The current Python linter in CI is failing due to an unexpected mypy check failure, as below:
   
   https://github.com/apache/spark/assets/44108233/5f293178-4ffa-4dd9-9c22-cd91f1970365
   
   The linter complains that `_process_plot_format` is undefined, but it works fine in the actual code base without any warning:
   
   ```python
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
         /_/
   
   Using Python version 3.9.17 (main, Jul  5 2023 15:35:09)
   Spark context Web UI available at http://172.30.1.51:4040
   Spark context available as 'sc' (master = local[*], app id = local-1694843335414).
   SparkSession available as 'spark'.
   >>> from matplotlib.axes._base import _process_plot_format
   ```
   
   ### Why are the changes needed?
   
   To fix the Python linter failure.
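   A common way to address this class of false positive (an assumption about the general technique, not necessarily the exact patch in this PR) is a per-line `# type: ignore[...]` comment. Mypy honors it, and it is inert at runtime, so only the flagged line is exempted while the rest of the module stays type-checked:
   
   ```python
   # Hypothetical sketch: a targeted ignore comment silences only the flagged
   # import; runtime behavior is unchanged because the comment is inert.
   from math import sqrt  # type: ignore[attr-defined]  # stand-in for the matplotlib import
   
   
   def norm(x: float, y: float) -> float:
       # The rest of the module stays fully type-checked.
       return sqrt(x * x + y * y)
   
   
   print(norm(3.0, 4.0))
   ```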
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   





[GitHub] [spark] itholic commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

2023-09-15 Thread via GitHub


itholic commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r1327908926


##
python/pyspark/pandas/frame.py:
##
@@ -1321,11 +1323,76 @@ def applymap(self, func: Callable[[Any], Any]) -> "DataFrame":
         0   1.000000   4.494400
         1  11.262736  20.857489
         """
+        warnings.warn(
+            "DataFrame.applymap has been deprecated. Use DataFrame.map instead", FutureWarning
+        )
 
         # TODO: We can implement shortcut theoretically since it creates new DataFrame
         #  anyway and we don't have to worry about operations on different DataFrames.
         return self._apply_series_op(lambda psser: psser.apply(func))
 
+    def map(self, func: Callable[[Any], Any]) -> "DataFrame":
+        """
+        Apply a function to a Dataframe elementwise.
+
+        This method applies a function that accepts and returns a scalar
+        to every element of a DataFrame.
+
+        .. versionadded:: 4.0.0
+            DataFrame.applymap was deprecated and renamed to DataFrame.map.
+
+        .. note:: this API executes the function once to infer the type which is
+            potentially expensive, for instance, when the dataset is created after
+            aggregations or sorting.
+
+            To avoid this, specify return type in ``func``, for instance, as below:
+
+            >>> def square(x) -> np.int32:
+            ...     return x ** 2
+
+            pandas-on-Spark uses return type hints and does not try to infer the type.
+
+        Parameters
+        ----------
+        func : callable
+            Python function returns a single value from a single value.
+
+        Returns
+        -------
+        DataFrame
+            Transformed DataFrame.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([[1, 2.12], [3.356, 4.567]])
+        >>> df
+               0      1
+        0  1.000  2.120
+        1  3.356  4.567
+
+        >>> def str_len(x) -> int:
+        ...     return len(str(x))
+        >>> df.map(str_len)
+           0  1
+        0  3  4
+        1  5  5
+
+        >>> def power(x) -> float:
+        ...     return x ** 2
+        >>> df.map(power)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+
+        You can omit type hints and let pandas-on-Spark infer its type.
+
+        >>> df.map(lambda x: x ** 2)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+        """
+        return self.applymap(func=func)

Review Comment:
   Oh, yeah we shouldn't call `applymap` here.
   
   Just applied the suggestion. Thanks!






[GitHub] [spark] itholic closed pull request #42946: [DO-NOT-MERGE] Test Jinja2 latest

2023-09-15 Thread via GitHub


itholic closed pull request #42946: [DO-NOT-MERGE] Test Jinja2 latest
URL: https://github.com/apache/spark/pull/42946





[GitHub] [spark] shuwang21 commented on a diff in pull request #42357: [SPARK-44306][YARN] Group FileStatus with few RPC calls within Yarn Client

2023-09-15 Thread via GitHub


shuwang21 commented on code in PR #42357:
URL: https://github.com/apache/spark/pull/42357#discussion_r1327901950


##
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala:
##
@@ -462,6 +462,30 @@ package object config extends Logging {
     .stringConf
     .createWithDefault("yarn.io/fpga")
 
+  private[spark] val YARN_CLIENT_STAT_CACHE_PRELOAD_ENABLED =
+    ConfigBuilder("spark.yarn.client.statCache.preload.enabled")
+      .doc("This configuration enables statCache to be preloaded at YARN client side. This feature " +
+        "analyzes the pattern of resources paths, and if multiple resources shared the same parent " +
+        "directory, a single listStatus will be invoked on the parent directory " +
+        "instead of multiple getFileStatus performed on each individual resources. " +
+        "If most resources from a small set of directories, this can substantially improve job " +
+        "submission time. Enabling this feature may potentially increase client memory overhead.")

Review Comment:
   Do you mean `listStatus` with `PathFilter`? I can try that. 
   
   
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#listStatus-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.PathFilter-
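The preload idea from the config doc above can be sketched in a few lines (a hypothetical helper, not the Spark/YARN implementation): group resource paths by parent directory, then plan one `listStatus` per shared parent instead of one `getFileStatus` per file:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_status_calls(paths):
    """Group resource paths by parent directory. Parents with more than one
    resource get a single listStatus call; lone resources keep an individual
    getFileStatus call. (Sketch only; names are hypothetical.)"""
    by_parent = defaultdict(list)
    for p in paths:
        by_parent[str(PurePosixPath(p).parent)].append(p)
    list_status = sorted(d for d, children in by_parent.items() if len(children) > 1)
    get_file_status = sorted(c[0] for c in by_parent.values() if len(c) == 1)
    return list_status, get_file_status
```

With this shape, N jars under one directory cost one RPC instead of N, which is where the claimed job-submission speedup comes from.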






[GitHub] [spark] panbingkun commented on pull request #42952: [SPARK-45184][SQL] Remove orphaned error class documents

2023-09-15 Thread via GitHub


panbingkun commented on PR #42952:
URL: https://github.com/apache/spark/pull/42952#issuecomment-1722118084

   cc @cloud-fan @MaxGekk 





[GitHub] [spark] panbingkun opened a new pull request, #42952: [SPARK-45184][SQL] Remove orphaned error class documents

2023-09-15 Thread via GitHub


panbingkun opened a new pull request, #42952:
URL: https://github.com/apache/spark/pull/42952

   ### What changes were proposed in this pull request?
   This PR aims to remove orphaned error class documents. It includes:
   1. Introducing an automated mechanism for removing orphaned files.
   2. Removing two orphaned error class documents: `sql-error-conditions-incompatible-data-to-table-error-class.md` and `sql-error-conditions-unsupported-temp-view-operation-error-class.md`
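   A minimal sketch of such an automated check (a hypothetical function; the actual mechanism lives in Spark's doc tooling) derives the expected document name from each registered error class and flags any document with no matching class:
   
   ```python
   def find_orphan_docs(doc_files, error_classes):
       """A doc page is orphaned if no registered error class maps to it.
       Assumes the Spark naming convention
       sql-error-conditions-<kebab-case-name>-error-class.md."""
       expected = {
           "sql-error-conditions-%s-error-class.md" % name.lower().replace("_", "-")
           for name in error_classes
       }
       return sorted(set(doc_files) - expected)
   ```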
   
   ### Why are the changes needed?
   - Keep error documents clear.
   - There are two ways orphaned error documents arise: 
   First, an error class is refactored, but the document for the already-refactored class is not deleted; 
   Second, an error class is renamed based on reviewer suggestions during code review, but the document for the previous name is committed unintentionally.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.





[GitHub] [spark] dongjoon-hyun commented on pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42908:
URL: https://github.com/apache/spark/pull/42908#issuecomment-1722105482

   Merged to master.





[GitHub] [spark] dongjoon-hyun closed pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] 
Deflake ReattachableExecuteSuite and increase retry buffer
URL: https://github.com/apache/spark/pull/42908





[GitHub] [spark] dongjoon-hyun commented on pull request #42947: [SPARK-45181][BUILD] Upgrade buf to v1.26.1

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42947:
URL: https://github.com/apache/spark/pull/42947#issuecomment-1722102594

   Thank you, @zhengruifeng .





[GitHub] [spark] dongjoon-hyun commented on pull request #42947: [SPARK-45181][BUILD] Upgrade buf to v1.26.1

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42947:
URL: https://github.com/apache/spark/pull/42947#issuecomment-1722102560

   Merged to master for Apache Spark 4.0.0.
   
   The only failure is a known one, `ReattachableExecuteSuite`.





[GitHub] [spark] dongjoon-hyun closed pull request #42947: [SPARK-45181][BUILD] Upgrade buf to v1.26.1

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42947: [SPARK-45181][BUILD] Upgrade buf to 
v1.26.1
URL: https://github.com/apache/spark/pull/42947





[GitHub] [spark] dongjoon-hyun commented on pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42918:
URL: https://github.com/apache/spark/pull/42918#issuecomment-1722102166

   Merged to master for Apache Spark 4.0.0. Thank you, @LuciferYang and all.





[GitHub] [spark] dongjoon-hyun closed pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala 
to 2.13.11
URL: https://github.com/apache/spark/pull/42918





[GitHub] [spark] dongjoon-hyun commented on pull request #42926: [SPARK-45164][PS] Remove deprecated `Index` APIs

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42926:
URL: https://github.com/apache/spark/pull/42926#issuecomment-1722101954

   Merged to master. Thank you, @itholic and all.





[GitHub] [spark] dongjoon-hyun closed pull request #42926: [SPARK-45164][PS] Remove deprecated `Index` APIs

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42926: [SPARK-45164][PS] Remove deprecated 
`Index` APIs
URL: https://github.com/apache/spark/pull/42926





[GitHub] [spark] dongjoon-hyun commented on pull request #42935: [SPARK-45173][UI] Remove some unnecessary sourceMapping files in UI

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42935:
URL: https://github.com/apache/spark/pull/42935#issuecomment-1722101732

   Merged to master for Apache Spark 4.0.0.





[GitHub] [spark] dongjoon-hyun closed pull request #42935: [SPARK-45173][UI] Remove some unnecessary sourceMapping files in UI

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42935: [SPARK-45173][UI] Remove some 
unnecessary sourceMapping files in UI
URL: https://github.com/apache/spark/pull/42935





[GitHub] [spark] dongjoon-hyun commented on pull request #42935: [SPARK-45173][UI] Remove some unnecessary sourceMapping files in UI

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42935:
URL: https://github.com/apache/spark/pull/42935#issuecomment-1722101587

   Thank you for the confirmation and updating the PR description.





[GitHub] [spark] Hisoka-X commented on pull request #42951: [SPARK-45078][SQL] Fix `array_insert` ImplicitCastInputTypes not work

2023-09-15 Thread via GitHub


Hisoka-X commented on PR #42951:
URL: https://github.com/apache/spark/pull/42951#issuecomment-1722088871

   cc @cloud-fan @dongjoon-hyun @Daniel-Davies





[GitHub] [spark] Hisoka-X opened a new pull request, #42951: [SPARK-45078][SQL] Fix `array_insert` ImplicitCastInputTypes not work

2023-09-15 Thread via GitHub


Hisoka-X opened a new pull request, #42951:
URL: https://github.com/apache/spark/pull/42951

   
   
   ### What changes were proposed in this pull request?
   This PR fixes `array_insert` throwing an exception when the array's element type differs from the inserted value's type, even though the call should succeed after an implicit cast.
   e.g.:
   ```sql
   select array_insert(array(1), 2, cast(2 as tinyint))
   ```
   The `ImplicitCastInputTypes` override in `ArrayInsert` currently always returns an empty sequence, so Spark cannot implicitly cast `tinyint` to `int`.
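   The intended coercion can be illustrated with a toy model (not Spark's actual TypeCoercion logic): when the expression advertises expected input types, the analyzer can widen the narrower integral type so the insert succeeds.
   
   ```python
   def widen_integral(array_elem_type, insert_type):
       """Toy model of implicit-cast resolution for integral types: return the
       wider of the two types, or None when no implicit cast applies (which is
       what an empty expected-types list effectively forces)."""
       widths = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}
       if array_elem_type in widths and insert_type in widths:
           return max(array_elem_type, insert_type, key=widths.get)
       return None
   ```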
   
   
   
   ### Why are the changes needed?
   Fixes incorrect behavior in `array_insert`.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   
   ### How was this patch tested?
   Add new test.
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   
   





[GitHub] [spark] zhengruifeng commented on pull request #42948: [SPARK-45166][PYTHON][FOLLOWUP] Delete unused `pyarrow_version_less_than_minimum` from `pyspark.sql.pandas.utils`

2023-09-15 Thread via GitHub


zhengruifeng commented on PR #42948:
URL: https://github.com/apache/spark/pull/42948#issuecomment-1722080810

   merged to master





[GitHub] [spark] zhengruifeng closed pull request #42948: [SPARK-45166][PYTHON][FOLLOWUP] Delete unused `pyarrow_version_less_than_minimum` from `pyspark.sql.pandas.utils`

2023-09-15 Thread via GitHub


zhengruifeng closed pull request #42948: [SPARK-45166][PYTHON][FOLLOWUP] Delete 
unused `pyarrow_version_less_than_minimum` from `pyspark.sql.pandas.utils`
URL: https://github.com/apache/spark/pull/42948





[GitHub] [spark] github-actions[bot] closed pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-09-15 Thread via GitHub


github-actions[bot] closed pull request #40128: [SPARK-42466][K8S]: Cleanup k8s 
upload directory when job terminates
URL: https://github.com/apache/spark/pull/40128





[GitHub] [spark] github-actions[bot] closed pull request #41203: [SPARK-16484][SQL] Update hll function type checks to also check for non-foldable inputs

2023-09-15 Thread via GitHub


github-actions[bot] closed pull request #41203: [SPARK-16484][SQL] Update hll 
function type checks to also check for non-foldable inputs
URL: https://github.com/apache/spark/pull/41203





[GitHub] [spark] ueshin commented on a diff in pull request #42793: [SPARK-45065][PYTHON][PS] Support Pandas 2.1.0

2023-09-15 Thread via GitHub


ueshin commented on code in PR #42793:
URL: https://github.com/apache/spark/pull/42793#discussion_r132150


##
python/pyspark/pandas/frame.py:
##
@@ -1321,11 +1323,76 @@ def applymap(self, func: Callable[[Any], Any]) -> "DataFrame":
         0   1.000000   4.494400
         1  11.262736  20.857489
         """
+        warnings.warn(
+            "DataFrame.applymap has been deprecated. Use DataFrame.map instead", FutureWarning
+        )
 
         # TODO: We can implement shortcut theoretically since it creates new DataFrame
         #  anyway and we don't have to worry about operations on different DataFrames.
         return self._apply_series_op(lambda psser: psser.apply(func))
 
+    def map(self, func: Callable[[Any], Any]) -> "DataFrame":
+        """
+        Apply a function to a Dataframe elementwise.
+
+        This method applies a function that accepts and returns a scalar
+        to every element of a DataFrame.
+
+        .. versionadded:: 4.0.0
+            DataFrame.applymap was deprecated and renamed to DataFrame.map.
+
+        .. note:: this API executes the function once to infer the type which is
+            potentially expensive, for instance, when the dataset is created after
+            aggregations or sorting.
+
+            To avoid this, specify return type in ``func``, for instance, as below:
+
+            >>> def square(x) -> np.int32:
+            ...     return x ** 2
+
+            pandas-on-Spark uses return type hints and does not try to infer the type.
+
+        Parameters
+        ----------
+        func : callable
+            Python function returns a single value from a single value.
+
+        Returns
+        -------
+        DataFrame
+            Transformed DataFrame.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([[1, 2.12], [3.356, 4.567]])
+        >>> df
+               0      1
+        0  1.000  2.120
+        1  3.356  4.567
+
+        >>> def str_len(x) -> int:
+        ...     return len(str(x))
+        >>> df.map(str_len)
+           0  1
+        0  3  4
+        1  5  5
+
+        >>> def power(x) -> float:
+        ...     return x ** 2
+        >>> df.map(power)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+
+        You can omit type hints and let pandas-on-Spark infer its type.
+
+        >>> df.map(lambda x: x ** 2)
+                   0          1
+        0   1.000000   4.494400
+        1  11.262736  20.857489
+        """
+        return self.applymap(func=func)

Review Comment:
   This call will show a deprecation warning from `applymap`?
   
   I guess we should call `return self._apply_series_op(lambda psser: 
psser.apply(func))` here and `applymap` should call `map` instead?
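The suggested direction of delegation can be sketched with a toy class (hypothetical, not the pandas-on-Spark code): the deprecated alias warns and forwards to the new method, so calling `map` never emits the `FutureWarning`:

```python
import warnings


class Frame:
    def _apply_elementwise(self, func):
        # Stand-in for the real _apply_series_op machinery.
        return [func(v) for v in (1, 2, 3)]

    def map(self, func):
        # New name: no warning here.
        return self._apply_elementwise(func)

    def applymap(self, func):
        # Deprecated alias: warn once, then delegate to map().
        warnings.warn(
            "DataFrame.applymap has been deprecated. Use DataFrame.map instead",
            FutureWarning,
        )
        return self.map(func)
```

Reversing the delegation (having `map` call `applymap`) would make the new API warn about its own deprecation, which is the problem this review thread points out.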






[GitHub] [spark] mayurdb commented on pull request #42950: [SPARK-45182][CORE] Ignore task completion from old stage after retrying indeterminate stages

2023-09-15 Thread via GitHub


mayurdb commented on PR #42950:
URL: https://github.com/apache/spark/pull/42950#issuecomment-1721738630

   @cloud-fan @caican00 can you take a look?





[GitHub] [spark] mayurdb opened a new pull request, #42950: [SPARK-45182][CORE] Ignore task completion from old stage after retrying indeterminate stages

2023-09-15 Thread via GitHub


mayurdb opened a new pull request, #42950:
URL: https://github.com/apache/spark/pull/42950

   ### What changes were proposed in this pull request?
   [SPARK-25342](https://issues.apache.org/jira/browse/SPARK-25342) added support for rolling back a shuffle map stage so that all of its tasks can be retried when the stage output is indeterminate. This is done by clearing all map outputs at the time of stage submission. This approach works well except in the following case:
   
   Assume both Shuffle 1 and Shuffle 2 are indeterminate:
   
   ShuffleMapStage1 --> Shuffle 1 --> ShuffleMapStage2 --> Shuffle 2 --> ResultStage
   
   - ShuffleMapStage1 is complete
   - A task from ShuffleMapStage2 fails with FetchFailed. Other tasks are still running
   - Both ShuffleMapStage1 and ShuffleMapStage2 are retried
   - ShuffleMapStage1 is retried and completes
   - The ShuffleMapStage2 reattempt is scheduled for execution
   - Before all tasks of the ShuffleMapStage2 reattempt can finish, one or more laggard tasks from the original attempt of ShuffleMapStage2 finish, and ShuffleMapStage2 also gets marked as complete
   - The ResultStage gets scheduled and finishes
   
   After this change, such laggard tasks from the old attempt of an indeterminate stage will be ignored.
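   The core of the check can be sketched as follows (hypothetical signature; the real change is in the `DAGScheduler`): a task-completion event carries the stage attempt it belongs to, and completions from attempts older than the latest one are dropped when the stage is indeterminate.
   
   ```python
   def should_ignore_completion(stage_is_indeterminate, event_attempt_id, latest_attempt_id):
       """Drop laggard completions from rolled-back attempts: after an
       indeterminate stage is retried, partitions may only be marked finished
       by tasks of the newest attempt."""
       return stage_is_indeterminate and event_attempt_id < latest_attempt_id
   ```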
   
   ### Why are the changes needed?
   This can give a wrong result when indeterminate stages need to be retried under the circumstances mentioned above.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   A new test case
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





[GitHub] [spark] neilramaswamy commented on a diff in pull request #42895: [SPARK-45138][SS] Define a new error class and apply it when checkpointing state to DFS fails

2023-09-15 Thread via GitHub


neilramaswamy commented on code in PR #42895:
URL: https://github.com/apache/spark/pull/42895#discussion_r1327645005


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala:
##
@@ -135,17 +135,15 @@ private[sql] class HDFSBackedStateStoreProvider extends StateStoreProvider with
 
     /** Commit all the updates that have been made to the store, and return the new version. */
     override def commit(): Long = {
-      verify(state == UPDATING, "Cannot commit after already committed or aborted")
-
       try {
+        verify(state == UPDATING, "Cannot commit after already committed or aborted")
         commitUpdates(newVersion, mapToUpdate, compressedStream)
         state = COMMITTED
         logInfo(s"Committed version $newVersion for $this to file $finalDeltaFile")
         newVersion
       } catch {
-        case NonFatal(e) =>
-          throw new IllegalStateException(
-            s"Error committing version $newVersion into $this", e)

Review Comment:
   Yeah this is a good point, it does capture operator/partition/etc. Updated 
to log `this` now.
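
   The pattern under discussion can be sketched outside Spark (a hedged Python stand-in, not the Scala implementation): moving the state check inside the `try` lets one contextual error path report both bad-state and commit failures.

```python
# Hedged stand-in for the diff above (illustrative, not the Scala code):
# the state verification moves inside the try block, so a commit in the
# wrong state fails through the same contextual error path as an I/O error.
class StoreSketch:
    UPDATING, COMMITTED = "UPDATING", "COMMITTED"

    def __init__(self) -> None:
        self.state = self.UPDATING
        self.version = 0

    def commit(self) -> int:
        try:
            # the check now lives inside the try, like the diff above
            if self.state != self.UPDATING:
                raise AssertionError("Cannot commit after already committed or aborted")
            self.version += 1  # stand-in for commitUpdates(...)
            self.state = self.COMMITTED
            return self.version
        except Exception as e:
            # single contextual error path, analogous to reporting `this`
            raise RuntimeError(f"Error committing into {self!r}") from e
```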






[GitHub] [spark] sunchao commented on pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-09-15 Thread via GitHub


sunchao commented on PR #42612:
URL: https://github.com/apache/spark/pull/42612#issuecomment-1721607397

   Apologies @ConeyLiu , just saw this PR. I think this makes sense. Could you 
rebase it? I'll review afterwards.





[GitHub] [spark] allisonwang-db commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


allisonwang-db commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1327555626


##
python/pyspark/sql/functions.py:
##
@@ -13041,6 +13041,120 @@ def json_object_keys(col: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("json_object_keys", col)
 
 
+@_try_remote_functions
+def from_xml(
+col: "ColumnOrName",
+schema: Union[StructType, Column, str],
+options: Optional[Dict[str, str]] = None,
+) -> Column:
+"""
+Parses a column containing an XML string to a row with
+the specified schema. Returns `null` in the case of an unparseable string.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+col : :class:`~pyspark.sql.Column` or str
+a column or column name in XML format
+schema : :class:`StructType` or str
+a StructType or Python string literal with a DDL-formatted string
+to use when parsing the Xml column
+options : dict, optional
+options to control parsing. accepts the same options as the Xml datasource.
+See `Data Source Option `_
+for the version you use.
+
+.. # noqa
+
+Returns
+---
+:class:`~pyspark.sql.Column`
+a new column of complex type from given XML object.
+
+Examples

Review Comment:
   Documentation is extremely important for a better user experience. @sandip-db could you please create a ticket under https://issues.apache.org/jira/browse/SPARK-44728?
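
   The `from_xml` semantics quoted above can be sketched in plain Python (an assumption-laden illustration using the standard library, not PySpark's implementation): parse an XML string into a row matching a simple schema, returning `None` for unparseable input.

```python
# Plain-Python sketch of the documented behaviour of `from_xml` -- not
# PySpark's implementation: parse an XML string into a row matching a
# flat field list, returning None (null) for an unparseable string.
import xml.etree.ElementTree as ET
from typing import Optional

def from_xml_sketch(xml_str: str, fields: list) -> Optional[dict]:
    try:
        root = ET.fromstring(xml_str)
    except ET.ParseError:
        return None  # mirrors "Returns `null` ... unparseable string"
    return {name: root.findtext(name) for name in fields}

row = from_xml_sketch("<r><a>1</a><b>x</b></r>", ["a", "b"])
# malformed XML such as "<r>" yields None instead of raising
```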









[GitHub] [spark] allisonwang-db commented on a diff in pull request #42915: [SPARK-45159][PYTHON] Handle named arguments only when necessary

2023-09-15 Thread via GitHub


allisonwang-db commented on code in PR #42915:
URL: https://github.com/apache/spark/pull/42915#discussion_r1327519954


##
python/pyspark/worker.py:
##
@@ -810,28 +847,26 @@ def check_return_value(res):
 },
 )
 
-def evaluate(*args: pd.Series, **kwargs: pd.Series):
-if len(args) == 0 and len(kwargs) == 0:
+def evaluate(*args: pd.Series):
+if len(args) == 0:
 res = func()
 check_return_value(res)
 yield verify_result(pd.DataFrame(res)), arrow_return_type
 else:
 # Create tuples from the input pandas Series, each tuple
 # represents a row across all Series.
-keys = list(kwargs.keys())
-len_args = len(args)
-row_tuples = zip(*args, *[kwargs[key] for key in keys])
+row_tuples = zip(*args)
 for row in row_tuples:
-res = func(
-*row[:len_args],
-**{key: row[len_args + i] for i, key in enumerate(keys)},
-)
+res = func(*row)
 check_return_value(res)
 yield verify_result(pd.DataFrame(res)), arrow_return_type
 
 return evaluate
 
-eval = wrap_arrow_udtf(getattr(udtf, "eval"), return_type)
+eval_func_kwargs_support, args_kwargs_offsets = wrap_kwargs_support(

Review Comment:
   It would be really good to comment here on why we need to wrap the kwargs 
separately. This can provide valuable context for those who work on this code 
in the future. :)
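
   The row-rebuilding step in the pre-change code above can be illustrated in isolation (plain lists stand in for pandas Series; names follow the diff, but this is not the worker code itself):

```python
# Small illustration of how the wrapper rebuilds rows: positional and
# keyword columns are zipped together, then each row tuple is split back
# into positional values and a keyword dict.
args = ([1, 2], [10, 20])          # two positional columns
kwargs = {"x": [100, 200]}         # one named column
keys = list(kwargs.keys())
len_args = len(args)

rows = []
for row in zip(*args, *[kwargs[k] for k in keys]):
    rows.append((row[:len_args], {k: row[len_args + i] for i, k in enumerate(keys)}))

# rows == [((1, 10), {"x": 100}), ((2, 20), {"x": 200})]
```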






[GitHub] [spark] vasa47 commented on pull request #41067: [SPARK-43496][KUBERNETES] Add configuration for pod memory limits

2023-09-15 Thread via GitHub


vasa47 commented on PR #41067:
URL: https://github.com/apache/spark/pull/41067#issuecomment-1721506888

   I need this feature. When can we expect this in the main release branch?





[GitHub] [spark] juliuszsompolski commented on pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer

2023-09-15 Thread via GitHub


juliuszsompolski commented on PR #42908:
URL: https://github.com/apache/spark/pull/42908#issuecomment-1721461319

   @LuciferYang I tried looking at 
https://github.com/apache/spark/pull/42560#issuecomment-1718968002 but did not 
reproduce it yet. If you have more instances of CI runs where it failed with 
that stack overflow, that could be useful.
   Inspecting the code, I don't see how that iterator could get looped like 
that...





[GitHub] [spark] juliuszsompolski commented on pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer

2023-09-15 Thread via GitHub


juliuszsompolski commented on PR #42908:
URL: https://github.com/apache/spark/pull/42908#issuecomment-1721459768

   @dongjoon-hyun I don't think the SparkConnectSessionHolderSuite failures are 
related, and I don't know what's going on there.
   ```
   Streaming foreachBatch worker is starting with url sc://localhost:15002/;user_id=testUser and sessionId 9863bb98-6682-43ad-bc86-b32d8486fb47.
   Traceback (most recent call last):
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
       import pandas
   ModuleNotFoundError: No module named 'pandas'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
       exec(code, run_globals)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/foreach_batch_worker.py", line 86, in 
       main(sock_file, sock_file)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/foreach_batch_worker.py", line 51, in main
       spark_connect_session = SparkSession.builder.remote(connect_url).getOrCreate()
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/session.py", line 464, in getOrCreate
       from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/session.py", line 19, in 
       check_dependencies(__name__)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
       require_minimum_pandas_version()
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 34, in require_minimum_pandas_version
       raise ImportError(
   ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.
   [info] - python foreachBatch process: process terminates after query is stopped *** FAILED *** (1 second, 115 milliseconds)
   
   Streaming query listener worker is starting with url sc://localhost:15002/;user_id=testUser and sessionId ab6cfcde-a9f1-4b96-8ca3-7aab5c6ff438.
   Traceback (most recent call last):
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
       import pandas
   ModuleNotFoundError: No module named 'pandas'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
       return _run_code(code, main_globals, None,
     File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
       exec(code, run_globals)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/listener_worker.py", line 99, in 
       main(sock_file, sock_file)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/streaming/worker/listener_worker.py", line 59, in main
       spark_connect_session = SparkSession.builder.remote(connect_url).getOrCreate()
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/session.py", line 464, in getOrCreate
       from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/session.py", line 19, in 
       check_dependencies(__name__)
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
       require_minimum_pandas_version()
     File "/home/runner/work/apache-spark/apache-spark/python/pyspark/sql/pandas/utils.py", line 34, in require_minimum_pandas_version
       raise ImportError(
   ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.
   [info] - python listener process: process terminates after listener is removed *** FAILED *** (434 milliseconds)
   [info]   java.io.EOFException:
   ```
   It looks to me like some (intermittent?) environment issue.



[GitHub] [spark] cdkrot commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-15 Thread via GitHub


cdkrot commented on PR #42949:
URL: https://github.com/apache/spark/pull/42949#issuecomment-1721446450

   cc @HyukjinKwon, @nija-at 





[GitHub] [spark] cdkrot opened a new pull request, #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query

2023-09-15 Thread via GitHub


cdkrot opened a new pull request, #42949:
URL: https://github.com/apache/spark/pull/42949

   ### What changes were proposed in this pull request?
   
   Add error logging to `addArtifact` (see the example under "How was this patch tested?"). The logging code is moved into a separate file to avoid a circular dependency.
   
   ### Why are the changes needed?
   
   Currently, if `addArtifact` is executed with a file which doesn't exist, the user gets a cryptic error:
   
   ```grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that 
terminated with:
   status = StatusCode.UNKNOWN
   details = "Exception iterating requests!"
   debug_error_string = "None"
   >
   ```
   
   This is impossible to debug without digging deep into the subject.
   
   This happens because `addArtifact` is implemented as client-side streaming, and the actual error is raised while grpc consumes the iterator that generates the requests. Unfortunately, grpc doesn't print any debug information for the user to understand the problem.
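
   The approach can be sketched generically (function names here are illustrative, not PySpark's actual helpers): wrap the request-generating iterator so the real exception is logged before grpc collapses it into "Exception iterating requests!".

```python
# Hedged sketch of the idea described above, with illustrative names:
# gRPC client-side streaming consumes a request iterator, and an exception
# raised inside it surfaces only as "Exception iterating requests!".
# Wrapping the generator logs the real cause first, then re-raises.
import logging

logger = logging.getLogger("spark_connect.artifacts")

def logged_requests(create_requests, *args, **kwargs):
    try:
        yield from create_requests(*args, **kwargs)
    except Exception as e:
        logger.error("Failed to execute addArtifact: %s", e)
        raise  # re-raise so the RPC still fails

def make_requests(path):
    # illustrative stand-in for the real request builder; raises when
    # the artifact file does not exist
    with open(path, "rb") as f:
        yield f.read()
```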
   
   ### Does this PR introduce _any_ user-facing change?
   
   Additional logging, which is opt-in in the same way as before via the `SPARK_CONNECT_LOG_LEVEL` environment variable.
   
   ### How was this patch tested?
   
   ```
   >>> s.addArtifact("XYZ", file=True)
   2023-09-15 17:06:40,078 11789 ERROR _create_requests Failed to execute addArtifact: [Errno 2] No such file or directory: '/Users/alice.sayutina/apache_spark/python/XYZ'
   Traceback (most recent call last):
     File "", line 1, in 
     File "/Users/alice.sayutina/apache_spark/python/pyspark/sql/connect/session.py", line 743, in addArtifacts
       self._client.add_artifacts(*path, pyfile=pyfile, archive=archive, file=file)
   
   []
   
     File "/Users/alice.sayutina/oss-venv/lib/python3.11/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
       raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
   ^^
   grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
   status = StatusCode.UNKNOWN
   details = "Exception iterating requests!"
   debug_error_string = "None"
   >
   
   ```





[GitHub] [spark] yaooqinn commented on pull request #42935: [SPARK-45173][UI] Remove some unnecessary sourceMapping files in UI

2023-09-15 Thread via GitHub


yaooqinn commented on PR #42935:
URL: https://github.com/apache/spark/pull/42935#issuecomment-1721425614

   Yeah, the source map files are for debugging purposes: they enable browsers to map JS/CSS created by a preprocessor back to the original source files. For production, we'd better not ship them. (Added to the PR description too.)
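
   Not shipping source maps typically means dropping the `*.map` files and the `sourceMappingURL` comment that points browsers at them; a minimal sketch of the latter (illustrative only, not the actual build change in this PR):

```python
# Illustrative sketch: strip trailing `//# sourceMappingURL=...` comments
# so shipped JS no longer points browsers at *.map files.
import re

def strip_source_map_refs(js_text: str) -> str:
    return re.sub(r"//# sourceMappingURL=\S+\s*$", "", js_text, flags=re.MULTILINE)

out = strip_source_map_refs("console.log(1);\n//# sourceMappingURL=app.js.map\n")
# `out` keeps the code but no longer references app.js.map
```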





[GitHub] [spark] sandip-db commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


sandip-db commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1327417666


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala:
##
@@ -229,6 +229,18 @@ class FunctionTestSuite extends ConnectFunSuite {
 schema_of_csv("x,y"),
 schema_of_csv(lit("x,y"), Collections.emptyMap()))
   testEquals("to_csv", to_csv(a), to_csv(a, Collections.emptyMap[String, 
String]))
+  testEquals(
+"from_xml",
+from_xml(a, schema),
+from_xml(a, lit(schema.toDDL)),
+// from_xml(a, lit(schema.json)),

Review Comment:
   @HyukjinKwon @itholic 
   This is failing due to a parse error. So I commented it temporarily. The 
same passes for from_json() above in this test.
   Can you please share some pointers? 






[GitHub] [spark] sandip-db commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


sandip-db commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1327401538


##
python/pyspark/sql/tests/connect/test_connect_function.py:
##
@@ -1821,6 +1821,106 @@ def test_json_functions(self):
 sdf.select(SF.to_json(SF.struct(SF.lit("a"), SF.lit("b")), 
{"mode": "FAILFAST"})),
 )
 
+def test_xml_functions(self):
+query = """
+SELECT * FROM VALUES
+('1', '123', '5.0'),
+('0', '456', '')
+AS tab(a, b, c)
+"""
+# +---+---+---+
+# |  a|  b|  c|
+# +---+---+---+
+# |  1|123|5.0|
+# |  0|456|   |
+# +---+---+---+
+
+cdf = self.connect.sql(query)
+sdf = self.spark.sql(query)
+
+# test from_xml
+for schema in [
+"a INT",
+#StructType([StructField("a", IntegerType())]),

Review Comment:
   I would like to support this. pyspark sql works fine, but pyspark connect is 
failing to parse the StructType.
   Any pointers would help.






[GitHub] [spark-connect-go] arnarpall commented on pull request #12: [SPARK-44141] Removed need to have buf preinstalled

2023-09-15 Thread via GitHub


arnarpall commented on PR #12:
URL: https://github.com/apache/spark-connect-go/pull/12#issuecomment-1721362325

   It seems like not all the changes to the workflow are being reflected properly.
   
   The current failure: the output from the `internal/generated.out` target for this run is
   ```
   >> BUILD, output = internal/generated.outbuf generate --debug -vvv
   bash: line 1: buf: command not found
   ```
   
   However, this is what we should be expecting:
   ```
   >> BUILD, output = internal/generated.outGO111MODULE=on go run 
github.com/bufbuild/buf/cmd/buf@v1.26.1 generate --debug -vvv
   ```
   Here, instead of calling the buf binary directly, we expect it to be wrapped in a `go run` statement.





[GitHub] [spark] LuciferYang commented on pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


LuciferYang commented on PR #42918:
URL: https://github.com/apache/spark/pull/42918#issuecomment-1721331830

   > Could you re-trigger the failed pipelines?
   
   Triggered





[GitHub] [spark] pan3793 commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader

2023-09-15 Thread via GitHub


pan3793 commented on code in PR #42599:
URL: https://github.com/apache/spark/pull/42599#discussion_r1327275616


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging {
 }
 val hiveArtifacts = version.extraDeps ++
   Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde")
-.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++
-  Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames
+.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ 
hadoopJarNames

Review Comment:
   @JoshRosen @sunchao do you have any suggestions for this one?






[GitHub] [spark] dongjoon-hyun commented on pull request #42935: [SPARK-45173][UI] Remove some unnecessary sourceMapping files in UI

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42935:
URL: https://github.com/apache/spark/pull/42935#issuecomment-1721244501

   If possible, please elaborate a little more in the PR description, @yaooqinn 
. :) 





[GitHub] [spark] dongjoon-hyun commented on pull request #42941: [SPARK-43874][FOLLOWUP][TESTS] Enable `GroupbyIndexTests.test_groupby_multiindex_columns`

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42941:
URL: https://github.com/apache/spark/pull/42941#issuecomment-1721225088

   Merged to master. Thank you!





[GitHub] [spark] dongjoon-hyun closed pull request #42941: [SPARK-43874][FOLLOWUP][TESTS] Enable `GroupbyIndexTests.test_groupby_multiindex_columns`

2023-09-15 Thread via GitHub


dongjoon-hyun closed pull request #42941: [SPARK-43874][FOLLOWUP][TESTS] Enable 
`GroupbyIndexTests.test_groupby_multiindex_columns`
URL: https://github.com/apache/spark/pull/42941





[GitHub] [spark] dongjoon-hyun commented on pull request #42943: [SPARK-45175][K8S] download krb5.conf from remote storage in spark-submit on k8s

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42943:
URL: https://github.com/apache/spark/pull/42943#issuecomment-1721220006

   I have the same question as @yaooqinn. Since this is in the `Security` domain, I'm wondering if this is safe or a recommended way for Kerberos.





[GitHub] [spark] dongjoon-hyun commented on pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


dongjoon-hyun commented on PR #42918:
URL: https://github.com/apache/spark/pull/42918#issuecomment-1721203298

   Could you re-trigger the failed pipelines?





[GitHub] [spark] zhengruifeng commented on pull request #42944: [SPARK-45179][PYTHON] Increase Numpy minimum version to 1.21

2023-09-15 Thread via GitHub


zhengruifeng commented on PR #42944:
URL: https://github.com/apache/spark/pull/42944#issuecomment-1721199404

   thanks, merged to master





[GitHub] [spark] zhengruifeng closed pull request #42944: [SPARK-45179][PYTHON] Increase Numpy minimum version to 1.21

2023-09-15 Thread via GitHub


zhengruifeng closed pull request #42944: [SPARK-45179][PYTHON] Increase Numpy 
minimum version to 1.21
URL: https://github.com/apache/spark/pull/42944





[GitHub] [spark] zhengruifeng commented on pull request #42948: [SPARK-45166][PYTHON][FOLLOWUP] Delete unused `pyarrow_version_less_than_minimum` from `pyspark.sql.pandas.utils`

2023-09-15 Thread via GitHub


zhengruifeng commented on PR #42948:
URL: https://github.com/apache/spark/pull/42948#issuecomment-1721155323

   CI link: 
https://github.com/zhengruifeng/spark/actions/runs/6195005123/job/16818927706





[GitHub] [spark] zhengruifeng opened a new pull request, #42948: [SPARK-45166][PYTHON][FOLLOWUP] Delete unused `pyarrow_version_less_than_minimum` from `pyspark.sql.pandas.utils`

2023-09-15 Thread via GitHub


zhengruifeng opened a new pull request, #42948:
URL: https://github.com/apache/spark/pull/42948

   ### What changes were proposed in this pull request?
   Delete unused `pyarrow_version_less_than_minimum` from 
`pyspark.sql.pandas.utils`
   
   
   ### Why are the changes needed?
   This method is only used to compare the PyArrow version with 2.0.0, which is no longer needed after the minimum version was set to 4.0.0.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, dev-only
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No





[GitHub] [spark] zhengruifeng commented on pull request #42947: [SPARK-45181][BUILD] Upgrade buf to v1.26.1

2023-09-15 Thread via GitHub


zhengruifeng commented on PR #42947:
URL: https://github.com/apache/spark/pull/42947#issuecomment-1721124704

   I think we can continue our **monthly** upgrade of this package





[GitHub] [spark] zhengruifeng opened a new pull request, #42947: [SPARK-45181][BUILD] Upgrade buf to v1.26.1

2023-09-15 Thread via GitHub


zhengruifeng opened a new pull request, #42947:
URL: https://github.com/apache/spark/pull/42947

   ### What changes were proposed in this pull request?
   Upgrade buf to v1.26.1
   
   
   ### Why are the changes needed?
   This upgrade causes no change in the generated code.
   It fixes multiple issues; see https://github.com/bufbuild/buf/releases
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, dev-only
   
   
   ### How was this patch tested?
   Manually checked
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





[GitHub] [spark] panbingkun commented on pull request #42917: [SPARK-45163][SQL] Merge UNSUPPORTED_VIEW_OPERATION & UNSUPPORTED_TABLE_OPERATION & fix some issue

2023-09-15 Thread via GitHub


panbingkun commented on PR #42917:
URL: https://github.com/apache/spark/pull/42917#issuecomment-1721070239

   I have checked all UTs to confirm that prompts appear where they should and do not appear where they should not.





[GitHub] [spark] panbingkun commented on a diff in pull request #42917: [SPARK-45163][SQL] Merge UNSUPPORTED_VIEW_OPERATION & UNSUPPORTED_TABLE_OPERATION & fix some issue

2023-09-15 Thread via GitHub


panbingkun commented on code in PR #42917:
URL: https://github.com/apache/spark/pull/42917#discussion_r1327126016


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -4715,8 +4714,7 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
 RecoverPartitions(
   createUnresolvedTable(
 ctx.identifierReference,
-"ALTER TABLE ... RECOVER PARTITIONS",
-true))
+"ALTER TABLE ... RECOVER PARTITIONS"))
   }

Review Comment:
   Because the `ALTER VIEW ... RECOVER PARTITIONS` syntax is not supported on views, the suggestion prompt can actually cause misunderstandings.






[GitHub] [spark] panbingkun commented on a diff in pull request #42917: [SPARK-45163][SQL] Merge UNSUPPORTED_VIEW_OPERATION & UNSUPPORTED_TABLE_OPERATION & fix some issue

2023-09-15 Thread via GitHub


panbingkun commented on code in PR #42917:
URL: https://github.com/apache/spark/pull/42917#discussion_r1327125546


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -976,7 +976,7 @@ class HiveDDLSuite
   exception = intercept[AnalysisException] {
 sql(s"ALTER TABLE $oldViewName RECOVER PARTITIONS")
   },
-  errorClass = "UNSUPPORTED_VIEW_OPERATION.WITH_SUGGESTION",

Review Comment:
   Other modifications are similar.






[GitHub] [spark] panbingkun commented on a diff in pull request #42917: [SPARK-45163][SQL] Merge UNSUPPORTED_VIEW_OPERATION & UNSUPPORTED_TABLE_OPERATION & fix some issue

2023-09-15 Thread via GitHub


panbingkun commented on code in PR #42917:
URL: https://github.com/apache/spark/pull/42917#discussion_r1327124355


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala:
##
@@ -976,7 +976,7 @@ class HiveDDLSuite
   exception = intercept[AnalysisException] {
 sql(s"ALTER TABLE $oldViewName RECOVER PARTITIONS")
   },
-  errorClass = "UNSUPPORTED_VIEW_OPERATION.WITH_SUGGESTION",

Review Comment:
   The old logic would prompt `Please use ALTER VIEW instead.` here, which is incorrect: because the `ALTER VIEW ... RECOVER PARTITIONS` syntax is not supported on views, that prompt can actually cause misunderstandings.
   I have checked the rationality of the error prompts returned by all test cases.
   






[GitHub] [spark] panbingkun commented on a diff in pull request #42917: [SPARK-45163][SQL] Merge UNSUPPORTED_VIEW_OPERATION & UNSUPPORTED_TABLE_OPERATION & fix some issue

2023-09-15 Thread via GitHub


panbingkun commented on code in PR #42917:
URL: https://github.com/apache/spark/pull/42917#discussion_r1327119760


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3215,11 +3225,6 @@
      "<variableName> is a VARIABLE and cannot be updated using the SET statement. Use SET VARIABLE <variableName> = ... instead."
 ]
   },
-  "TABLE_OPERATION" : {
-    "message" : [
-      "Table <tableName> does not support <operation>. Please check the current catalog and namespace to make sure the qualified table name is expected, and also check the catalog implementation which is configured by \"spark.sql.catalog\"."

Review Comment:
   Done.
   PS: After this PR, I will open a separate PR to merge `_LEGACY_ERROR_TEMP_1113` into `UNSUPPORTED_FEATURE.TABLE_OPERATION`.






[GitHub] [spark] yaooqinn closed pull request #42904: [SPARK-45151][CORE][UI] Task Level Thread Dump Support

2023-09-15 Thread via GitHub


yaooqinn closed pull request #42904: [SPARK-45151][CORE][UI] Task Level Thread 
Dump Support
URL: https://github.com/apache/spark/pull/42904





[GitHub] [spark] yaooqinn commented on pull request #42904: [SPARK-45151][CORE][UI] Task Level Thread Dump Support

2023-09-15 Thread via GitHub


yaooqinn commented on PR #42904:
URL: https://github.com/apache/spark/pull/42904#issuecomment-1721030343

   The second-to-last commit passed CI, and the last one is a minor change.
   
   Thanks @mridulm and @dongjoon-hyun for the review
   Merged to master
   





[GitHub] [spark] yaooqinn commented on pull request #42943: [SPARK-45175][K8S] download krb5.conf from remote storage in spark-submit on k8s

2023-09-15 Thread via GitHub


yaooqinn commented on PR #42943:
URL: https://github.com/apache/spark/pull/42943#issuecomment-1721015896

   What if the remote storage requires login via Kerberos before accessing it?





[GitHub] [spark] dcoliversun commented on pull request #42943: [SPARK-45175][K8S] download krb5.conf from remote storage in spark-submit on k8s

2023-09-15 Thread via GitHub


dcoliversun commented on PR #42943:
URL: https://github.com/apache/spark/pull/42943#issuecomment-1721010955

   @dongjoon-hyun It would be good if you have time to review this PR





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42929: [SPARK-45167][CONNECT] Python client must call `release_all`

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42929:
URL: https://github.com/apache/spark/pull/42929#discussion_r1327072112


##
python/pyspark/sql/connect/client/reattach.py:
##
@@ -14,14 +14,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+from multiprocessing import RLock

Review Comment:
   Nit, but the import should go below, next to `from multiprocessing.pool import ThreadPool`.
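
   For what it's worth, the grouping this nit asks for follows PEP 8's convention of keeping related standard-library imports together. A small, self-contained illustration (hypothetical code, not the actual `reattach.py`):

   ```python
   # Related stdlib imports from the same package sit next to each other.
   from multiprocessing import RLock
   from multiprocessing.pool import ThreadPool

   lock = RLock()  # reentrant lock guarding shared state

   def square_all(xs):
       with lock:  # an RLock may be re-acquired by the thread that holds it
           with ThreadPool(2) as pool:
               return pool.map(lambda x: x * x, xs)

   print(square_all([1, 2, 3]))  # [1, 4, 9]
   ```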






[GitHub] [spark] itholic commented on pull request #42946: [DO-NOT-METGE] Test Jinja2 latest

2023-09-15 Thread via GitHub


itholic commented on PR #42946:
URL: https://github.com/apache/spark/pull/42946#issuecomment-1720947209

   We need the latest version of `Jinja2` for some functions from Pandas 2.0.0.





[GitHub] [spark] itholic opened a new pull request, #42946: [DO-NOT-METGE] Test Jinja2 latest

2023-09-15 Thread via GitHub


itholic opened a new pull request, #42946:
URL: https://github.com/apache/spark/pull/42946

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[GitHub] [spark] zhengruifeng commented on pull request #42942: [SPARK-45168][PYTHON][FOLLOWUP] `test_missing_data.py` Code Cleanup

2023-09-15 Thread via GitHub


zhengruifeng commented on PR #42942:
URL: https://github.com/apache/spark/pull/42942#issuecomment-1720943164

   thanks @dongjoon-hyun , merged to master





[GitHub] [spark] zhengruifeng closed pull request #42942: [SPARK-45168][PYTHON][FOLLOWUP] `test_missing_data.py` Code Cleanup

2023-09-15 Thread via GitHub


zhengruifeng closed pull request #42942: [SPARK-45168][PYTHON][FOLLOWUP]  
`test_missing_data.py` Code Cleanup
URL: https://github.com/apache/spark/pull/42942





[GitHub] [spark] peter-toth commented on pull request #42755: [SPARK-45034][SQL] Support deterministic mode function

2023-09-15 Thread via GitHub


peter-toth commented on PR #42755:
URL: https://github.com/apache/spark/pull/42755#issuecomment-1720914122

   Hmm, the failure seems unrelated but persistent...
   ```
   [info] - client INVALID_CURSOR.DISCONNECTED error is retried when other RPC 
preempts this one *** FAILED *** (385 milliseconds)
   [info]   org.apache.spark.SparkException: io.grpc.StatusRuntimeException: 
INTERNAL: [INVALID_CURSOR.POSITION_NOT_AVAILABLE] The cursor is invalid. The 
cursor position id c156089c-f861-467e-9031-31ee6f26d2b4 is no longer available 
at index 2.
   [info]   at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:113)
   ```





[GitHub] [spark] itholic commented on pull request #42890: [SPARK-25689][YARN][FOLLOWUP] Add a missing argument usage description for ApplicationMasterArguments

2023-09-15 Thread via GitHub


itholic commented on PR #42890:
URL: https://github.com/apache/spark/pull/42890#issuecomment-1720903171

   Actually, I'm not very familiar with YARN clusters, so we might need a review from someone who has enough context for this (maybe @vanzin or @squito?).
   
   Otherwise, can you give me more context for this PR? What is the major difference or benefit after this PR compared to the current status?





[GitHub] [spark] beliefer commented on pull request #42861: [SPARK-45108][SQL] Improve the InjectRuntimeFilter for check probably shuffle

2023-09-15 Thread via GitHub


beliefer commented on PR #42861:
URL: https://github.com/apache/spark/pull/42861#issuecomment-1720864374

   ping @cloud-fan @viirya cc @somani 





[GitHub] [spark] HyukjinKwon closed pull request #42937: [SPARK-45177][PS] Remove `col_space` parameter from `to_latex`

2023-09-15 Thread via GitHub


HyukjinKwon closed pull request #42937: [SPARK-45177][PS] Remove `col_space` 
parameter from `to_latex`
URL: https://github.com/apache/spark/pull/42937





[GitHub] [spark] HyukjinKwon commented on pull request #42937: [SPARK-45177][PS] Remove `col_space` parameter from `to_latex`

2023-09-15 Thread via GitHub


HyukjinKwon commented on PR #42937:
URL: https://github.com/apache/spark/pull/42937#issuecomment-1720812084

   Merged to master.





[GitHub] [spark] Hisoka-X commented on pull request #42802: [SPARK-43752][SQL] Support default column value on DataSource V2

2023-09-15 Thread via GitHub


Hisoka-X commented on PR #42802:
URL: https://github.com/apache/spark/pull/42802#issuecomment-1720796610

   cc @cloud-fan Would you mind taking a look at this? Thanks.





[GitHub] [spark] LuciferYang commented on pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


LuciferYang commented on PR #42918:
URL: https://github.com/apache/spark/pull/42918#issuecomment-1720796029

   > 2.13.12 has also been release 
https://github.com/scala/scala/releases/tag/v2.13.12 so we might want to look 
into that in the future.
   
   We can test it once Ammonite releases a new version supporting Scala 2.13.12.
   
   





[GitHub] [spark] Hisoka-X commented on a diff in pull request #42802: [SPARK-43752][SQL] Support default column value on DataSource V2

2023-09-15 Thread via GitHub


Hisoka-X commented on code in PR #42802:
URL: https://github.com/apache/spark/pull/42802#discussion_r1326897921


##
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTableCatalog.scala:
##
@@ -139,8 +141,32 @@ class BasicInMemoryTableCatalog extends TableCatalog {
   throw new IllegalArgumentException(s"Cannot drop all fields")
 }
 
+def createNewData(

Review Comment:
   `InMemoryTableCatalog` should change the data in the table when executing ALTER TABLE ADD/DROP COLUMN.






[GitHub] [spark] LuciferYang commented on pull request #42918: [SPARK-40497][BUILD] Re-upgrade Scala to 2.13.11

2023-09-15 Thread via GitHub


LuciferYang commented on PR #42918:
URL: https://github.com/apache/spark/pull/42918#issuecomment-1720792981

   Thanks @dongjoon-hyun and @eejbyfeldt 





[GitHub] [spark] HyukjinKwon closed pull request #42928: [SPARK-45166][PYTHON] Clean up unused code paths for pyarrow<4

2023-09-15 Thread via GitHub


HyukjinKwon closed pull request #42928: [SPARK-45166][PYTHON] Clean up unused 
code paths for pyarrow<4
URL: https://github.com/apache/spark/pull/42928





[GitHub] [spark] HyukjinKwon commented on pull request #42928: [SPARK-45166][PYTHON] Clean up unused code paths for pyarrow<4

2023-09-15 Thread via GitHub


HyukjinKwon commented on PR #42928:
URL: https://github.com/apache/spark/pull/42928#issuecomment-1720790062

   Merged to master.





[GitHub] [spark] HyukjinKwon closed pull request #42847: [SPARK-45128][SQL] Support `CalendarIntervalType` in Arrow

2023-09-15 Thread via GitHub


HyukjinKwon closed pull request #42847: [SPARK-45128][SQL] Support 
`CalendarIntervalType` in Arrow
URL: https://github.com/apache/spark/pull/42847





[GitHub] [spark] WeichenXu123 commented on pull request #42886: [SPARK-45129] Add pyspark "ml-connect" extras dependencies

2023-09-15 Thread via GitHub


WeichenXu123 commented on PR #42886:
URL: https://github.com/apache/spark/pull/42886#issuecomment-1720788624

   Thanks! @HyukjinKwon When we run `pip install pyspark[ml-connect]`, it should install the `pyspark[connect]` dependencies too.





[GitHub] [spark] HyukjinKwon commented on pull request #42847: [SPARK-45128][SQL] Support `CalendarIntervalType` in Arrow

2023-09-15 Thread via GitHub


HyukjinKwon commented on PR #42847:
URL: https://github.com/apache/spark/pull/42847#issuecomment-1720788409

   Merged to master.





[GitHub] [spark] rickyma commented on pull request #42890: [SPARK-25689][YARN][FOLLOWUP] Add a missing argument usage description for ApplicationMasterArguments

2023-09-15 Thread via GitHub


rickyma commented on PR #42890:
URL: https://github.com/apache/spark/pull/42890#issuecomment-1720787138

   @itholic @HyukjinKwon Hey, can you guys merge this? This pull request 
doesn't need to be tested. Thanks a lot.





[GitHub] [spark] HyukjinKwon commented on pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on PR #42938:
URL: https://github.com/apache/spark/pull/42938#issuecomment-1720785722

   cc @itholic mind helping review this please?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r132670


##
python/pyspark/sql/functions.py:
##
@@ -13041,6 +13041,120 @@ def json_object_keys(col: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("json_object_keys", col)
 
 
+@_try_remote_functions
+def from_xml(
+col: "ColumnOrName",
+schema: Union[StructType, Column, str],
+options: Optional[Dict[str, str]] = None,
+) -> Column:
+"""
+Parses a column containing a XML string to a row with
+the specified schema. Returns `null`, in the case of an unparseable string.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+col : :class:`~pyspark.sql.Column` or str
+a column or column name in XML format
+schema : :class:`StructType` or str
+a StructType or Python string literal with a DDL-formatted string
+to use when parsing the Xml column
+options : dict, optional
+options to control parsing. accepts the same options as the Xml 
datasource.
+See `Data Source Option 
`_
+for the version you use.
+
+.. # noqa
+
+Returns
+---
+:class:`~pyspark.sql.Column`
+a new column of complex type from given XML object.
+
+Examples
+
+>>> from pyspark.sql.types import *
+>>> from pyspark.sql.functions import from_xml, schema_of_xml, lit
+>>> data = [(1, '''<p><a>1</a></p>''')]
+>>> schema = StructType([StructField("a", IntegerType())])
+>>> df = spark.createDataFrame(data, ("key", "value"))
+>>> df.select(from_xml(df.value, schema).alias("xml")).collect()
+[Row(xml=Row(a=1))]
+>>> df.select(from_xml(df.value, "a INT").alias("xml")).collect()
+[Row(xml=Row(a=1))]
+>>> data = [(1, '<p><a>1</a><a>2</a></p>')]
+>>> df = spark.createDataFrame(data, ("key", "value"))
+>>> schema = StructType([StructField("a", ArrayType(IntegerType()))])
+>>> df.select(from_xml(df.value, schema).alias("xml")).collect()
+[Row(xml=Row(a=[1, 2]))]
+>>> schema = schema_of_xml(lit(data[0][1]))
+>>> df.select(from_xml(df.value, schema).alias("xml")).collect()
+[Row(xml=Row(a=[1, 2]))]
+"""
+
+if isinstance(schema, StructType):
+schema = schema.json()
+elif isinstance(schema, Column):
+schema = _to_java_column(schema)
+elif not isinstance(schema, str):
+raise PySparkTypeError(
+error_class="NOT_COLUMN_OR_STR_OR_STRUCT",
+message_parameters={"arg_name": "schema", "arg_type": 
type(schema).__name__},
+)
+return _invoke_function("from_xml", _to_java_column(col), schema, 
_options_to_str(options))
+
+
+@_try_remote_functions
+def schema_of_xml(xml: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Column:
+    """
+    Parses a XML string and infers its schema in DDL format.
+
+    .. versionadded:: 4.0.0
+
+    Parameters
+    ----------
+    xml : :class:`~pyspark.sql.Column` or str
+        a XML string or a foldable string column containing a XML string.
+    options : dict, optional
+        options to control parsing. accepts the same options as the XML datasource.
+        See `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option>`_
+        for the version you use.
+
+        .. # noqa
+
+    .. versionchanged:: 4.0.0

Review Comment:
   You can remove this since this is a new feature; we won't need to annotate them.
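The wrapper quoted above dispatches on the runtime type of `schema`: a `StructType` is serialized to JSON, a plain string passes through, and anything else raises a typed error. A minimal standalone sketch of that dispatch pattern, with hypothetical stand-ins (`FakeStructType`, `normalize_schema`) and no Spark dependency:

```python
from typing import Union


class FakeStructType:
    """Stand-in for pyspark.sql.types.StructType (hypothetical, illustration only)."""

    def __init__(self, ddl: str):
        self.ddl = ddl

    def json(self) -> str:
        # The real StructType.json() returns a JSON description of the schema.
        return '{"type": "struct", "ddl": "%s"}' % self.ddl


def normalize_schema(schema: Union[FakeStructType, str]) -> str:
    # Mirrors the isinstance chain in the quoted wrapper:
    # StructType -> JSON string, str -> unchanged, otherwise a type error.
    if isinstance(schema, FakeStructType):
        return schema.json()
    if isinstance(schema, str):
        return schema
    raise TypeError(
        "schema must be StructType or str, got %s" % type(schema).__name__
    )
```

The real wrapper additionally accepts a `Column` (for a schema produced by `schema_of_xml`); the sketch omits that branch to stay dependency-free.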



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326888182


##
python/pyspark/sql/functions.py:
##
@@ -13041,6 +13041,120 @@ def json_object_keys(col: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("json_object_keys", col)
 
 
+@_try_remote_functions
+def from_xml(
+    col: "ColumnOrName",
+    schema: Union[StructType, Column, str],
+    options: Optional[Dict[str, str]] = None,
+) -> Column:
+    """
+    Parses a column containing a XML string to a row with
+    the specified schema. Returns `null`, in the case of an unparseable string.
+
+    .. versionadded:: 4.0.0
+
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        a column or column name in XML format
+    schema : :class:`StructType` or str
+        a StructType or Python string literal with a DDL-formatted string
+        to use when parsing the Xml column
+    options : dict, optional
+        options to control parsing. accepts the same options as the Xml datasource.

Review Comment:
   ```suggestion
   options to control parsing. Accepts the same options as the Xml datasource.
   ```






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326886753


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -830,7 +830,11 @@ object FunctionRegistry {
     // csv
     expression[CsvToStructs]("from_csv"),
     expression[SchemaOfCsv]("schema_of_csv"),
-    expression[StructsToCsv]("to_csv")
+    expression[StructsToCsv]("to_csv"),
+
+    // Xml
+    expression[XmlToStructs] ("from_xml"),

Review Comment:
   To register this properly, I think `XmlToStructs` has to be decorated by 
`ExpressionDescription` like you did in `SchemaOfXml`.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326885469


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##
@@ -830,7 +830,11 @@ object FunctionRegistry {
     // csv
     expression[CsvToStructs]("from_csv"),
     expression[SchemaOfCsv]("schema_of_csv"),
-    expression[StructsToCsv]("to_csv")
+    expression[StructsToCsv]("to_csv"),
+
+    // Xml
+    expression[XmlToStructs] ("from_xml"),

Review Comment:
   ```suggestion
   expression[XmlToStructs]("from_xml"),
   ```






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326885256


##
python/pyspark/sql/tests/connect/test_connect_function.py:
##
@@ -1821,6 +1821,106 @@ def test_json_functions(self):
             sdf.select(SF.to_json(SF.struct(SF.lit("a"), SF.lit("b")), {"mode": "FAILFAST"})),
         )
 
+    def test_xml_functions(self):
+        query = """
+            SELECT * FROM VALUES
+            ('1', '123', '5.0'),
+            ('0', '456', '')
+            AS tab(a, b, c)
+            """
+        # +---+---+---+
+        # |  a|  b|  c|
+        # +---+---+---+
+        # |  1|123|5.0|
+        # |  0|456|   |
+        # +---+---+---+
+
+        cdf = self.connect.sql(query)
+        sdf = self.spark.sql(query)
+
+        # test from_xml
+        for schema in [
+            "a INT",
+            #StructType([StructField("a", IntegerType())]),

Review Comment:
   Let's probably remove commented codes.






[GitHub] [spark] itholic opened a new pull request, #42945: [SPARK-45180][PS] Remove boolean inputs for `inclusive` parameter from `Series.between`

2023-09-15 Thread via GitHub


itholic opened a new pull request, #42945:
URL: https://github.com/apache/spark/pull/42945

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR proposes to remove boolean inputs for `inclusive` parameter from 
`Series.between` in favor of `both` and `neither`
   
   
   ### Why are the changes needed?
   
   To match the behavior of latest Pandas.
   
   ### Does this PR introduce _any_ user-facing change?
   
   boolean type input is no longer supported for `inclusive` parameter of 
`Series.between`
   
   ### How was this patch tested?
   
   Updating & enabling the existing UT.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
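The `inclusive` semantics under discussion can be sketched as a plain-Python helper. `between` below is a hypothetical standalone function, not the pandas-on-Spark implementation; it only illustrates the four string modes and the rejection of the removed boolean inputs:

```python
def between(value, left, right, inclusive="both"):
    """Plain-Python sketch of the `inclusive` semantics described above.

    Hypothetical helper: only the string modes survive, and the removed
    boolean inputs now raise instead of mapping to both/neither.
    """
    if inclusive == "both":
        return left <= value <= right
    if inclusive == "neither":
        return left < value < right
    if inclusive == "left":
        return left <= value < right
    if inclusive == "right":
        return left < value <= right
    # Booleans (the removed inputs) fall through to here, matching the
    # stricter validation described in the PR.
    raise ValueError(
        "inclusive must be 'both', 'neither', 'left' or 'right', got %r" % (inclusive,)
    )
```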





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326882806


##
python/pyspark/sql/functions.py:
##
@@ -13041,6 +13041,120 @@ def json_object_keys(col: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("json_object_keys", col)
 
 
+@_try_remote_functions
+def from_xml(
+    col: "ColumnOrName",
+    schema: Union[StructType, Column, str],
+    options: Optional[Dict[str, str]] = None,
+) -> Column:
+    """
+    Parses a column containing a XML string to a row with
+    the specified schema. Returns `null`, in the case of an unparseable string.
+
+    .. versionadded:: 4.0.0
+
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        a column or column name in XML format
+    schema : :class:`StructType` or str
+        a StructType or Python string literal with a DDL-formatted string
+        to use when parsing the Xml column
+    options : dict, optional
+        options to control parsing. accepts the same options as the Xml datasource.
+        See `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option>`_
+        for the version you use.
+
+        .. # noqa
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        a new column of complex type from given XML object.
+
+    Examples

Review Comment:
   I think we should improve the examples here but let's do that separately. cc 
@allisonwang-db and @zhengruifeng FYI






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42938: [SPARK-44788][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42938:
URL: https://github.com/apache/spark/pull/42938#discussion_r1326881986


##
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -7367,15 +7367,83 @@ object functions {
   *
   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option">
   * Data Source Option</a> in the version you use.
   * @group collection_funcs
-   * @since
+   * @since 4.0.0
   */
  // scalastyle:on line.size.limit
  def from_xml(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr {
    XmlToStructs(CharVarcharUtils.failIfHasCharVarchar(schema), options, e.expr)
  }
 
+  // scalastyle:off line.size.limit
+  /**
+   * (Java-specific) Parses a column containing a XML string into a `StructType`
+   * with the specified schema.
+   * Returns `null`, in the case of an unparseable string.
+   *
+   * @param e       a string column containing XML data.
+   * @param schema  the schema as a DDL-formatted string.
+   * @param options options to control how the XML is parsed. accepts the same options and the
+   *                xml data source.
+   *                See
+   *                <a href="https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option">
+   *                Data Source Option</a> in the version you use.
+   * @group collection_funcs
+   * @since 4.0.0
+   */
+  // scalastyle:on line.size.limit
+  def from_xml(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
+    from_xml(e, schema, options.asScala.toMap)
+  }
+
+  // scalastyle:off line.size.limit
+
+  /**
+   * (Scala-specific) Parses a column containing a XML string into a `StructType`
+   * with the specified schema.
+   * Returns `null`, in the case of an unparseable string.
+   *
+   * @param e       a string column containing XML data.
+   * @param schema  the schema as a DDL-formatted string.
+   * @param options options to control how the XML is parsed. accepts the same options and the
+   *                Xml data source.
+   *                See
+   *                <a href="https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option">
+   *                Data Source Option</a> in the version you use.
+   * @group collection_funcs
+   * @since 4.0.0
+   */
+  // scalastyle:on line.size.limit
+  def from_xml(e: Column, schema: String, options: Map[String, String]): Column = {

Review Comment:
   Let's probably remove the Scala specific `Map`, and only have one with Java 
signature. Scala can easily use them.






[GitHub] [spark] zhengruifeng opened a new pull request, #42944: [SPARK-45179][PYTHON] Increase Numpy minimum version to 1.21

2023-09-15 Thread via GitHub


zhengruifeng opened a new pull request, #42944:
URL: https://github.com/apache/spark/pull/42944

   ### What changes were proposed in this pull request?
   Increase Numpy minimum version to 1.21
   
   
   ### Why are the changes needed?
   
   - according to the [release 
history](https://pypi.org/project/numpy/#history), Numpy 1.15 was released 
about 5 years ago, while the last maintenance release in 1.21 was released 1 
year ago;
   - with 1.21 as the minimum version, we can discard all version checking in 
PySpark;
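The effect of raising the floor can be illustrated with a plain version comparison. `at_least` below is a hypothetical helper, not the code being removed (the real guards used `distutils.version.LooseVersion`, which accepts more formats than this parser):

```python
def at_least(version: str, minimum: str) -> bool:
    """Numeric comparison of dotted version strings (sketch only)."""

    def parse(v: str):
        # Compare at most the first three numeric components.
        return tuple(int(part) for part in v.split(".")[:3])

    return parse(version) >= parse(minimum)


# With 1.21 as the floor, every `>= 1.21` (or lower) guard in the listing
# below is unconditionally True, so those branches can be deleted.
assert at_least("1.21.0", "1.21")
assert at_least("1.26.4", "1.9")
assert not at_least("1.15.4", "1.21")
```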
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   search with `ag`
   
   ```
   (spark_dev_310) ➜  spark git:(master) ag --py 'numpy\.__version' python
   (spark_dev_310) ➜  spark git:(master)
   (spark_dev_310) ➜  spark git:(master) ag --py 'np\.__version' python
   python/pyspark/ml/image.py
   231:if LooseVersion(np.__version__) >= LooseVersion("1.9"):
   
   python/pyspark/pandas/typedef/typehints.py
   152:if sys.version_info >= (3, 8) and LooseVersion(np.__version__) >= 
LooseVersion("1.21"):
   
   python/pyspark/pandas/tests/test_typedef.py
   365:if sys.version_info >= (3, 8) and 
LooseVersion(np.__version__) >= LooseVersion("1.21"):
   
   python/pyspark/pandas/tests/computation/test_apply_func.py
   257:if sys.version_info >= (3, 8) and LooseVersion(np.__version__) 
>= LooseVersion("1.21"):
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





[GitHub] [spark] dcoliversun opened a new pull request, #42943: [WIP][SPARK-45175][K8S] download krb5.conf from remote storage in spark-submit on k8s

2023-09-15 Thread via GitHub


dcoliversun opened a new pull request, #42943:
URL: https://github.com/apache/spark/pull/42943

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[GitHub] [spark] eejbyfeldt closed pull request #41943: [SPARK-44376][BUILD] Fix maven build using scala 2.13 and Java 11 or later

2023-09-15 Thread via GitHub


eejbyfeldt closed pull request #41943: [SPARK-44376][BUILD] Fix maven build 
using scala 2.13 and Java 11 or later
URL: https://github.com/apache/spark/pull/41943





[GitHub] [spark] eejbyfeldt commented on pull request #41943: [SPARK-44376][BUILD] Fix maven build using scala 2.13 and Java 11 or later

2023-09-15 Thread via GitHub


eejbyfeldt commented on PR #41943:
URL: https://github.com/apache/spark/pull/41943#issuecomment-1720762469

   Closing this as it is no longer relevant. In 3.5, Scala 2.13 was downgraded, and the update will be done in https://github.com/apache/spark/pull/42918





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42929: [SPARK-45167][CONNECT] Python client must call `release_all`

2023-09-15 Thread via GitHub


HyukjinKwon commented on code in PR #42929:
URL: https://github.com/apache/spark/pull/42929#discussion_r1326864836


##
python/pyspark/sql/tests/connect/client/test_client.py:
##
@@ -147,15 +150,33 @@ def _stub_with(self, execute=None, attach=None):
 attach_ops=ResponseGenerator(attach) if attach is not None else 
None,
 )
 
+def assertEventually(self, callable, timeout_ms=1000):

Review Comment:
   There's `eventually` at `pyspark.testing.utils`. I can follow up
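A polling assertion of the kind both helpers implement can be sketched as follows. `assert_eventually` is a hypothetical standalone version, not the `pyspark.testing.utils` API:

```python
import time


def assert_eventually(condition, timeout_s=1.0, interval_s=0.01):
    """Poll `condition` until it returns True or the timeout expires.

    Hypothetical sketch in the spirit of the helpers mentioned above.
    """
    deadline = time.monotonic() + timeout_s
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except AssertionError as exc:  # tolerate transient assertion failures while polling
            last_error = exc
        time.sleep(interval_s)
    raise AssertionError("condition not met within %ss" % timeout_s) from last_error
```

Polling beats a fixed `sleep` in tests like these because it returns as soon as the condition holds, keeping the happy path fast while still bounding the worst case.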



##
python/pyspark/sql/tests/connect/client/test_client.py:
##
@@ -267,8 +301,13 @@ def ReattachExecute(self, *args, **kwargs):
 self.attach_calls += 1
 return self._attach_ops
 
-def ReleaseExecute(self, *args, **kwargs):
-self.release_calls += 1
+def ReleaseExecute(self, req: proto.ReleaseExecuteRequest, *args, 
**kwargs):
+print(req)

Review Comment:
   ```suggestion
   ```






[GitHub] [spark] viirya commented on pull request #42936: [SPARK-45174][CORE] Support `spark.deploy.maxDrivers`

2023-09-15 Thread via GitHub


viirya commented on PR #42936:
URL: https://github.com/apache/spark/pull/42936#issuecomment-1720755579

   Yea, the verbose logging messages could be an issue.





[GitHub] [spark] viirya commented on a diff in pull request #42936: [SPARK-45174][CORE] Support `spark.deploy.maxDrivers`

2023-09-15 Thread via GitHub


viirya commented on code in PR #42936:
URL: https://github.com/apache/spark/pull/42936#discussion_r1326863283


##
core/src/main/scala/org/apache/spark/deploy/master/Master.scala:
##
@@ -844,8 +845,8 @@ private[deploy] class Master(
   // We assign workers to each waiting driver in a round-robin fashion. 
For each driver, we
   // start from the last worker that was assigned a driver, and continue 
onwards until we have
   // explored all alive workers.
-  var launched = false
-  var isClusterIdle = true
+  var launched = (drivers.size - waitingDrivers.size) >= maxDrivers

Review Comment:
   Got it. I just worry about the case that submitted drivers are silently 
waiting there without any feedback to users.
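The capping arithmetic being reviewed can be modeled in a few lines. `drivers_to_launch` below is a toy, not Spark's actual Master code; it makes the concern concrete: drivers beyond the cap simply stay in the queue with no feedback:

```python
def drivers_to_launch(drivers, waiting_drivers, max_drivers):
    """Toy model of the driver cap under discussion (hypothetical).

    `drivers` is every submitted driver, `waiting_drivers` the subset not
    yet launched. With at most `max_drivers` running, only the spare slots
    can be filled from the waiting queue; the rest keep waiting silently.
    """
    running = len(drivers) - len(waiting_drivers)
    slots = max(0, max_drivers - running)
    launched = waiting_drivers[:slots]
    still_waiting = waiting_drivers[slots:]
    return launched, still_waiting
```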





