Re: [PR] [SPARK-48187][INFRA] Run `docs` only in PR builders and Java 21 Daily CI [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46463:
URL: https://github.com/apache/spark/pull/46463#issuecomment-2099880036

   Thank you, @HyukjinKwon and @yaooqinn .
   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48187][INFRA] Run `docs` only in PR builders and `build_non_ansi` Daily CI [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun closed pull request #46463: [SPARK-48187][INFRA] Run `docs` only 
in PR builders and `build_non_ansi` Daily CI
URL: https://github.com/apache/spark/pull/46463





Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-2099921722

   ```
   ==
   ERROR [15.001s]: test_aggregate 
(pyspark.pandas.tests.connect.groupby.test_parity_aggregate.GroupbyParityAggregateTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/groupby/test_aggregate.py",
 line 179, in test_aggregate
   stats_pdf.sort_values(by=[(20, "min"), (20, "max"), (30, 
"sum")]).reset_index(
 File 
"/usr/share/miniconda/envs/client-env/lib/python3.10/site-packages/pandas/util/_decorators.py",
 line 331, in wrapper
   return func(*args, **kwargs)
 File 
"/usr/share/miniconda/envs/client-env/lib/python3.10/site-packages/pandas/core/frame.py",
 line 6894, in sort_values
   keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
 File 
"/usr/share/miniconda/envs/client-env/lib/python3.10/site-packages/pandas/core/frame.py",
 line 6894, in 
   keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
 File 
"/usr/share/miniconda/envs/client-env/lib/python3.10/site-packages/pandas/core/generic.py",
 line 1838, in _get_label_or_level_values
   self._check_label_or_level_ambiguity(key, axis=axis)
 File 
"/usr/share/miniconda/envs/client-env/lib/python3.10/site-packages/pandas/core/generic.py",
 line 1780, in _check_label_or_level_ambiguity
   and key in self.axes[axis].names
   ValueError: The truth value of an array with more than one element is 
ambiguous. Use a.any() or a.all()
   --
   ```





[PR] [SPARK-48190][PYTHON][PS][TESTS] Introduce a helper function to drop metadata [spark]

2024-05-08 Thread via GitHub


zhengruifeng opened a new pull request, #46466:
URL: https://github.com/apache/spark/pull/46466

   ### What changes were proposed in this pull request?
   Introduce a helper function to drop metadata
   
   
   ### Why are the changes needed?
   The existing helper function `remove_metadata` in PS doesn't support nested types, so it cannot be reused in other places.
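
   As a rough illustration of the idea only (this is not the PR's code, which lives on the Python side), a recursive metadata drop over Spark SQL types might look like this Scala sketch:

   ```
   import org.apache.spark.sql.types._

   // Recursively rebuild a DataType with empty metadata so that nested
   // struct fields are covered too (arrays and maps are recursed into).
   def dropMetadata(dt: DataType): DataType = dt match {
     case StructType(fields) =>
       StructType(fields.map { f =>
         f.copy(dataType = dropMetadata(f.dataType), metadata = Metadata.empty)
       })
     case ArrayType(elementType, containsNull) =>
       ArrayType(dropMetadata(elementType), containsNull)
     case MapType(keyType, valueType, valueContainsNull) =>
       MapType(dropMetadata(keyType), dropMetadata(valueType), valueContainsNull)
     case other => other
   }
   ```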
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, test only
   
   
   ### How was this patch tested?
   ci
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-08 Thread via GitHub


yaooqinn commented on code in PR #46457:
URL: https://github.com/apache/spark/pull/46457#discussion_r1593525386


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##
@@ -197,8 +197,8 @@ trait SparkDateTimeUtils {
 rebaseJulianToGregorianDays(julianDays)
   }
 
-  private val zoneInfoClassName = "sun.util.calendar.ZoneInfo"
-  private val getOffsetsByWallHandle = {
+  private lazy val zoneInfoClassName = "sun.util.calendar.ZoneInfo"

Review Comment:
   ```suggestion
 private val zoneInfoClassName = "sun.util.calendar.ZoneInfo"
   ```






Re: [PR] [SPARK-48006][SQL]add SortOrder for window function which has no orde… [spark]

2024-05-08 Thread via GitHub


guixiaowen commented on PR #46243:
URL: https://github.com/apache/spark/pull/46243#issuecomment-2099929746

   Hi @yaooqinn, could you please help review this PR?





[PR] [MINOR][INFRA] Rename NON-ANSI to Non-ANSI, and Public Snapshot to Public snapshot [spark]

2024-05-08 Thread via GitHub


HyukjinKwon opened a new pull request, #46467:
URL: https://github.com/apache/spark/pull/46467

   ### What changes were proposed in this pull request?
   
   Minor renaming to match other names.
   
   ### Why are the changes needed?
   
   For consistency.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   N/A
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-08 Thread via GitHub


cloud-fan commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1593536803


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -500,7 +501,11 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
 case TimestampType =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>
-val t = rs.getTimestamp(pos + 1)
+val t = dialect.getDatabaseCalendar match {
+  case Some(cal) => rs.getTimestamp(pos + 1, cal)

Review Comment:
   shall we always call `get/setTimestamp` with a calendar instance of the 
session local timezone? We can add a config to restore the old behavior, 
instead of adding a new API in Dialect.
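
   A minimal Scala sketch of the alternative described above (not the PR's code; `sessionLocalTimeZone` is an assumed input rather than an actual parameter in `JdbcUtils`):

   ```
   import java.sql.{ResultSet, Timestamp}
   import java.util.{Calendar, TimeZone}

   // `sessionLocalTimeZone` is assumed to come from the session's SQLConf.
   def readTimestamp(rs: ResultSet, pos: Int, sessionLocalTimeZone: String): Timestamp = {
     val cal = Calendar.getInstance(TimeZone.getTimeZone(sessionLocalTimeZone))
     // JDBC interprets the stored wall-clock value using the calendar's timezone.
     rs.getTimestamp(pos + 1, cal)
   }
   ```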






Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]

2024-05-08 Thread via GitHub


grundprinzip commented on PR #46434:
URL: https://github.com/apache/spark/pull/46434#issuecomment-2099962569

   yes, done.





[PR] [SPARK-47018][BUILD][SQL][HIVE] Bump built-in Hive to 2.3.10 [spark]

2024-05-08 Thread via GitHub


pan3793 opened a new pull request, #46468:
URL: https://github.com/apache/spark/pull/46468

   
   
   ### What changes were proposed in this pull request?
   
   This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with 
two additional changes:
   
   - Due to breaking API changes in Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
   - Version management of `commons-lang:2.6` is removed; it comes from Hive's transitive dependencies, and Hive 2.3.10 drops it in https://github.com/apache/hive/pull/4892
   
   This is the first part of https://github.com/apache/spark/pull/45372
   
   ### Why are the changes needed?
   
   
   Bump Hive to the latest 2.3.x version, preparing for upgrading Guava and dropping vulnerable dependencies like Jackson 1.x / Jodd.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass GA. (Waiting for @sunchao to complete the 2.3.10 release so that the jars become visible on Maven Central.)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-205116

   ```
   ==
   ERROR [2.562s]: test_apply_infer_schema_without_shortcut 
(pyspark.pandas.tests.connect.groupby.test_parity_apply_func.GroupbyParityApplyFuncTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/groupby/test_apply_func.py",
 line 240, in test_apply_infer_schema_without_shortcut
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
525, in assert_eq
   return assertPandasOnSparkEqual(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
457, in assertPandasOnSparkEqual
   actual = actual.to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/frame.py", 
line 5428, in to_pandas
   return self._to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/frame.py", 
line 5434, in _to_pandas
   return self._internal.to_pandas_frame.copy()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
   for response in self._execute_and_fetch_as_iterator(req):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
   self._handle_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
   self._handle_rpc_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
   raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.PythonException: 
 An exception was thrown from the Python worker. Please see the stack trace 
below.
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1834, in main
   process()
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1826, in process
   serializer.dump_stream(out_iter, outfile)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 531, in dump_stream
   return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 104, in dump_stream
   for batch in iterator:
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 524, in init_stream_yield_batches
   for series in iterator:
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1610, in mapper
   return f(keys, vals)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
488, in 
   return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
478, in wrapped
   result = f(pd.concat(value_series, axis=1))
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 
134, in wrapper
   return f(*args, **kwargs)
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/groupby.py", 
line 2307, in rename_output
   pdf.columns = return_schema.names
   for series in iterator:
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1610, in mapper
   return f(keys, vals)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
488, in 
   return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
478, in wrapped
   result = f(pd.concat(value_series, axis=1))
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 
134, in wrapper
   return f(*args, **kwargs)
 File "/

[PR] [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository [spark]

2024-05-08 Thread via GitHub


HyukjinKwon opened a new pull request, #46470:
URL: https://github.com/apache/spark/pull/46470

   ### What changes were proposed in this pull request?
   
   This PR is a sort of followup of https://github.com/apache/spark/pull/46361. It proposes to run TPC-DS and Docker integration tests in PRs (which does not consume ASF resources).
   
   ### Why are the changes needed?
   
   The TPC-DS and Docker integration tests should at least run in a PR when the PR touches the related code.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46470:
URL: https://github.com/apache/spark/pull/46470#issuecomment-214838

   Merged to master.





Re: [PR] [SPARK-48192][INFRA] Enable TPC-DS tests in forked repository [spark]

2024-05-08 Thread via GitHub


HyukjinKwon closed pull request #46470: [SPARK-48192][INFRA] Enable TPC-DS 
tests in forked repository
URL: https://github.com/apache/spark/pull/46470





Re: [PR] [MINOR][INFRA] Rename builds to have consistent names [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46467:
URL: https://github.com/apache/spark/pull/46467#issuecomment-219391

   Merged to master.





Re: [PR] [MINOR][INFRA] Rename builds to have consistent names [spark]

2024-05-08 Thread via GitHub


HyukjinKwon closed pull request #46467: [MINOR][INFRA] Rename builds to have 
consistent names
URL: https://github.com/apache/spark/pull/46467





[PR] [SPARK-48193][INFRA] Make `maven-deploy-plugin` retry 3 times [spark]

2024-05-08 Thread via GitHub


panbingkun opened a new pull request, #46471:
URL: https://github.com/apache/spark/pull/46471

   ### What changes were proposed in this pull request?
   This PR aims to make the Maven plugin `maven-deploy-plugin` retry `3` times.
   
   
   ### Why are the changes needed?
   I found that our daily scheduled `publish snapshot` GA workflow often fails.
   https://github.com/apache/spark/actions/workflows/publish_snapshot.yml
   https://github.com/apache/spark/assets/15246973/2a759bf4-85de-4bc2-aff7-a226bd475321
   I tried to make it succeed as often as possible by changing the number of retries from `1` (the default) to `3`.
   
https://maven.apache.org/plugins/maven-deploy-plugin/deploy-mojo.html#retryFailedDeploymentCount
   
https://maven.apache.org/plugins/maven-deploy-plugin/examples/deploy-network-issues.html#configuring-multiple-tries
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Keep observing the daily scheduled `publish snapshot` GA workflow.
   https://github.com/apache/spark/actions/workflows/publish_snapshot.yml
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-48193][INFRA] Make `maven-deploy-plugin` retry 3 times [spark]

2024-05-08 Thread via GitHub


panbingkun commented on PR #46471:
URL: https://github.com/apache/spark/pull/46471#issuecomment-2100018222

   cc @HyukjinKwon @dongjoon-hyun 





Re: [PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-08 Thread via GitHub


LuciferYang commented on code in PR #46457:
URL: https://github.com/apache/spark/pull/46457#discussion_r1593672542


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##
@@ -197,8 +197,8 @@ trait SparkDateTimeUtils {
 rebaseJulianToGregorianDays(julianDays)
   }
 
-  private val zoneInfoClassName = "sun.util.calendar.ZoneInfo"
-  private val getOffsetsByWallHandle = {
+  private lazy val zoneInfoClassName = "sun.util.calendar.ZoneInfo"

Review Comment:
   Not sure, but:
   1. Using `TimeZone.getDefault.getOffset(localMillis)`, all tests pass; perhaps the unit test coverage is not comprehensive, but what would a counterexample be? (A small sketch of this alternative follows below.)
   2. Looking at `fromJavaDate`, the inverse of `toJavaDate`, it does not take the `ZoneInfo` scenario into account.
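
   For reference, a minimal sketch of the call mentioned in point 1 (not the PR's change); note that per the JDK Javadoc, `TimeZone.getOffset(long)` takes milliseconds since the epoch in GMT:

   ```
   import java.util.TimeZone

   // Offset (in milliseconds) of the default timezone at the given instant.
   def defaultZoneOffset(millis: Long): Int =
     TimeZone.getDefault.getOffset(millis)
   ```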






Re: [PR] [SPARK-48171][CORE] Clean up the use of deprecated constructors of `o.rocksdb.Logger` [spark]

2024-05-08 Thread via GitHub


LuciferYang commented on PR #46436:
URL: https://github.com/apache/spark/pull/46436#issuecomment-2100095240

   Thanks @dongjoon-hyun ~





Re: [PR] [SPARK-47018][BUILD][SQL][HIVE] Bump built-in Hive to 2.3.10 [spark]

2024-05-08 Thread via GitHub


pan3793 commented on PR #46468:
URL: https://github.com/apache/spark/pull/46468#issuecomment-2100110174

   @sunchao please ping me after you finish the jar release :)





Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub


uros-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593716451


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,445 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)

Review Comment:
   Should we consider using bit 31 for the indeterminate collation?
   
   Then, the indeterminate collation could conveniently be -1 / all 1s in binary.
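
   For illustration (a hedged Scala sketch, not `CollationFactory` code): under the quoted layout, INDETERMINATE lives in bits 30-29, while the bit-31 alternative suggested above would allow the indeterminate id to simply be -1 / all 1s:

   ```
   // Provider field sits at bits 30-29 in the proposed layout.
   val IMPLEMENTATION_PROVIDER_OFFSET = 29
   val INDETERMINATE_ORDINAL = 2 // binary 10

   // Current proposal: indeterminate id = 0x40000000.
   val indeterminateId: Int = INDETERMINATE_ORDINAL << IMPLEMENTATION_PROVIDER_OFFSET

   // Suggested alternative: reserve bit 31, so the indeterminate id could simply be -1 (all 1s).
   val indeterminateAsAllOnes: Int = -1

   println(f"current: 0x$indeterminateId%08X, alternative: 0x$indeterminateAsAllOnes%08X")
   ```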






Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub


uros-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593724507


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,445 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)
+ * bit 28:0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive 
for ICU
+ * bit 27:0 for utf8-binary / 0 = accent-sensitive, 1 = 
accent-insensitive for ICU
+ * bit 26-25: zeroes, reserved for punctuation sensitivity
+ * bit 24-23: zeroes, reserved for first letter preference
+ * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper
+ * bit 20-19: zeroes, reserved for space trimming
+ * bit 18-17: zeroes, reserved for version
+ * bit 16-12: zeroes
+ * bit 11-0:  zeroes for utf8-binary / locale id for ICU
  */
-public Collation(
-String collationName,
-Collator collator,
-String version,
-boolean supportsBinaryEquality,
-boolean supportsBinaryOrdering,
-boolean supportsLowercaseEquality) {
-  this(
-collationName,
-collator,
-(s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-version,
-s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-supportsBinaryEquality,
-supportsBinaryOrdering,
-supportsLowercaseEquality);
+private abstract static class CollationSpec {
+  protected enum ImplementationProvider {
+UTF8_BINARY, ICU, INDETERMINATE
+  }
+
+  protected enum CaseSensitivity {
+CS, CI
+  }
+
+  protected enum AccentSensitivity {
+AS, AI
+  }
+
+  protected enum CaseConversion {
+UNSPECIFIED, LCASE, UCASE
+  }
+
+  protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+  protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11;
+  protected static final int CASE_SENSITIVITY_OFFSET = 28;
+  protected static final int CASE_SENSITIVITY_MASK = 0b1;
+  protected static final int ACCENT_SENSITIVITY_OFFSET = 27;
+  protected static final int ACCENT_SENSITIVITY_MASK = 0b1;
+  protected static final int CASE_CONVERSION_OFFSET = 21;
+  protected static final int CASE_CONVERSION_MASK = 0b11;
+  protected static final int LOCALE_OFFSET = 0;
+  protected static final int LOCALE_MASK = 0x0FFF;
+
+  protected static final int INDETERMINATE_COLLATION_ID =
+ImplementationProvider.INDETERMINATE.ordinal() << 
IMPLEMENTATION_PROVIDER_OFFSET;
+
+  protected final CaseSensitivity caseSensitivity;
+  protected final AccentSensitivity accentSensitivity;
+  protected final CaseConversion caseConversion;
+  protected final String locale;
+  protected final int collationId;
+
+  protected CollationSpec(
+  String locale,
+  CaseSensitivity caseSensitivity,
+  AccentSensitivity accentSensitivity,
+  CaseConversion caseConversion) {
+this.locale = locale;
+this.caseSensitivity = caseSensitivity;
+this.accentSensitivity = accentSensitivity;
+this.caseConversion = caseConversion;
+this.collationId = getCollationId();
+  }
+
+  private static final Map collationMap = new 
ConcurrentHashMap<>();
+
+  public static Collation fetchCollation(int collationId) throws 
SparkException {
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+} else if (collationMap.containsKey(collationId)) {
+  return collationMap.get(collationId);
+} else {
+  CollationSpec spec;
+  int implementationProviderOrdinal =
+(collationId >> IMPLEMENTATION_PROVIDER_OFFSET) & 
IMPLEMENTATION_PROVIDER_MASK;
+  if (implementationProviderOrdinal >= 
ImplementationProvider.values().length) {
+throw SparkException.internalError("Invalid collation 
implementation provider");
+  } else {
+ImplementationProvider implementationProvider = 
ImplementationProvider.values()[
+  implementationProviderOrdinal];
+if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+  spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+} else if (implementationProvider == ImplementationProvider.ICU) {
+  spec = CollationSpecICU.fromCollationId(collationId);
+} else {
+  throw SparkException.internalError("Cannot instantiate 
indeterminate collation");
+}
+Collation collation = spec.buildCollation();
+c

Re: [PR] [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE [spark]

2024-05-08 Thread via GitHub


stefankandic commented on PR #46280:
URL: https://github.com/apache/spark/pull/46280#issuecomment-2100170222

   @cloud-fan please take a look when you have the time





Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-2100178195

   ```
   ==
   ERROR [2.615s]: test_map 
(pyspark.pandas.tests.connect.indexes.test_parity_category.CategoricalIndexParityTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/indexes/test_category.py",
 line 408, in test_map
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
536, in assert_eq
   robj = self._to_pandas(right)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
557, in _to_pandas
   return obj.to_pandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
524, in to_pandas
   return self._to_pandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
530, in _to_pandas
   return self._to_internal_pandas().copy()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
503, in _to_internal_pandas
   return self._psdf._internal.to_pandas_frame.index
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
   for response in self._execute_and_fetch_as_iterator(req):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
   self._handle_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
   self._handle_rpc_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
   raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.PythonException: 
 An exception was thrown from the Python worker. Please see the stack trace 
below.
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1818, in main
   func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1728, in read_udfs
   read_single_udf(
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
783, in read_single_udf
   f, return_type = read_command(pickleSer, infile)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker_util.py", 
line 64, in read_command
   command = serializer._read_with_length(file)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 173, in _read_with_length
   return self.loads(obj)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 473, in loads
   return cloudpickle.loads(obj, encoding=encoding)
   ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'
   --
   ```





Re: [PR] [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [spark]

2024-05-08 Thread via GitHub


uros-db commented on PR #46206:
URL: https://github.com/apache/spark/pull/46206#issuecomment-2100191098

   @cloud-fan some tests are stale here, but overall I think the changes in this PR are good.
   However, the changes are indeed large and merge conflicts arise daily, so could you please review this?





Re: [PR] [SPARK-48161][SQL] Add collation support for JSON expressions [spark]

2024-05-08 Thread via GitHub


uros-db commented on PR #46462:
URL: https://github.com/apache/spark/pull/46462#issuecomment-2100210489

   @cloud-fan all checks are good here; ready for review.





Re: [PR] [SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect [spark]

2024-05-08 Thread via GitHub


milastdbx commented on code in PR #45537:
URL: https://github.com/apache/spark/pull/45537#discussion_r1593776643


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -500,7 +501,11 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
 case TimestampType =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>
-val t = rs.getTimestamp(pos + 1)
+val t = dialect.getDatabaseCalendar match {
+  case Some(cal) => rs.getTimestamp(pos + 1, cal)

Review Comment:
   Unfortunately, it really depends on the driver implementation whether it handles the difference between TZ and NTZ in its getTimestamp/setTimestamp code path.






Re: [PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-08 Thread via GitHub


yaooqinn closed pull request #46457: [SPARK-48185][SQL] Fix 'symbolic reference 
class is not accessible: class sun.util.calendar.ZoneInfo'
URL: https://github.com/apache/spark/pull/46457





[PR] [SPARK-47965][SQL][FOLLOW-UP] Uses `null` as its default value for `OptionalConfigEntry` [spark]

2024-05-08 Thread via GitHub


HyukjinKwon opened a new pull request, #46472:
URL: https://github.com/apache/spark/pull/46472

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   
   Manually:
   
   ```
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-08 Thread via GitHub


yaooqinn commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2100236987

   It looks like the failure just shifted down to the next one:
   
   ```
   [info] - SPARK-43923: commands send events ((extension {
   [info]   type_url: "type.googleapis.com/spark.connect.ExamplePluginCommand"
   [info]   value: "\n\vSPARK-43923"
   [info] }
   [info] ,None)) *** FAILED *** (23 milliseconds)
   [info]   VerifyEvents.this.listener.executeHolder.isDefined was false 
(SparkConnectServiceSuite.scala:876)
   ```





Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-2100237805

   ```
   ==
   ERROR [9.997s]: test_map 
(pyspark.pandas.tests.connect.indexes.test_parity_base.IndexesParityTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/indexes/test_base.py",
 line 2642, in test_map
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
536, in assert_eq
   robj = self._to_pandas(right)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
557, in _to_pandas
   return obj.to_pandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
524, in to_pandas
   return self._to_pandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
530, in _to_pandas
   return self._to_internal_pandas().copy()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/indexes/base.py", line 
503, in _to_internal_pandas
   return self._psdf._internal.to_pandas_frame.index
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
   for response in self._execute_and_fetch_as_iterator(req):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
   self._handle_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
   self._handle_rpc_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
   raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.PythonException:
 An exception was thrown from the Python worker. Please see the stack trace 
below.
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1818, in main
   func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1728, in read_udfs
   read_single_udf(
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
783, in read_single_udf
   f, return_type = read_command(pickleSer, infile)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker_util.py", 
line 64, in read_command
   command = serializer._read_with_length(file)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 173, in _read_with_length
   return self.loads(obj)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
line 473, in loads
   return cloudpickle.loads(obj, encoding=encoding)
   ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'
   
   
   -
   ```





Re: [PR] [SPARK-48193][INFRA] Make `maven-deploy-plugin` retry 3 times [spark]

2024-05-08 Thread via GitHub


HyukjinKwon closed pull request #46471: [SPARK-48193][INFRA] Make 
`maven-deploy-plugin` retry 3 times
URL: https://github.com/apache/spark/pull/46471





Re: [PR] [SPARK-48193][INFRA] Make `maven-deploy-plugin` retry 3 times [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46471:
URL: https://github.com/apache/spark/pull/46471#issuecomment-2100240322

   Merged to master.





Re: [PR] [SPARK-47965][SQL][FOLLOW-UP] Uses `null` as its default value for `OptionalConfigEntry` [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46472:
URL: https://github.com/apache/spark/pull/46472#issuecomment-2100246019

   I am going to merge this - it's a clean revert.
   
   Merged to master.





Re: [PR] [SPARK-47965][SQL][FOLLOW-UP] Uses `null` as its default value for `OptionalConfigEntry` [spark]

2024-05-08 Thread via GitHub


HyukjinKwon closed pull request #46472: [SPARK-47965][SQL][FOLLOW-UP] Uses 
`null` as its default value for `OptionalConfigEntry`
URL: https://github.com/apache/spark/pull/46472





[PR] [SPARK-48087][PYTHON][TESTS][3.5] Remove obsolete comment about UDTF test [spark]

2024-05-08 Thread via GitHub


HyukjinKwon opened a new pull request, #46473:
URL: https://github.com/apache/spark/pull/46473

   ### What changes were proposed in this pull request?
   
   UDTF is experimental, and its design has not been fully implemented yet. It is not documented either, so it is left as a breaking change.
   
   ### Why are the changes needed?
   
   To remove the obsolete comment.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   N/A
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48087][PYTHON][TESTS][3.5] Remove obsolete comment about UDTF test [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46473:
URL: https://github.com/apache/spark/pull/46473#issuecomment-2100259238

   cc @allisonwang-db and @dtenedor 





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2100262105

   😢 





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2100263774

   Let me just revert it for now.





Re: [PR] [SPARK-48190][PYTHON][PS][TESTS] Introduce a helper function to drop metadata [spark]

2024-05-08 Thread via GitHub


zhengruifeng closed pull request #46466: [SPARK-48190][PYTHON][PS][TESTS] 
Introduce a helper function to drop metadata
URL: https://github.com/apache/spark/pull/46466





Re: [PR] [SPARK-48190][PYTHON][PS][TESTS] Introduce a helper function to drop metadata [spark]

2024-05-08 Thread via GitHub


zhengruifeng commented on PR #46466:
URL: https://github.com/apache/spark/pull/46466#issuecomment-2100299337

   merged to master





Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593836057


##
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala:
##
@@ -33,8 +33,8 @@ import org.apache.hive.service.Service.STATE
 import org.apache.hive.service.auth.HiveAuthFactory
 import org.apache.hive.service.cli._
 import org.apache.hive.service.server.HiveServer2
-import org.slf4j.Logger
 
+import org.apache.spark.internal.Logger

Review Comment:
   Used in this place:
   
https://github.com/apache/spark/blob/bd896cac168aa5793413058ca706c73705edbf96/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala#L116-L119
   






Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-2100307118

   ```
   ==
   ERROR [2.158s]: test_between_time 
(pyspark.pandas.tests.connect.series.test_parity_compute.SeriesParityComputeTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/series/test_compute.py",
 line 578, in test_between_time
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
536, in assert_eq
   robj = self._to_pandas(right)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
557, in _to_pandas
   return obj.to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 1718, in to_pandas
   return self._to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 1724, in _to_pandas
   return self._to_internal_pandas().copy()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 7333, in _to_internal_pandas
   return self._psdf._internal.to_pandas_frame[self.name]
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
   for response in self._execute_and_fetch_as_iterator(req):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
   self._handle_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
   self._handle_rpc_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1539, in _handle_rpc_error
   raise convert_exception(info, status.message) from None
   pyspark.errors.exceptions.connect.PythonException: 
 An exception was thrown from the Python worker. Please see the stack trace 
below.
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1834, in main
   process()
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1826, in process
   serializer.dump_stream(out_iter, outfile)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 531, in dump_stream
   return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 104, in dump_stream
   for batch in iterator:
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 524, in init_stream_yield_batches
   for series in iterator:
 File 
"/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1529, in func
   for result_batch, result_type in result_iter:
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/groupby.py", 
line 2295, in rename_output
   pdf = func(pdf)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/accessors.py", line 
350, in new_func
   return original_func(o, *args, **kwds)
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/frame.py", 
line 3612, in pandas_between_time
   return pdf.between_time(start_time, end_time, include_start, 
include_end).reset_index()
 File 
"/usr/share/miniconda/envs/server-env/lib/python3.10/site-packages/pandas/core/generic.py",
 line 9371, in between_time
   raise TypeError("Index must be DatetimeIndex")
   TypeError: Index must be DatetimeIndex
   ==
   ERROR [0.249s]: test_factorize 
(pyspark.pandas.tests.connect.series.test_parity_compute.SeriesParityComputeTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/series/test_compute.py",
 line 426, in test_factor

Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593840887


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/TSetIpAddressProcessor.java:
##
@@ -38,7 +39,7 @@
  */
 public class TSetIpAddressProcessor extends 
TCLIService.Processor {
 
-  private static final Logger LOGGER = 
LoggerFactory.getLogger(TSetIpAddressProcessor.class.getName());
+  private static final Logger LOGGER = 
LoggerFactory.getLogger(TSetIpAddressProcessor.class);

Review Comment:
   
https://github.com/qos-ch/slf4j/blob/1d82519b07d7d755096d916dfbfaddd2d0e07c4f/slf4j-api/src/main/java/org/slf4j/LoggerFactory.java#L446-L456
   https://github.com/apache/spark/assets/15246973/c86c7bdd-babb-44c1-ae31-c4ce5fbf8fbd
   






Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593843557


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/ServiceOperations.java:
##
@@ -129,9 +132,8 @@ public static Exception stopQuietly(Service service) {
 try {
   stop(service);
 } catch (Exception e) {
-  LOG.warn("When stopping the service " + service.getName()
-   + " : " + e,

Review Comment:
   I think `" : " + e` is redundant because the `Throwable` `e` is already passed as the last argument, so the logger prints its message and stack trace anyway.
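   For illustration, a minimal Scala sketch (hypothetical names, not the actual `ServiceOperations` code) of why the trailing `e` adds nothing when the `Throwable` is passed as the last argument:
   ```scala
   import org.slf4j.LoggerFactory

   object StopQuietlySketch {
     private val log = LoggerFactory.getLogger(getClass)

     def stopQuietly(serviceName: String)(stopAction: => Unit): Unit = {
       try stopAction
       catch {
         case e: Exception =>
           // Passing `e` as the last argument makes slf4j print its message and
           // full stack trace, so concatenating `e` into the message string
           // would only repeat the same information.
           log.warn(s"When stopping the service $serviceName", e)
       }
     }
   }
   ```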






Re: [PR] [SPARK-48087][PYTHON][TESTS][3.5] Remove obsolete comment about UDTF test [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46473:
URL: https://github.com/apache/spark/pull/46473#issuecomment-2100317791

   Merged to branch-3.5.





Re: [PR] [SPARK-48087][PYTHON][TESTS][3.5] Remove obsolete comment about UDTF test [spark]

2024-05-08 Thread via GitHub


HyukjinKwon closed pull request #46473: [SPARK-48087][PYTHON][TESTS][3.5] 
Remove obsolete comment about UDTF test
URL: https://github.com/apache/spark/pull/46473





Re: [PR] [DO-NOT-MERGE] Test in different versions [spark]

2024-05-08 Thread via GitHub


HyukjinKwon commented on PR #46417:
URL: https://github.com/apache/spark/pull/46417#issuecomment-2100355577

   ```
   ==
   ERROR [2.060s]: test_frame_apply_batch_without_shortcut 
(pyspark.pandas.tests.connect.test_parity_categorical.CategoricalParityTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/test_categorical.py",
 line 476, in test_frame_apply_batch_without_shortcut
   self.test_frame_apply_batch()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/test_categorical.py",
 line 457, in test_frame_apply_batch
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
525, in assert_eq
   return assertPandasOnSparkEqual(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
457, in assertPandasOnSparkEqual
   actual = actual.to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/frame.py", 
line 5428, in to_pandas
   return self._to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/frame.py", 
line 5434, in _to_pandas
   return self._internal.to_pandas_frame.copy()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execute_and_fetch
   for response in self._execute_and_fetch_as_iterator(req):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1264, in _execute_and_fetch_as_iterator
   self._handle_error(error)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1503, in _handle_error
   self._handle_rpc_error(error)
   return cloudpickle.loads(obj, encoding=encoding)
   ModuleNotFoundError: No module named 'pandas.core.indexes.numeric'
   ==
   ERROR [1.477s]: test_series_transform_batch_without_shortcut 
(pyspark.pandas.tests.connect.test_parity_categorical.CategoricalParityTests)
   --
   Traceback (most recent call last):
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/test_categorical.py",
 line 
602,
 in test_series_transform_batch_without_shortcut
   self.test_series_transform_batch()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/tests/test_categorical.py",
 line 583, in test_series_transform_batch
   self.assert_eq(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
525, in assert_eq
   return assertPandasOnSparkEqual(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/testing/pandasutils.py", line 
457, in assertPandasOnSparkEqual
   actual = actual.to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 1718, in to_pandas
   return self._to_pandas()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 1724, in _to_pandas
   return self._to_internal_pandas().copy()
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/series.py", 
line 7333, in _to_internal_pandas
   return self._psdf._internal.to_pandas_frame[self.name]
 File "/home/runner/work/spark/spark-3.5/python/pyspark/pandas/utils.py", 
line 600, in wrapped_lazy_property
   setattr(self, attr_name, fn(self))
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/pandas/internal.py", line 
1115, in to_pandas_frame
   pdf = sdf.toPandas()
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
line 1663, in toPandas
   return self._session.client.to_pandas(query)
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 873, in to_pandas
   table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
 File 
"/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py", 
line 1283, in _execut

Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593871899


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java:
##
@@ -84,13 +87,15 @@ public synchronized void init(HiveConf hiveConf) {
 
   private void createBackgroundOperationPool() {
 int poolSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_THREADS);
-LOG.info("HiveServer2: Background operation thread pool size: " + 
poolSize);
+LOG.info("HiveServer2: Background operation thread pool size: {}",
+  MDC.of(LogKeys.THREAD_POOL_SIZE$.MODULE$, poolSize));
 int poolQueueSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_WAIT_QUEUE_SIZE);
-LOG.info("HiveServer2: Background operation thread wait queue size: " + 
poolQueueSize);
+LOG.info("HiveServer2: Background operation thread wait queue size: {}",
+  MDC.of(LogKeys.THREAD_WAIT_QUEUE_SIZE$.MODULE$, poolQueueSize));
 long keepAliveTime = HiveConf.getTimeVar(
 hiveConf, ConfVars.HIVE_SERVER2_ASYNC_EXEC_KEEPALIVE_TIME, 
TimeUnit.SECONDS);
-LOG.info(
-"HiveServer2: Background operation thread keepalive time: " + 
keepAliveTime + " seconds");
+LOG.info("HiveServer2: Background operation thread keepalive time: {} ms",
+  MDC.of(LogKeys.THREAD_KEEP_ALIVE_TIME$.MODULE$, keepAliveTime * 1000));

Review Comment:
   Let's turn our `time unit` into `ms`
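   For illustration, a tiny Scala sketch (placeholder value) of the seconds-to-milliseconds conversion, so the number that gets logged matches the `ms` suffix:
   ```scala
   import java.util.concurrent.TimeUnit

   // The keepalive time is configured in seconds; convert it explicitly so the
   // value printed next to "ms" is actually milliseconds.
   val keepAliveTimeSeconds = 10L  // placeholder for the configured value
   val keepAliveTimeMs = TimeUnit.SECONDS.toMillis(keepAliveTimeSeconds)
   println(s"HiveServer2: Background operation thread keepalive time: $keepAliveTimeMs ms")
   ```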






Re: [PR] [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions [spark]

2024-05-08 Thread via GitHub


zhouyifan279 commented on PR #41628:
URL: https://github.com/apache/spark/pull/41628#issuecomment-2100365610

   To eliminate the data inconsistency issue, we should handle custom partitions in `HadoopMapReduceCommitProtocol.commitJob` instead of writing to the final output path and then moving the partition dir to the custom location:
   1. Get all partitionPaths from `TaskCommitMessage.obj._2` (`TaskCommitMessage.obj._1` is empty because we do not have `customPartitionLocations` at this step).
   2. Use the partitionPaths to get the matchingPartitions, then get the customPartitionLocations like what we do in this PR.
   3. Move the partitionPaths to their final locations according to the customPartitionLocations (a rough sketch follows below).
   
   @jeanlyn @bowenliang123 @attilapiros what do you think?
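   For illustration only, a rough Scala sketch of that flow (class and parameter names are hypothetical, not the actual `HadoopMapReduceCommitProtocol` API):
   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}

   // Hypothetical sketch of steps 1-3 above, done inside commitJob.
   class CustomLocationAwareCommitProtocol(
       fs: FileSystem,
       stagingDir: Path,
       tableLocation: Path,
       // Step 2: resolves partition paths (e.g. "p=1") to custom locations;
       // supplied by InsertIntoHadoopFsRelationCommand.
       resolveCustomLocations: Set[String] => Map[String, String]) {

     def commitJob(partitionPaths: Set[String]): Unit = {
       // Step 1: partitionPaths are gathered from the TaskCommitMessages.
       val customLocations = resolveCustomLocations(partitionPaths)
       // Step 3: move each staged partition directory straight to its final
       // location, honoring a custom location when one exists.
       partitionPaths.foreach { part =>
         val src = new Path(stagingDir, part)
         val dst = customLocations.get(part).map(loc => new Path(loc))
           .getOrElse(new Path(tableLocation, part))
         fs.mkdirs(dst.getParent)
         fs.rename(src, dst)
       }
     }
   }
   ```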





Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593875379


##
common/utils/src/main/java/org/apache/spark/internal/Logger.java:
##
@@ -193,4 +193,8 @@ static MessageThrowable of(String message, Throwable 
throwable) {
   return new MessageThrowable(message, throwable);
 }
   }
+

Review Comment:
   The following code will use it:
   https://github.com/apache/spark/assets/15246973/095e1c1b-eea8-4082-9a68-58127a2e
   






Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on code in PR #46450:
URL: https://github.com/apache/spark/pull/46450#discussion_r1593883576


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/ServiceUtils.java:
##
@@ -18,7 +18,7 @@
 
 import java.io.IOException;
 
-import org.slf4j.Logger;
+import org.apache.spark.internal.Logger;

Review Comment:
   The following code uses it.
   I searched the Spark code base, and it seems that the `public static void cleanup(Logger log, java.io.Closeable... closeables)` method is never called. Maybe we can delete the following code?
   
https://github.com/apache/spark/blob/d7f69e7003a3c7e7ad22a39e6aaacd183d26d326/sql/hive-thriftserver/src/main/java/org/apache/hive/service/ServiceUtils.java#L55-L67






Re: [PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-08 Thread via GitHub


panbingkun commented on PR #46450:
URL: https://github.com/apache/spark/pull/46450#issuecomment-2100380424

   cc @gengliangwang 





Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593900492


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,445 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)
+ * bit 28:0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive 
for ICU
+ * bit 27:0 for utf8-binary / 0 = accent-sensitive, 1 = 
accent-insensitive for ICU
+ * bit 26-25: zeroes, reserved for punctuation sensitivity
+ * bit 24-23: zeroes, reserved for first letter preference
+ * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper
+ * bit 20-19: zeroes, reserved for space trimming
+ * bit 18-17: zeroes, reserved for version
+ * bit 16-12: zeroes
+ * bit 11-0:  zeroes for utf8-binary / locale id for ICU
  */
-public Collation(
-String collationName,
-Collator collator,
-String version,
-boolean supportsBinaryEquality,
-boolean supportsBinaryOrdering,
-boolean supportsLowercaseEquality) {
-  this(
-collationName,
-collator,
-(s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-version,
-s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-supportsBinaryEquality,
-supportsBinaryOrdering,
-supportsLowercaseEquality);
+private abstract static class CollationSpec {
+  protected enum ImplementationProvider {
+UTF8_BINARY, ICU, INDETERMINATE
+  }
+
+  protected enum CaseSensitivity {
+CS, CI
+  }
+
+  protected enum AccentSensitivity {
+AS, AI
+  }
+
+  protected enum CaseConversion {
+UNSPECIFIED, LCASE, UCASE
+  }
+
+  protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+  protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11;
+  protected static final int CASE_SENSITIVITY_OFFSET = 28;
+  protected static final int CASE_SENSITIVITY_MASK = 0b1;
+  protected static final int ACCENT_SENSITIVITY_OFFSET = 27;
+  protected static final int ACCENT_SENSITIVITY_MASK = 0b1;
+  protected static final int CASE_CONVERSION_OFFSET = 21;
+  protected static final int CASE_CONVERSION_MASK = 0b11;
+  protected static final int LOCALE_OFFSET = 0;
+  protected static final int LOCALE_MASK = 0x0FFF;
+
+  protected static final int INDETERMINATE_COLLATION_ID =
+ImplementationProvider.INDETERMINATE.ordinal() << 
IMPLEMENTATION_PROVIDER_OFFSET;
+
+  protected final CaseSensitivity caseSensitivity;
+  protected final AccentSensitivity accentSensitivity;
+  protected final CaseConversion caseConversion;
+  protected final String locale;
+  protected final int collationId;
+
+  protected CollationSpec(
+  String locale,
+  CaseSensitivity caseSensitivity,
+  AccentSensitivity accentSensitivity,
+  CaseConversion caseConversion) {
+this.locale = locale;
+this.caseSensitivity = caseSensitivity;
+this.accentSensitivity = accentSensitivity;
+this.caseConversion = caseConversion;
+this.collationId = getCollationId();
+  }
+
+  private static final Map collationMap = new 
ConcurrentHashMap<>();
+
+  public static Collation fetchCollation(int collationId) throws 
SparkException {
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+} else if (collationMap.containsKey(collationId)) {
+  return collationMap.get(collationId);
+} else {
+  CollationSpec spec;
+  int implementationProviderOrdinal =
+(collationId >> IMPLEMENTATION_PROVIDER_OFFSET) & 
IMPLEMENTATION_PROVIDER_MASK;
+  if (implementationProviderOrdinal >= 
ImplementationProvider.values().length) {
+throw SparkException.internalError("Invalid collation 
implementation provider");
+  } else {
+ImplementationProvider implementationProvider = 
ImplementationProvider.values()[
+  implementationProviderOrdinal];
+if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+  spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+} else if (implementationProvider == ImplementationProvider.ICU) {
+  spec = CollationSpecICU.fromCollationId(collationId);
+} else {
+  throw SparkException.internalError("Cannot instantiate 
indeterminate collation");
+}
+Collation collation = spec.buildCollation();
+   

Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593910153


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,445 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)

Review Comment:
   My opinion is that it would be better to stick with this naming, for the following reason.
   
   If we used bit 31 for the indeterminate collation, we would shrink the user-defined collation space, because we would need an additional bit to distinguish between predefined and user-defined collations. It is more convenient to make that distinction at the first bit, since the indeterminate collation falls into the predefined space.
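   For illustration, a small Scala sketch (the helper is hypothetical; offsets and masks are taken from the diff above) showing that the indeterminate provider is encoded in bits 30-29 and therefore never touches the user-defined bit 31:
   ```scala
   object CollationIdLayoutSketch {
     private val ImplementationProviderOffset = 29
     private val ImplementationProviderMask = 0x3  // bits 30-29
     private val UserDefinedBit = 31

     def isUserDefined(collationId: Int): Boolean =
       ((collationId >>> UserDefinedBit) & 0x1) == 1

     def providerOrdinal(collationId: Int): Int =
       (collationId >>> ImplementationProviderOffset) & ImplementationProviderMask

     def main(args: Array[String]): Unit = {
       // ordinal 2 = INDETERMINATE, per the enum in the diff
       val indeterminateId = 2 << ImplementationProviderOffset
       println(isUserDefined(indeterminateId))   // false: stays in the predefined space
       println(providerOrdinal(indeterminateId)) // 2
     }
   }
   ```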






Re: [PR] [SPARK-48191][SQL] Support UTF-32 for string encode and decode [spark]

2024-05-08 Thread via GitHub


yaooqinn closed pull request #46469: [SPARK-48191][SQL] Support UTF-32 for 
string encode and decode
URL: https://github.com/apache/spark/pull/46469





Re: [PR] [SPARK-48191][SQL] Support UTF-32 for string encode and decode [spark]

2024-05-08 Thread via GitHub


yaooqinn commented on PR #46469:
URL: https://github.com/apache/spark/pull/46469#issuecomment-2100433840

   Thank you all. Merged to master





Re: [PR] [SPARK-48188][SQL] Consistently use normalized plan for cache [spark]

2024-05-08 Thread via GitHub


yaooqinn commented on PR #46465:
URL: https://github.com/apache/spark/pull/46465#issuecomment-2100436615

   Thank you @cloud-fan, merged to master





Re: [PR] [SPARK-48188][SQL] Consistently use normalized plan for cache [spark]

2024-05-08 Thread via GitHub


yaooqinn closed pull request #46465: [SPARK-48188][SQL] Consistently use 
normalized plan for cache
URL: https://github.com/apache/spark/pull/46465





Re: [PR] [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions [spark]

2024-05-08 Thread via GitHub


jeanlyn commented on PR #41628:
URL: https://github.com/apache/spark/pull/41628#issuecomment-2100456683

   @zhouyifan279 👍🏻, I think the solution you mention can solve the data inconsistency for custom partitions. But I don't know whether it's acceptable to communicate with the metastore in such a low-level API, and I'm not sure whether other problems will arise.





Re: [PR] [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions [spark]

2024-05-08 Thread via GitHub


zhouyifan279 commented on PR #41628:
URL: https://github.com/apache/spark/pull/41628#issuecomment-2100506618

   @jeanlyn 
   
   > 2. Use the partitionPaths to get the matchingPartitions, then get the customPartitionLocations like what we do in this PR.
   
   We can create a function `partitionPaths: Set[String] => Map[TablePartitionSpec, String]` in `InsertIntoHadoopFsRelationCommand` and pass it to `HadoopMapReduceCommitProtocol` as a constructor param.





[PR] [SPARK-47972][SQL] Restrict CAST expression for collations [spark]

2024-05-08 Thread via GitHub


mihailom-db opened a new pull request, #46474:
URL: https://github.com/apache/spark/pull/46474

   ### What changes were proposed in this pull request?
   Block the syntax CAST(value AS STRING COLLATE collation_name).
   
   
   ### Why are the changes needed?
   The current state of the code allows calls like CAST(1 AS STRING COLLATE UNICODE). We want to restrict the CAST expression so that it can only cast to the default-collation string type, and to only allow the COLLATE expression to produce explicitly collated strings.
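   For illustration, the intended before/after behavior (a sketch, assuming a `spark` session as in `spark-shell`; the exact error raised by the new check is not shown here):
   ```scala
   spark.sql("SELECT CAST(1 AS STRING)")                  // unaffected
   spark.sql("SELECT CAST(1 AS STRING) COLLATE UNICODE")  // collation via the COLLATE expression
   spark.sql("SELECT CAST(1 AS STRING COLLATE UNICODE)")  // expected to fail after this change
   ```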
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   
   ### How was this patch tested?
   Test in CollationSuite.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


cxzl25 commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1593987337


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   
https://github.com/cxzl25/spark/actions/runs/8997219681/job/24725778387#step:24:6423
   ```
   Exception occurred:
 File "/usr/local/lib/python3.9/dist-packages/nbsphinx/__init__.py", line 
1316, in apply
   for section in self.document.findall(docutils.nodes.section):
   AttributeError: 'document' object has no attribute 'findall'
   ```
   
   The failed CI uses nbsphinx 0.9.4 version, which requires docutils >= 0.18.1.
   
   https://github.com/spatialaudio/nbsphinx/releases/tag/0.9.4






Re: [PR] [SPARK-48161][SQL] Add collation support for JSON expressions [spark]

2024-05-08 Thread via GitHub


cloud-fan commented on PR #46462:
URL: https://github.com/apache/spark/pull/46462#issuecomment-2100516982

   thanks, merging to master!





Re: [PR] [SPARK-48161][SQL] Add collation support for JSON expressions [spark]

2024-05-08 Thread via GitHub


cloud-fan closed pull request #46462: [SPARK-48161][SQL] Add collation support 
for JSON expressions
URL: https://github.com/apache/spark/pull/46462





Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


cxzl25 commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1593987337


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   
https://github.com/cxzl25/spark/actions/runs/8997219681/job/24725778387#step:24:6423
   ```
   Exception occurred:
 File "/usr/local/lib/python3.9/dist-packages/nbsphinx/__init__.py", line 
1316, in apply
   for section in self.document.findall(docutils.nodes.section):
   AttributeError: 'document' object has no attribute 'findall'
   ```
   
   The failed CI uses nbsphinx 0.9.4 version, which requires docutils >= 0.18.1.
   
   https://github.com/spatialaudio/nbsphinx/releases/tag/0.9.4
   
   Release 
   0.9.4 May 7, 2024
   0.9.3 Aug 27, 2023






Re: [PR] [SPARK-46741][SQL] Cache Table with CTE won't work [spark]

2024-05-08 Thread via GitHub


AngersZh commented on PR #44767:
URL: https://github.com/apache/spark/pull/44767#issuecomment-2100528131

   ping @cloud-fan Changed to follow your PR, please take a look again.





[PR] [SPARK-48197][SQL] Avoid assert error for invalid lambda function [spark]

2024-05-08 Thread via GitHub


cloud-fan opened a new pull request, #46475:
URL: https://github.com/apache/spark/pull/46475

   
   
   ### What changes were proposed in this pull request?
   
   `ExpressionBuilder` asserts all its input expressions to be resolved during 
lookup, which is not true as the analyzer rule `ResolveFunctions` can trigger 
function lookup even if the input expression contains unresolved lambda 
functions.
   
   This PR updates that assert to check non-lambda inputs only, and fail 
earlier if the input contains lambda functions. In the future, if we use 
`ExpressionBuilder` to register higher-order functions, we can relax it.
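   As a hypothetical illustration (`some_builder_fn` is a made-up name standing for any `ExpressionBuilder`-registered function that does not accept lambdas), assuming a `spark` session:
   ```scala
   // before: the resolution assert fails with an internal error during lookup
   // after:  a user-facing error says the lambda argument is not supported here
   spark.sql("SELECT some_builder_fn(x -> x + 1)")
   ```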
   
   ### Why are the changes needed?
   
   better error message
   
   ### Does this PR introduce _any_ user-facing change?
   
   no, only changes error message
   
   ### How was this patch tested?
   
   new test
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





[PR] [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 [spark]

2024-05-08 Thread via GitHub


panbingkun opened a new pull request, #46476:
URL: https://github.com/apache/spark/pull/46476

   ### What changes were proposed in this pull request?
   This PR aims to upgrade `jackson` from `2.17.0` to `2.17.1`.
   
   ### Why are the changes needed?
   The full release notes: 
   https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.1
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2100710684

   Thank you for reporting and reverting, @yaooqinn and @HyukjinKwon .
   
   Ya, I agree. It's not a single command event issue. It seems that the whole 
`gridTest` is flaky or `SparkConnectServiceSuite` is flaky by itself.





Re: [PR] [SPARK-48163][CONNECT][TESTS] Disable `SparkConnectServiceSuite.SPARK-43923: commands send events - get_resources_command` [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46425:
URL: https://github.com/apache/spark/pull/46425#issuecomment-2100710947

   I'll re-try later~





Re: [PR] [SPARK-47018][BUILD][SQL][HIVE] Bump built-in Hive to 2.3.10 [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46468:
URL: https://github.com/apache/spark/pull/46468#issuecomment-2100720094

   Thank you, @pan3793 and @sunchao .





Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

2024-05-08 Thread via GitHub


ianmcook commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2100720402

   Thanks @HyukjinKwon. Anything else you need before merging this?





Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46464:
URL: https://github.com/apache/spark/pull/46464#issuecomment-2100725260

   Merged to branch-3.4.





Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun closed pull request #46464: [SPARK-48037][CORE][3.4] Fix 
SortShuffleWriter lacks shuffle write related metrics resulting in potentially 
inaccurate data
URL: https://github.com/apache/spark/pull/46464





Re: [PR] [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to respect shuffle join hints [spark]

2024-05-08 Thread via GitHub


cloud-fan commented on code in PR #46401:
URL: https://github.com/apache/spark/pull/46401#discussion_r1594148455


##
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##
@@ -191,6 +203,13 @@ class JoinSuite extends QueryTest with SharedSparkSession 
with AdaptiveSparkPlan
 //).foreach { case (query, joinClass) => assertJoin(query, joinClass) }
 //  }
 
+  test("broadcastable join with shuffle join hint") {
+spark.sharedState.cacheManager.clearCache()
+sql("CACHE TABLE testData")

Review Comment:
   why do we need to cache?






Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


cxzl25 commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1594150601


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   Do we need to pin the nbsphinx version in the master branch as well, similar to SPARK-39421? cc @HyukjinKwon






Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1594151959


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   Oops. I missed this line. My bad. We should proceed with this separately because of the following, @cxzl25:
   - #46448






Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-08 Thread via GitHub


dbatomic commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1594153009


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,445 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)
+ * bit 28:0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive 
for ICU
+ * bit 27:0 for utf8-binary / 0 = accent-sensitive, 1 = 
accent-insensitive for ICU
+ * bit 26-25: zeroes, reserved for punctuation sensitivity
+ * bit 24-23: zeroes, reserved for first letter preference
+ * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper
+ * bit 20-19: zeroes, reserved for space trimming
+ * bit 18-17: zeroes, reserved for version
+ * bit 16-12: zeroes
+ * bit 11-0:  zeroes for utf8-binary / locale id for ICU
  */
-public Collation(
-String collationName,
-Collator collator,
-String version,
-boolean supportsBinaryEquality,
-boolean supportsBinaryOrdering,
-boolean supportsLowercaseEquality) {
-  this(
-collationName,
-collator,
-(s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-version,
-s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-supportsBinaryEquality,
-supportsBinaryOrdering,
-supportsLowercaseEquality);
+private abstract static class CollationSpec {
+  protected enum ImplementationProvider {
+UTF8_BINARY, ICU, INDETERMINATE
+  }
+
+  protected enum CaseSensitivity {
+CS, CI
+  }
+
+  protected enum AccentSensitivity {
+AS, AI
+  }
+
+  protected enum CaseConversion {
+UNSPECIFIED, LCASE, UCASE
+  }
+
+  protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+  protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11;
+  protected static final int CASE_SENSITIVITY_OFFSET = 28;
+  protected static final int CASE_SENSITIVITY_MASK = 0b1;
+  protected static final int ACCENT_SENSITIVITY_OFFSET = 27;
+  protected static final int ACCENT_SENSITIVITY_MASK = 0b1;
+  protected static final int CASE_CONVERSION_OFFSET = 21;
+  protected static final int CASE_CONVERSION_MASK = 0b11;
+  protected static final int LOCALE_OFFSET = 0;
+  protected static final int LOCALE_MASK = 0x0FFF;
+
+  protected static final int INDETERMINATE_COLLATION_ID =
+ImplementationProvider.INDETERMINATE.ordinal() << 
IMPLEMENTATION_PROVIDER_OFFSET;
+
+  protected final CaseSensitivity caseSensitivity;
+  protected final AccentSensitivity accentSensitivity;
+  protected final CaseConversion caseConversion;
+  protected final String locale;
+  protected final int collationId;
+
+  protected CollationSpec(
+  String locale,
+  CaseSensitivity caseSensitivity,
+  AccentSensitivity accentSensitivity,
+  CaseConversion caseConversion) {
+this.locale = locale;
+this.caseSensitivity = caseSensitivity;
+this.accentSensitivity = accentSensitivity;
+this.caseConversion = caseConversion;
+this.collationId = getCollationId();
+  }
+
+  private static final Map collationMap = new 
ConcurrentHashMap<>();
+
+  public static Collation fetchCollation(int collationId) throws 
SparkException {
+if (collationId == UTF8_BINARY_COLLATION_ID) {
+  return CollationSpecUTF8Binary.UTF8_BINARY_COLLATION;
+} else if (collationMap.containsKey(collationId)) {
+  return collationMap.get(collationId);
+} else {
+  CollationSpec spec;
+  int implementationProviderOrdinal =
+(collationId >> IMPLEMENTATION_PROVIDER_OFFSET) & 
IMPLEMENTATION_PROVIDER_MASK;
+  if (implementationProviderOrdinal >= 
ImplementationProvider.values().length) {
+throw SparkException.internalError("Invalid collation 
implementation provider");
+  } else {
+ImplementationProvider implementationProvider = 
ImplementationProvider.values()[
+  implementationProviderOrdinal];
+if (implementationProvider == ImplementationProvider.UTF8_BINARY) {
+  spec = CollationSpecUTF8Binary.fromCollationId(collationId);
+} else if (implementationProvider == ImplementationProvider.ICU) {
+  spec = CollationSpecICU.fromCollationId(collationId);
+} else {
+  throw SparkException.internalError("Cannot instantiate 
indeterminate collation");
+}
+Collation collation = spec.buildCollation();
+

Re: [PR] [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to respect shuffle join hints [spark]

2024-05-08 Thread via GitHub


cloud-fan commented on code in PR #46401:
URL: https://github.com/apache/spark/pull/46401#discussion_r1594152932


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -400,11 +395,32 @@ trait JoinSelectionHelper {
 }
   }
 
-  def canPlanAsBroadcastHashJoin(join: Join, conf: SQLConf): Boolean = {
-getBroadcastBuildSide(join.left, join.right, join.joinType,
-  join.hint, hintOnly = true, conf).isDefined ||
-  getBroadcastBuildSide(join.left, join.right, join.joinType,
-join.hint, hintOnly = false, conf).isDefined
+  protected def hashJoinSupported
+  (leftKeys: Seq[Expression], rightKeys: Seq[Expression]): Boolean = {
+val result = leftKeys.concat(rightKeys).forall(e => 
UnsafeRowUtils.isBinaryStable(e.dataType))
+if (!result) {
+  val keysNotSupportingHashJoin = leftKeys.concat(rightKeys).filterNot(
+e => UnsafeRowUtils.isBinaryStable(e.dataType))
+  logWarning(log"Hash based joins are not supported due to joining on keys 
that don't " +
+log"support binary equality. Keys not supporting hash joins: " +
+log"${
+  MDC(HASH_JOIN_KEYS, keysNotSupportingHashJoin.map(
+e => e.toString + " due to DataType: " + 
e.dataType.typeName).mkString(", "))
+}")
+}
+result
+  }
+
+  def canPlanAsBroadcastHashJoin(join: Join, conf: SQLConf): Boolean = join 
match {

Review Comment:
   to confirm: do we call this function in `JoinSelection`?






Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


cxzl25 commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1594158117


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 
'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 
'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 
'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   Thanks @dongjoon-hyun! I didn’t know why CI failed on the 3.4 branch at 
first, so I tested it in my own way.
   
   Do we need to backport SPARK-48179 to branch-3.4?






Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun closed pull request #46456: [SPARK-48184][PYTHON][CONNECT] Always 
set the seed of `Dataframe.sample` in Client side
URL: https://github.com/apache/spark/pull/46456





Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46456:
URL: https://github.com/apache/spark/pull/46456#issuecomment-2100756927

   Merged to master/3.5.





Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46456:
URL: https://github.com/apache/spark/pull/46456#issuecomment-2100758425

   Thank you, @zhengruifeng and @HyukjinKwon .





Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWithState operator. [spark]

2024-05-08 Thread via GitHub


sahnib commented on PR #45376:
URL: https://github.com/apache/spark/pull/45376#issuecomment-2100800925

   > Let's be sure to either 1) introduce a method to users which gives a 
watermark value before advancing (late events) or 2) construct a story for 
users to set the event time timestamp properly without watermark value. @sahnib 
Could you please file a JIRA ticket with blocker priority?
   
   @HeartSaVioR Created the JIRA https://issues.apache.org/jira/browse/SPARK-48199 for the follow-up item.





Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594208101


##
sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala:
##
@@ -49,6 +49,7 @@ import 
org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
  * @tparam OUT The type of the final output result.
  * @since 1.6.0
  */
+@SerialVersionUID(2093413866369130093L)

Review Comment:
   Why is this needed?






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594209214


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.{Encoder, TypedColumn}
+
+/**
+ * A base class for user-defined aggregations, which can be used in `Dataset` 
operations to take
+ * all of the elements of a group and reduce them to a single value.
+ *
+ * For example, the following aggregator extracts an `int` from a specific 
class and adds them up:
+ * {{{
+ *   case class Data(i: Int)
+ *
+ *   val customSummer =  new Aggregator[Data, Int, Int] {
+ * def zero: Int = 0
+ * def reduce(b: Int, a: Data): Int = b + a.i
+ * def merge(b1: Int, b2: Int): Int = b1 + b2
+ * def finish(r: Int): Int = r
+ * def bufferEncoder: Encoder[Int] = Encoders.scalaInt
+ * def outputEncoder: Encoder[Int] = Encoders.scalaInt
+ *   }
+ *
+ *   spark.udf.register("customSummer", udaf(customSummer))
+ *   val ds: Dataset[Data] = ...
+ *   val aggregated = ds.selectExpr("customSummer(i)")
+ * }}}
+ *
+ * Based loosely on Aggregator from Algebird: 
https://github.com/twitter/algebird
+ *
+ * @tparam IN  The input type for the aggregation.
+ * @tparam BUF The type of the intermediate value of the reduction.
+ * @tparam OUT The type of the final output result.
+ * @since 4.0.0
+ */
+@SerialVersionUID(2093413866369130093L)
+abstract class Aggregator[-IN, BUF, OUT] extends Serializable {

Review Comment:
   Can we move this to common instead of having two abstract classes?
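
   A minimal sketch of the idea in this suggestion: keep a single abstract `Aggregator` in a shared module and have both the classic and Connect surfaces depend on it, so user code like the following compiles unchanged against either. The module placement is an assumption for illustration, not the PR's final layout.

   ```scala
   // User-defined aggregator written against the single shared contract;
   // it does not care which module ultimately hosts the abstract class.
   import org.apache.spark.sql.{Encoder, Encoders}
   import org.apache.spark.sql.expressions.Aggregator

   object LongSum extends Aggregator[Long, Long, Long] {
     def zero: Long = 0L
     def reduce(b: Long, a: Long): Long = b + a
     def merge(b1: Long, b2: Long): Long = b1 + b2
     def finish(r: Long): Long = r
     def bufferEncoder: Encoder[Long] = Encoders.scalaLong
     def outputEncoder: Encoder[Long] = Encoders.scalaLong
   }
   ```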






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594210664


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
##
@@ -205,6 +205,91 @@ object ScalarUserDefinedFunction {
   }
 }
 
+case class UserDefinedAggregationFunction private[sql] (

Review Comment:
   How much of this implementation is shared with the 
`ScalarUserDefinedFunction`?






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


xupefei commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594216679


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.{Encoder, TypedColumn}
+
+/**
+ * A base class for user-defined aggregations, which can be used in `Dataset` 
operations to take
+ * all of the elements of a group and reduce them to a single value.
+ *
+ * For example, the following aggregator extracts an `int` from a specific 
class and adds them up:
+ * {{{
+ *   case class Data(i: Int)
+ *
+ *   val customSummer =  new Aggregator[Data, Int, Int] {
+ * def zero: Int = 0
+ * def reduce(b: Int, a: Data): Int = b + a.i
+ * def merge(b1: Int, b2: Int): Int = b1 + b2
+ * def finish(r: Int): Int = r
+ * def bufferEncoder: Encoder[Int] = Encoders.scalaInt
+ * def outputEncoder: Encoder[Int] = Encoders.scalaInt
+ *   }
+ *
+ *   spark.udf.register("customSummer", udaf(customSummer))
+ *   val ds: Dataset[Data] = ...
+ *   val aggregated = ds.selectExpr("customSummer(i)")
+ * }}}
+ *
+ * Based loosely on Aggregator from Algebird: 
https://github.com/twitter/algebird
+ *
+ * @tparam IN  The input type for the aggregation.
+ * @tparam BUF The type of the intermediate value of the reduction.
+ * @tparam OUT The type of the final output result.
+ * @since 4.0.0
+ */
+@SerialVersionUID(2093413866369130093L)
+abstract class Aggregator[-IN, BUF, OUT] extends Serializable {

Review Comment:
   Yes, that would be ideal. I was doing that before, until I found that the Connect client needs a different `@since` docstring (4.0.0). Could you suggest how we could document this on the client side if the class is moved to common?






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594223201


##
sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala:
##
@@ -49,6 +49,7 @@ import 
org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
  * @tparam OUT The type of the final output result.
  * @since 1.6.0
  */
+@SerialVersionUID(2093413866369130093L)

Review Comment:
   `TypedColumn`?






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594227631


##
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##
@@ -379,6 +380,15 @@ message ScalarScalaUDF {
   bool nullable = 4;
 }
 
+message ScalaUDAF {
+  // (Required) Serialized JVM object of the Aggregator instance, including 
buffer and output encoders
+  bytes payload = 1;
+  // (Required) Input type of the UDAF
+  DataType inputType = 2;
+  // (Required) True if the UDAF can return null value
+  bool nullable = 3;

Review Comment:
   Why is the return type not needed?






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594229607


##
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##
@@ -379,6 +380,15 @@ message ScalarScalaUDF {
   bool nullable = 4;
 }
 
+message ScalaUDAF {
+  // (Required) Serialized JVM object of the Aggregator instance, including 
buffer and output encoders
+  bytes payload = 1;
+  // (Required) Input type of the UDAF
+  DataType inputType = 2;
+  // (Required) True if the UDAF can return null value
+  bool nullable = 3;

Review Comment:
   For the record, we should not rely on the payload for this.
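
   To illustrate the trade-off being discussed (a hedged sketch; `deserializeAggregator` is a stand-in name, not an actual API): the server could in principle recover the output type only after deserializing the payload, whereas an explicit proto field would make it available without touching the serialized object.

   ```scala
   // Payload-derived output type: requires deserializing the Aggregator first.
   val agg = deserializeAggregator(udafProto.getPayload) // hypothetical helper
   val outputSchemaFromPayload = agg.outputEncoder.schema

   // The alternative raised here is to carry such type information explicitly
   // in the ScalaUDAF message instead of relying on the opaque payload bytes.
   ```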






Re: [PR] [SPARK-48008][WIP] Support UDAFs in Spark Connect [spark]

2024-05-08 Thread via GitHub


hvanhovell commented on code in PR #46245:
URL: https://github.com/apache/spark/pull/46245#discussion_r1594250192


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.expressions
+
+import org.apache.spark.sql.{Encoder, TypedColumn}
+
+/**
+ * A base class for user-defined aggregations, which can be used in `Dataset` 
operations to take
+ * all of the elements of a group and reduce them to a single value.
+ *
+ * For example, the following aggregator extracts an `int` from a specific 
class and adds them up:
+ * {{{
+ *   case class Data(i: Int)
+ *
+ *   val customSummer =  new Aggregator[Data, Int, Int] {
+ * def zero: Int = 0
+ * def reduce(b: Int, a: Data): Int = b + a.i
+ * def merge(b1: Int, b2: Int): Int = b1 + b2
+ * def finish(r: Int): Int = r
+ * def bufferEncoder: Encoder[Int] = Encoders.scalaInt
+ * def outputEncoder: Encoder[Int] = Encoders.scalaInt
+ *   }
+ *
+ *   spark.udf.register("customSummer", udaf(customSummer))
+ *   val ds: Dataset[Data] = ...
+ *   val aggregated = ds.selectExpr("customSummer(i)")
+ * }}}
+ *
+ * Based loosely on Aggregator from Algebird: 
https://github.com/twitter/algebird
+ *
+ * @tparam IN  The input type for the aggregation.
+ * @tparam BUF The type of the intermediate value of the reduction.
+ * @tparam OUT The type of the final output result.
+ * @since 4.0.0
+ */
+@SerialVersionUID(2093413866369130093L)
+abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
+
+  /**
+   * A zero value for this aggregation. Should satisfy the property that any b 
+ zero = b.
+   *
+   * @since 4.0.0
+   */
+  def zero: BUF
+
+  /**
+   * Combine two values to produce a new value.  For performance, the function 
may modify `b` and
+   * return it instead of constructing new object for b.
+   *
+   * @since 4.0.0
+   */
+  def reduce(b: BUF, a: IN): BUF
+
+  /**
+   * Merge two intermediate values.
+   *
+   * @since 4.0.0
+   */
+  def merge(b1: BUF, b2: BUF): BUF
+
+  /**
+   * Transform the output of the reduction.
+   *
+   * @since 4.0.0
+   */
+  def finish(reduction: BUF): OUT
+
+  /**
+   * Specifies the `Encoder` for the intermediate value type.
+   *
+   * @since 4.0.0
+   */
+  def bufferEncoder: Encoder[BUF]
+
+  /**
+   * Specifies the `Encoder` for the final output value type.
+   *
+   * @since 4.0.0
+   */
+  def outputEncoder: Encoder[OUT]
+
+  /**
+   * Returns this `Aggregator` as a `TypedColumn` that can be used in 
`Dataset`.
+   * operations.
+   */
+  def toColumn: TypedColumn[IN, OUT] = {

Review Comment:
   How should this work on the connect side?
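
   For context, a hedged sketch of the classic-API behavior that a Connect implementation of `toColumn` would need to reproduce; the session and dataset setup below are illustrative, and `customSummer` refers to the `Aggregator[Data, Int, Int]` from the scaladoc example above.

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().getOrCreate()
   import spark.implicits._

   val ds = Seq(Data(1), Data(2), Data(3)).toDS()
   // Typed aggregation over the whole Dataset via the TypedColumn.
   val total: Int = ds.select(customSummer.toColumn).head() // 6
   ```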






Re: [PR] [SPARK-48037][CORE][3.4] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on code in PR #46464:
URL: https://github.com/apache/spark/pull/46464#discussion_r1594250748


##
.github/workflows/build_and_test.yml:
##
@@ -644,6 +644,7 @@ jobs:
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0' 'sphinxcontrib-applehelp==1.0.4' 'sphinxcontrib-devhelp==1.0.2' 'sphinxcontrib-htmlhelp==2.0.1' 'sphinxcontrib-qthelp==1.0.3' 'sphinxcontrib-serializinghtml==1.1.5' 'nest-asyncio==1.5.8' 'rpds-py==0.16.2' 'alabaster==0.7.13'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8'
+python3.9 -m pip install 'nbsphinx==0.9.3'

Review Comment:
   > Do we need to backport 
[SPARK-48179](https://issues.apache.org/jira/browse/SPARK-48179) to branch 3.4?
   
   I believe it's too late and would be redundant.






Re: [PR] [SPARK-47421][SQL] Add collation support for URL expressions [spark]

2024-05-08 Thread via GitHub


uros-db commented on PR #46460:
URL: https://github.com/apache/spark/pull/46460#issuecomment-2100872209

   Note: collation awareness for these pass-through Spark expressions required modifying the query plans in `query-tests/explain-results/…` to accommodate `StringTypeAnyCollation` instead of `StringType`.
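
   As a concrete illustration of the behavior this enables (hedged: the exact collation name and session setup are assumptions, not taken from the PR), a pass-through URL expression such as `parse_url` should now accept a collated string argument instead of only the default `StringType`:

   ```scala
   // Assumes a running SparkSession `spark` with collation support available.
   spark.sql(
     "SELECT parse_url('https://spark.apache.org/docs' COLLATE UTF8_LCASE, 'HOST')"
   ).show()
   // Expected to return spark.apache.org, now without a type-mismatch error.
   ```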





Re: [PR] [SPARK-48197][SQL] Avoid assert error for invalid lambda function [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46475:
URL: https://github.com/apache/spark/pull/46475#issuecomment-2100882656

   Is the UT failure relevant, @cloud-fan ?
   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed: Total 10583, Failed 1, Errors 0, Passed 10582, Ignored 29
   [error] Failed tests:
   [error]  
org.apache.spark.sql.execution.python.PythonStreamingDataSourceSuite
   ```





Re: [PR] [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun closed pull request #46476: [SPARK-48198][BUILD] Upgrade jackson 
to 2.17.1
URL: https://github.com/apache/spark/pull/46476





Re: [PR] [SPARK-48198][BUILD] Upgrade jackson to 2.17.1 [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46476:
URL: https://github.com/apache/spark/pull/46476#issuecomment-2100889456

   Merged to master for Apache Spark 4.0.0-preview.





Re: [PR] [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests [spark]

2024-05-08 Thread via GitHub


dongjoon-hyun commented on PR #46434:
URL: https://github.com/apache/spark/pull/46434#issuecomment-2100893973

   Thank you.
   
   BTW, is the recent test failure relevant? It looks unrelated to me. Could you take a look?
   
   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed: Total 10578, Failed 1, Errors 0, Passed 10577, Ignored 29
   [error] Failed tests:
   [error]  
org.apache.spark.sql.execution.python.PythonStreamingDataSourceSuite
   [error] (sql / Test / test) sbt.TestsFailedException: Tests unsuccessful
   [error] Total time: 3020 s (50:20), completed May 8, 2024, 1:48:01 PM
   ```




