Re: [PR] [SPARK-45292][SQL][HIVE] Remove Guava from shared classes from IsolatedClientLoader [spark]
dongjoon-hyun commented on PR #42599: URL: https://github.com/apache/spark/pull/42599#issuecomment-1947911644 Is there any update for Apache Hive 2.3.10, @pan3793 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47072][SQL] Fix supported interval formats in error messages [spark]
MaxGekk commented on PR #45127: URL: https://github.com/apache/spark/pull/45127#issuecomment-1947910365 @AngersZh @cloud-fan @HyukjinKwon Could you review this bug fix, please?
Re: [PR] [SPARK-45720] Upgrade AWS SDK to v2 for Spark Kinesis connector module [spark]
dongjoon-hyun commented on code in PR #44211: URL: https://github.com/apache/spark/pull/44211#discussion_r1492052014 ## connector/kinesis-asl-assembly/pom.xml: ## @@ -62,12 +62,18 @@ com.google.protobuf protobuf-java - 2.6.1 - + compile + + + + com.google.guava + guava + ${aws.kinesis.client.guava.version} + compile Review Comment: Sorry for being late. 1. Is this only for Kinesis or for all AWS SDK v2? 2. Instead of the following, can we use the latest 33.0.0-jre like #44795? ``` 32.1.1-jre ```
Re: [PR] [SPARK-45720] Upgrade AWS SDK to v2 for Spark Kinesis connector module [spark]
dongjoon-hyun commented on code in PR #44211: URL: https://github.com/apache/spark/pull/44211#discussion_r1492043709 ## connector/kinesis-asl-assembly/pom.xml: ## @@ -62,12 +62,18 @@ com.google.protobuf protobuf-java - 2.6.1 - + compile Review Comment: For the record, I removed this shaded `protobuf-java` completely in Apache Spark 4.0.0 independently. - #45096
Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-conect-server-script.sh [spark]
dongjoon-hyun commented on code in PR #45117: URL: https://github.com/apache/spark/pull/45117#discussion_r1492039371 ## sbin/start-connect-server.sh: ## @@ -40,8 +40,9 @@ fi if [ "$1" == "--wait" ]; then shift - exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@" -else - exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" + export SPARK_NO_DAEMONIZE=1 fi + exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@" Review Comment: While checking once more, I realized that we have extra leading spaces in this line. Could you double-check the indentation? ![Screenshot 2024-02-15 at 23 10 56](https://github.com/apache/spark/assets/9700541/538e7354-9b61-4f37-b04d-b88dbb0d7d1e)
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
peter-toth commented on PR #45036: URL: https://github.com/apache/spark/pull/45036#issuecomment-1947853367 > `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` already covers it. Yes, you are right. Probably the `NaN` special handling can be removed though.
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
cloud-fan commented on PR #45036: URL: https://github.com/apache/spark/pull/45036#issuecomment-1947841579 `SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` already covers it.
Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]
cloud-fan commented on code in PR #45134: URL: https://github.com/apache/spark/pull/45134#discussion_r1492014338 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala: ## @@ -41,10 +41,15 @@ object RewriteWithExpression extends Rule[LogicalPlan] { rewriteWithExprAndInputPlans(expr, inputPlans) } newPlan = newPlan.withNewChildren(inputPlans.toIndexedSeq) -if (p.output == newPlan.output) { - newPlan -} else { +// Since we add extra Projects with extra columns to pre-evaluate the common expressions, +// the current operator may have extra columns if it inherits the output columns from its +// child, and we need to project away the extra columns to keep the plan schema unchanged. Review Comment: Not related to this PR, but I'm refining the code here since I'm touching this rule.
Re: [PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]
cloud-fan commented on PR #45134: URL: https://github.com/apache/spark/pull/45134#issuecomment-1947836202 cc @viirya @ulysses-you
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
peter-toth commented on PR #45036: URL: https://github.com/apache/spark/pull/45036#issuecomment-1947835261 If we go this direction and change `OpenHashSet` do we still need `SQLOpenHashSet`?
[PR] [SPARK-47071][SQL] Inline With expression if it contains special expression [spark]
cloud-fan opened a new pull request, #45134: URL: https://github.com/apache/spark/pull/45134 ### What changes were proposed in this pull request? This is a bug fix for the With expression. If the common expression contains a special expression like an aggregate expression, we cannot pull it out and put it in Project. We have to inline it. ### Why are the changes needed? bug fix. ### Does this PR introduce _any_ user-facing change? A previously failing query can run after this fix, but this bug is not released yet. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no
Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]
dongjoon-hyun closed pull request #45113: [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support URL: https://github.com/apache/spark/pull/45113
Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]
dongjoon-hyun commented on PR #45113: URL: https://github.com/apache/spark/pull/45113#issuecomment-1947828146 The `yarn` and `connect` module failures are known flaky ones. https://github.com/apache/spark/assets/9700541/ef345b48-d02b-4310-80bd-aa0169e040db Merged to master.
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
nchammas commented on PR #45036: URL: https://github.com/apache/spark/pull/45036#issuecomment-1947785597 @revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
nchammas commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1491975003 ## core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala: ## @@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers { map(null) = null assert(map.get(null) === Some(null)) } + + test("SPARK-45599: 0.0 and -0.0 should count distinctly") { +// Exactly these elements provided in roughly this order trigger a condition where lookups of +// 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly +// and inconsistently if `==` is used to check for key equality. Review Comment: I tweaked the test name. Is that what you had in mind? This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long).
Re: [PR] [SPARK-46972][SQL] Fix asymmetrical replacement for char/varchar in V2SessionCatalog.createTable [spark]
cloud-fan commented on code in PR #45019: URL: https://github.com/apache/spark/pull/45019#discussion_r1491966372 ## sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala: ## @@ -1735,6 +1735,15 @@ class DataSourceV2SQLSuiteV1Filter } } + test("SPARK-46972: asymmetrical replacement for char/varchar in V2SessionCatalog.createTable") { +// unset this config to use the default v2 session catalog. +spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key) +withTable("t") { + sql(s"CREATE TABLE t(c char(1), v varchar(2)) USING $v2Source") + assert(!spark.table("t").isEmpty) Review Comment: I think we should not trigger an actual scan. The data schema (int) is incompatible with char, and the scan behavior is kind of undefined. It should be good enough to run the CREATE TABLE command in this test case.
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
cloud-fan commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1491931994 ## core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala: ## @@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers { map(null) = null assert(map.get(null) === Some(null)) } + + test("SPARK-45599: 0.0 and -0.0 should count distinctly") { +// Exactly these elements provided in roughly this order trigger a condition where lookups of +// 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly +// and inconsistently if `==` is used to check for key equality. Review Comment: shall we mention the NaN behavior as well? All NaN values are the same.
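The signed-zero and NaN semantics being discussed come down to value comparison versus bit-pattern comparison (the latter is effectively what `java.lang.Double.equals` does via `doubleToLongBits`, and what object equality in the hash set relies on). A minimal Python sketch of the distinction, not taken from the PR, using `struct` to expose the raw IEEE-754 bits:

```python
import struct


def bits(x: float) -> int:
    """Return the raw IEEE-754 bit pattern of a double (like Double.doubleToLongBits)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


# IEEE-754 value comparison: 0.0 equals -0.0, and NaN never equals itself,
# which is why `==`-based key lookups merge the two zeros and split NaNs.
assert 0.0 == -0.0
assert float("nan") != float("nan")

# Bit-pattern ("object") comparison: the signed zeros are distinct,
# while two canonical NaNs compare equal -- matching SQL DISTINCT semantics.
assert bits(0.0) != bits(-0.0)
assert bits(float("nan")) == bits(float("nan"))
```

This is why switching `OpenHashSet` to object equality changes both the 0.0/-0.0 counts and the NaN grouping at once.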
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
HyukjinKwon commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1491925413 ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by +:attr:`spark.profile`. + +.. versionadded: 4.0.0 +""" + +def __init__(self, sparkSession: "SparkSession"): +self.sparkSession = sparkSession + +def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None: +""" +Show the profile results. + +.. versionadded:: 4.0.0 + +Parameters +-- +type : str, optional +The profiler type, which can be either "perf" or "memory". +id : int, optional +A UDF ID to be shown. If not specified, all the results will be shown. +""" +if type == "memory": +self.sparkSession.showMemoryProfiles(id) Review Comment:
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
HyukjinKwon commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1491924933 ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by Review Comment: We should probably also fix the docs at `python/docs/source/reference`.
Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]
HyukjinKwon closed pull request #45132: [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch URL: https://github.com/apache/spark/pull/45132
Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]
HyukjinKwon commented on PR #45132: URL: https://github.com/apache/spark/pull/45132#issuecomment-1947709204 Merged to master, branch-3.5 and branch-3.4.
Re: [PR] [SPARK-46078][PYTHON][TESTS] Upgrade `pytorch` for Python 3.12 [spark]
HyukjinKwon closed pull request #45119: [SPARK-46078][PYTHON][TESTS] Upgrade `pytorch` for Python 3.12 URL: https://github.com/apache/spark/pull/45119
Re: [PR] [SPARK-46078][PYTHON][TESTS] Upgrade `pytorch` for Python 3.12 [spark]
HyukjinKwon commented on PR #45119: URL: https://github.com/apache/spark/pull/45119#issuecomment-1947708773 This won't affect the main build. I will check it through the scheduled build. Merged to master.
Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]
HyukjinKwon commented on PR #45115: URL: https://github.com/apache/spark/pull/45115#issuecomment-1947706285 Merged to master.
Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]
HyukjinKwon closed pull request #45115: [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 URL: https://github.com/apache/spark/pull/45115
Re: [PR] [SPARK-38098][PYTHON] Add support for ArrayType of nested StructType to arrow-based conversion [spark]
itholic commented on code in PR #35391: URL: https://github.com/apache/spark/pull/35391#discussion_r1491894555 ## python/pyspark/sql/pandas/types.py: ## @@ -86,8 +86,15 @@ def to_arrow_type(dt: DataType) -> "pa.DataType": elif type(dt) == DayTimeIntervalType: arrow_type = pa.duration("us") elif type(dt) == ArrayType: -if type(dt.elementType) in [StructType, TimestampType]: +if type(dt.elementType) == TimestampType: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) +elif type(dt.elementType) == StructType: +if LooseVersion(pa.__version__) < LooseVersion("2.0.0"): +raise TypeError( +"Array of StructType is only supported with pyarrow 2.0.0 and above" Review Comment: Hi, @LucaCanali I think I've found a case where Array of StructType doesn't work properly: **In:** ```python df = spark.createDataFrame( [ ("a", [("b", False), ("c", True)]), ] ).toDF("c1", "c2") df.toPandas() ``` **Out:** ```python c1 c2 0 a [{'_1': 'b', '_2': False}, {'_1': 'c', '_2': T... ``` **Expected:** ```python c1 c2 0 a [(b, False), (c, True)] ``` I roughly suspect that this is an internal table conversion issue on the PyArrow side, but I'm not very sure about this, so could you check this out when you find some time? Thanks in advance!
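Until the conversion itself is fixed, the list-of-dict cells could be mapped back to tuples after `toPandas()`. A hypothetical post-processing helper (`dicts_to_tuples` is not part of Spark or this PR), relying on Python dicts preserving field insertion order:

```python
def dicts_to_tuples(cell):
    """Convert one array-of-struct cell, e.g. [{'_1': 'b', '_2': False}, ...],
    into the expected list of tuples, preserving struct field order."""
    return [tuple(d.values()) for d in cell]


# The shape reported in the review comment above:
cell = [{"_1": "b", "_2": False}, {"_1": "c", "_2": True}]
assert dicts_to_tuples(cell) == [("b", False), ("c", True)]
```

Applied to the example in the comment, this would be something like `pdf["c2"] = pdf["c2"].map(dicts_to_tuples)`; it is a workaround for the observed output, not a statement about where the bug lives.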
[PR] [SPARK-XXX] Fix invalid aggregation after in-subquery rewrite [spark]
anton5798 opened a new pull request, #45133: URL: https://github.com/apache/spark/pull/45133 ### What changes were proposed in this pull request? todo ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Query tests
Re: [PR] [SPARK-47069][PYTHON] Introduce `spark.profile.show/dump` for SparkSession-based profiling [spark]
ueshin commented on code in PR #45129: URL: https://github.com/apache/spark/pull/45129#discussion_r1491866174 ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by +:attr:`spark.profile`. + +.. versionadded: 4.0.0 +""" + +def __init__(self, sparkSession: "SparkSession"): +self.sparkSession = sparkSession + +def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None: +""" +Show the profile results. + +.. versionadded:: 4.0.0 + +Parameters +-- +type : str, optional +The profiler type, which can be either "perf" or "memory". +id : int, optional +A UDF ID to be shown. If not specified, all the results will be shown. +""" +if type == "memory": +self.sparkSession.showMemoryProfiles(id) Review Comment: Shall we remove the old APIs? I think the new APIs are enough to have. ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by +:attr:`spark.profile`. + +.. versionadded: 4.0.0 +""" + +def __init__(self, sparkSession: "SparkSession"): +self.sparkSession = sparkSession Review Comment: It should take `ProfilerCollector`? ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by +:attr:`spark.profile`. + +.. versionadded: 4.0.0 +""" + +def __init__(self, sparkSession: "SparkSession"): +self.sparkSession = sparkSession + +def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None: +""" +Show the profile results. + +.. versionadded:: 4.0.0 + +Parameters +-- +type : str, optional +The profiler type, which can be either "perf" or "memory". +id : int, optional +A UDF ID to be shown. If not specified, all the results will be shown. +""" +if type == "memory": +self.sparkSession.showMemoryProfiles(id) +elif type == "perf" or type is None: +self.sparkSession.showPerfProfiles(id) +if type is None: # Show both perf and memory profiles +self.sparkSession.showMemoryProfiles(id) +else: +raise PySparkValueError( +error_class="VALUE_NOT_ALLOWED", +message_parameters={ +"arg_name": "type", +"allowed_values": str(["perf", "memory"]), +}, +) + +def dump(self, path: str, *, type: Optional[str] = None, id: Optional[int] = None) -> None: Review Comment: ditto. ```suggestion def dump(self, path: str, id: Optional[int] = None, *, type: Optional[str] = None) -> None: ``` ## python/pyspark/sql/profiler.py: ## @@ -239,3 +241,72 @@ def _profile_results(self) -> "ProfileResults": with self._lock: value = self._accumulator.value return value if value is not None else {} + + +class Profile: +"""User-facing profile API. This instance can be accessed by +:attr:`spark.profile`. + +.. versionadded: 4.0.0 +""" + +def __init__(self, sparkSession: "SparkSession"): +self.sparkSession = sparkSession + +def show(self, *, type: Optional[str] = None, id: Optional[int] = None) -> None: Review Comment: I prefer: ```suggestion def show(self, id: Optional[int] = None, *, type: Optional[str] = None) -> None: ``` ```py spark.profile.show() # show all profile results spark.profile.show(1) # show the profile results for ID = 1 spark.profile.show(1, type="memory") # show the memory profile results for ID = 1 spark.profile.show(type="memory") # show all memory profile results ``` ## python/pyspark/sql/session.py: ## @@ -906,6 +907,12 @@ def dataSource(self) -> "DataSourceRegistration": return DataSourceRegistration(self) +@property +def profile(self) -> "Profile": Review Comment: Need this for connect, too?
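The signature preference in the review above (ID as an ordinary positional parameter, `type` keyword-only, enforced by the bare `*`) can be demonstrated with a plain function; a sketch with hypothetical names, independent of the actual Spark API:

```python
def show(id=None, *, type=None):
    """Mimic the suggested signature: `id` may be positional, `type` must be a keyword."""
    return (id, type)


# All four call styles from the review comment work:
assert show() == (None, None)
assert show(1) == (1, None)
assert show(1, type="memory") == (1, "memory")
assert show(type="memory") == (None, "memory")

# Passing `type` positionally is rejected, which prevents ambiguous calls
# like show(1, "memory") silently meaning something unintended.
try:
    show(1, "memory")
    raised = False
except TypeError:
    raised = True
assert raised
```

The `*` separator is what makes the common case (`show(1)`) terse while keeping the rarer option (`type=`) explicit at every call site.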
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
HeartSaVioR commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1491869371 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala: ## @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + + +package org.apache.spark.sql.execution.python + +import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream} + +import scala.collection.mutable.ArrayBuffer +import scala.jdk.CollectionConverters._ + +import org.apache.spark.SparkEnv +import org.apache.spark.api.python.{PythonFunction, PythonWorker, PythonWorkerFactory, PythonWorkerUtils, SpecialLengths} +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.BUFFER_SIZE +import org.apache.spark.internal.config.Python.{PYTHON_AUTH_SOCKET_TIMEOUT, PYTHON_USE_DAEMON} +import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.StructType + +object PythonStreamingSourceRunner { + val initialOffsetFuncId = 884 + val latestOffsetFuncId = 885 + val partitionsFuncId = 886 + val commitFuncId = 887 +} + +/** + * This class is a proxy to invoke methods in Python DataSourceStreamReader from JVM. + * A runner spawns a python worker process. In the main function, set up communication + * between JVM and python process through socket and create a DataSourceStreamReader instance. + * In an infinite loop, the python worker process poll information(function name and parameters) + * from the socket, invoke the corresponding method of StreamReader and send return value to JVM. 
+ */ +class PythonStreamingSourceRunner( +func: PythonFunction, +outputSchema: StructType) extends Logging { + val workerModule = "pyspark.sql.streaming.python_streaming_source_runner" + + private val conf = SparkEnv.get.conf + protected val bufferSize: Int = conf.get(BUFFER_SIZE) + protected val authSocketTimeout = conf.get(PYTHON_AUTH_SOCKET_TIMEOUT) + + private val envVars: java.util.Map[String, String] = func.envVars + private val pythonExec: String = func.pythonExec + private var pythonWorker: Option[PythonWorker] = None + private var pythonWorkerFactory: Option[PythonWorkerFactory] = None + protected val pythonVer: String = func.pythonVer + + private var dataOut: DataOutputStream = null + private var dataIn: DataInputStream = null + + import PythonStreamingSourceRunner._ + + /** + * Initializes the Python worker for running the streaming source. + */ + def init(): Unit = { +logInfo(s"Initializing Python runner pythonExec: $pythonExec") +val env = SparkEnv.get + +val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",") +envVars.put("SPARK_LOCAL_DIRS", localdir) + +envVars.put("SPARK_AUTH_SOCKET_TIMEOUT", authSocketTimeout.toString) +envVars.put("SPARK_BUFFER_SIZE", bufferSize.toString) + +val prevConf = conf.get(PYTHON_USE_DAEMON) +conf.set(PYTHON_USE_DAEMON, false) +try { + val workerFactory = +new PythonWorkerFactory(pythonExec, workerModule, envVars.asScala.toMap) + val (worker: PythonWorker, _) = workerFactory.createSimpleWorker(blockingMode = true) + pythonWorker = Some(worker) + pythonWorkerFactory = Some(workerFactory) +} finally { + conf.set(PYTHON_USE_DAEMON, prevConf) +} + +val stream = new BufferedOutputStream( + pythonWorker.get.channel.socket().getOutputStream, bufferSize) +dataOut = new DataOutputStream(stream) + +PythonWorkerUtils.writePythonVersion(pythonVer, dataOut) + +val pythonIncludes = func.pythonIncludes.asScala.toSet +PythonWorkerUtils.writeSparkFiles(Some("streaming_job"), pythonIncludes, dataOut) 
+ +// Send the user function to python process +PythonWorkerUtils.writePythonFunction(func, dataOut) + +// Send output schema +PythonWorkerUtils.writeUTF(outputSchema.json, dataOut) + +// Send configurations +dataOut.writeInt(SQLConf.get.arrowMaxRecordsPerBatch) +dataOut.flush() + +dataIn = new DataInputStream( + new
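The class comment in that diff describes a request loop: the Python worker reads a function id from the socket, invokes the matching `DataSourceStreamReader` method, and sends the result back to the JVM. A minimal sketch of one dispatch step is below. The function ids (884-887) come from the diff above; everything else (length-prefixed JSON framing, in-memory streams standing in for the socket) is an illustrative assumption, since Spark's actual worker uses its own serialization helpers.

```python
import io
import json
import struct

# Function ids mirroring PythonStreamingSourceRunner in the diff above.
INITIAL_OFFSET_FUNC_ID = 884
LATEST_OFFSET_FUNC_ID = 885
PARTITIONS_FUNC_ID = 886
COMMIT_FUNC_ID = 887


def serve_one_request(reader, infile, outfile):
    """Read one function id, dispatch to the stream reader, reply with JSON."""
    (func_id,) = struct.unpack(">i", infile.read(4))  # JVM ints are big-endian
    if func_id == INITIAL_OFFSET_FUNC_ID:
        result = reader.initialOffset()
    elif func_id == LATEST_OFFSET_FUNC_ID:
        result = reader.latestOffset()
    else:
        raise ValueError(f"unsupported function id: {func_id}")
    payload = json.dumps(result).encode("utf-8")
    outfile.write(struct.pack(">i", len(payload)))  # length-prefixed reply
    outfile.write(payload)


class FakeStreamReader:
    """Stand-in for a user-defined DataSourceStreamReader."""

    def initialOffset(self):
        return {"offset": "0"}

    def latestOffset(self):
        return {"offset": "2"}


# Simulate the JVM asking for the latest offset over an in-memory "socket".
request = io.BytesIO(struct.pack(">i", LATEST_OFFSET_FUNC_ID))
response = io.BytesIO()
serve_one_request(FakeStreamReader(), request, response)
```

The real worker wraps this dispatch in an infinite loop and also handles the `partitions` and `commit` function ids.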
Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]
HyukjinKwon commented on code in PR #45115: URL: https://github.com/apache/spark/pull/45115#discussion_r1491862078 ## python/pyspark/streaming/dstream.py: ## @@ -849,7 +847,7 @@ def reduceFunc(t: datetime, a: Any, b: Any) -> Any: if a is None: g = b.groupByKey(numPartitions).mapValues(lambda vs: (list(vs), None)) else: -g = a.cogroup(b.partitionBy(cast(int, numPartitions)), numPartitions) +g = a.cogroup(b.partitionBy(int), numPartitions) Review Comment: ```suggestion g = a.cogroup(b.partitionBy(numPartitions), numPartitions) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
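The suggestion above matters because `typing.cast` is purely a type-checker hint: at runtime it returns its argument unchanged, so removing `cast(int, numPartitions)` should leave the value `numPartitions`, not substitute the type `int`. A small standalone check (the variable name is illustrative):

```python
from typing import cast

num_partitions = 8  # illustrative stand-in for dstream's numPartitions

# cast() is a runtime no-op: it hands back its second argument unchanged.
assert cast(int, num_partitions) == 8

# Passing the built-in type `int` itself, as the reviewed change did,
# supplies a class object where a partition count is expected.
assert int != num_partitions  # the type object is not a partition count
```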
Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
dongjoon-hyun closed pull request #45128: [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) URL: https://github.com/apache/spark/pull/45128
Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
dongjoon-hyun commented on PR #45128: URL: https://github.com/apache/spark/pull/45128#issuecomment-1947600534 Thank you, @viirya ! Merged to master.
Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
dongjoon-hyun commented on code in PR #45128: URL: https://github.com/apache/spark/pull/45128#discussion_r1491857406 ## .github/workflows/build_apple_silicon.yml: ## @@ -0,0 +1,47 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +name: "Build / Apple Silicon (master, JDK 21)" + +on: + schedule: +- cron: '0 20 * * *' + +jobs: + run-build: +permissions: + packages: write +name: Run +uses: ./.github/workflows/build_and_test.yml +runs-on: macos-14 +if: github.repository == 'apache/spark' +with: + java: 21 + branch: master + hadoop: hadoop3 + envs: >- +{ + "SKIP_MIMA": "true", + "SKIP_UNIDOC": "true", + "DEDICATED_JVM_SBT_TESTS": "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite,org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV1Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV2Suite" +} + jobs: >- +{ + "build": "true" Review Comment: Yes, I excluded them intentionally at this stage. They need more care. Only Java/Scala is the target. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[PR] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]
HyukjinKwon opened a new pull request, #45132: URL: https://github.com/apache/spark/pull/45132 ### What changes were proposed in this pull request? This PR fixes the regression introduced by https://github.com/apache/spark/pull/36683. ```python import pandas as pd spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0) spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() ``` **Before** ``` /.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false. range() arg 3 must not be zero warn(msg) Traceback (most recent call last): File "", line 1, in File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame return super(SparkSession, self).createDataFrame( # type: ignore[call-overload] File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame return self._create_from_pandas_with_arrow(data, schema, timezone) File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step)) ValueError: range() arg 3 must not be zero ``` ``` Empty DataFrame Columns: [a] Index: [] ``` **After** ``` a 0 123 ``` ``` a 0 123 ``` ### Why are the changes needed? It fixes a regression. This is a documented behaviour. It should be backported to branch-3.4 and branch-3.5. ### Does this PR introduce _any_ user-facing change?
Yes, it fixes a regression as described above. ### How was this patch tested? Unittest was added. ### Was this patch authored or co-authored using generative AI tooling? No.
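The traceback above pins the failure to the batch-slicing step in `conversion.py`: `maxRecordsPerBatch` becomes the `range()` step, and a step of zero is illegal. A standalone sketch of the documented behaviour, where a non-positive value means a single batch, is below; the helper name and the exact guard are illustrative assumptions, not the PR's actual patch.

```python
def slice_into_batches(rows, max_records_per_batch):
    """Mirror the slicing in _create_from_pandas_with_arrow, but treat a
    non-positive limit as 'one single batch' (the documented meaning of
    spark.sql.execution.arrow.maxRecordsPerBatch)."""
    step = max_records_per_batch if max_records_per_batch > 0 else len(rows)
    step = max(step, 1)  # guard the empty-input case so range() never gets 0
    return [rows[start:start + step] for start in range(0, len(rows), step)]


# A step of 0 or -1 used to raise `range() arg 3 must not be zero`;
# with the guard, each now yields the whole input as one batch.
assert slice_into_batches([123], 0) == [[123]]
assert slice_into_batches([123], -1) == [[123]]
assert slice_into_batches([1, 2, 3], 2) == [[1, 2], [3]]
```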
Re: [PR] [WIP] TransformWithState Batch Support [spark]
ericm-db closed pull request #44963: [WIP] TransformWithState Batch Support URL: https://github.com/apache/spark/pull/44963
Re: [PR] [WIP] POC to add TTL for ValueState [spark]
ericm-db closed pull request #45072: [WIP] POC to add TTL for ValueState URL: https://github.com/apache/spark/pull/45072
Re: [PR] [wip] value state ttl poc [spark]
ericm-db closed pull request #45108: [wip] value state ttl poc URL: https://github.com/apache/spark/pull/45108
[PR] [WIP] Add Variant type info to PySpark [spark]
desmondcheongzx opened a new pull request, #45131: URL: https://github.com/apache/spark/pull/45131 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
viirya commented on code in PR #45128: URL: https://github.com/apache/spark/pull/45128#discussion_r1491824408 ## .github/workflows/build_apple_silicon.yml: ## @@ -0,0 +1,47 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +name: "Build / Apple Silicon (master, JDK 21)" + +on: + schedule: +- cron: '0 20 * * *' + +jobs: + run-build: +permissions: + packages: write +name: Run +uses: ./.github/workflows/build_and_test.yml +runs-on: macos-14 +if: github.repository == 'apache/spark' +with: + java: 21 + branch: master + hadoop: hadoop3 + envs: >- +{ + "SKIP_MIMA": "true", + "SKIP_UNIDOC": "true", + "DEDICATED_JVM_SBT_TESTS": "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite,org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV1Suite,org.apache.spark.sql.execution.datasources.orc.OrcSourceV2Suite" +} + jobs: >- +{ + "build": "true" Review Comment: We don't do other jobs like `tpcds-1g`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
dongjoon-hyun commented on PR #45128: URL: https://github.com/apache/spark/pull/45128#issuecomment-1947552568 Could you review this Daily GitHub Action job, `build_apple_silicon.yml`, @viirya ? It will work like `build_java21.yml` and we need to see the result after merging. - https://github.com/apache/spark/actions/workflows/build_java21.yml
Re: [PR] [SPARK-45669][CORE] Ensure the continuity of rolling log index [spark]
github-actions[bot] closed pull request #43534: [SPARK-45669][CORE] Ensure the continuity of rolling log index URL: https://github.com/apache/spark/pull/43534
[PR] Encoder ttl poc [spark]
ericm-db opened a new pull request, #45130: URL: https://github.com/apache/spark/pull/45130 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
[PR] [SPARK-47067][INFRA] Add Daily Apple Silicon Github Action Job (Java/Scala) [spark]
dongjoon-hyun opened a new pull request, #45128: URL: https://github.com/apache/spark/pull/45128 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]
dongjoon-hyun closed pull request #45126: [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI URL: https://github.com/apache/spark/pull/45126
Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]
dongjoon-hyun commented on PR #45126: URL: https://github.com/apache/spark/pull/45126#issuecomment-1947433794 Thank you so much, @viirya .
Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]
dongjoon-hyun commented on PR #45126: URL: https://github.com/apache/spark/pull/45126#issuecomment-1947434491 Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]
dongjoon-hyun commented on PR #45126: URL: https://github.com/apache/spark/pull/45126#issuecomment-1947425316 Can I get your approval, @viirya ?
Re: [PR] [SPARK-47066][INFRA] Add `Apple Silicon` Maven build test to GitHub Action CI [spark]
dongjoon-hyun commented on PR #45126: URL: https://github.com/apache/spark/pull/45126#issuecomment-1947425044 Yes, it does~
Re: [PR] [WIP][SPARK-46947][CORE] Delay memory manager initialization until Driver plugin is loaded [spark]
sunchao commented on PR #45052: URL: https://github.com/apache/spark/pull/45052#issuecomment-1947389764 Sure, thanks @mridulm in advance!
Re: [PR] [SPARK-46743] Count bug after constant folding [spark]
agubichev commented on PR #45125: URL: https://github.com/apache/spark/pull/45125#issuecomment-1947380207 @jchen5
[PR] [WIP][SQL] Fix supported interval formats in error messages [spark]
MaxGekk opened a new pull request, #45127: URL: https://github.com/apache/spark/pull/45127 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the existing test suite: ``` $ build/sbt "test:testOnly *IntervalUtilsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No.
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
chaoqin-li1123 commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1491663766 ## python/pyspark/sql/datasource.py: ## @@ -298,6 +320,117 @@ def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]: ... +class DataSourceStreamReader(ABC): +""" +A base class for streaming data source readers. Data source stream readers are responsible +for outputting data from a streaming data source. + +.. versionadded: 4.0.0 +""" + +def initialOffset(self) -> dict: +""" +Return the initial offset of the streaming data source. +A new streaming query starts reading data from the initial offset. +If Spark is restarting an existing query, it will restart from the check-pointed offset +rather than the initial one. + +Returns +--- +dict +A dict whose key and values are str type. +""" +... +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "initialOffset"}, +) + +def latestOffset(self) -> dict: +""" +Returns the most recent offset available. + +Returns +--- +dict +A dict whose key and values are str type. +""" +... +raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "latestOffset"}, +) + +def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]: +""" +Returns a list of InputPartition given the start and end offsets. Each InputPartition +represents a data split that can be processed by one Spark task. + +Parameters +-- +start : dict +The start offset of the microbatch to plan partitioning. +end : dict +The end offset of the microbatch to plan partitioning. + +Returns +--- +Sequence[InputPartition] +A sequence of partitions for this data source. Each partition value +must be an instance of `InputPartition` or a subclass of it. +""" +... 
+raise PySparkNotImplementedError( +error_class="NOT_IMPLEMENTED", +message_parameters={"feature": "partitions"}, +) + +@abstractmethod +def read(self, partition) -> Iterator[Union[Tuple, Row]]: +""" +Generates data for a given partition and returns an iterator of tuples or rows. + +This method is invoked once per partition to read the data. Implementing +this method is required for stream reader. You can initialize any +non-serializable resources required for reading data from the data source +within this method. +This method is static and stateless. You shouldn't access mutable class member +or keep in memory state between different invocations of read(). + +Parameters +-- +partition : object +The partition to read. It must be one of the partition values returned by +``partitions()``. + +Returns +--- +Iterator[Tuple] or Iterator[Row] +An iterator of tuples or rows. Each tuple or row will be converted to a row +in the final DataFrame. +""" +... + +def commit(self, end: dict): Review Comment: Return type added.
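For reference, the contract spelled out in those docstrings can be exercised end to end without Spark. The sketch below fakes `InputPartition` so it runs standalone; the reader class and its offsets are invented for illustration, and only the method signatures follow the diff.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class InputPartition:
    """Stand-in for pyspark.sql.datasource.InputPartition (carries a value)."""
    value: int


class CounterStreamReader:
    """Toy DataSourceStreamReader: offsets are dicts of strings,
    one partition per offset step, one row per partition."""

    def initialOffset(self) -> dict:
        return {"offset": "0"}

    def latestOffset(self) -> dict:
        return {"offset": "3"}

    def partitions(self, start: dict, end: dict) -> List[InputPartition]:
        return [InputPartition(i)
                for i in range(int(start["offset"]), int(end["offset"]))]

    def read(self, partition: InputPartition) -> Iterator[Tuple[int]]:
        yield (partition.value,)

    def commit(self, end: dict) -> None:
        pass  # clean up source data below `end`, if the source supports it


# One microbatch: plan partitions between two offsets, then read each one.
reader = CounterStreamReader()
parts = reader.partitions(reader.initialOffset(), reader.latestOffset())
rows = [row for p in parts for row in reader.read(p)]
```

Each partition would normally be read by a separate Spark task; here the list comprehension plays that role.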
Re: [PR] [SPARK-47066][INFRA] Add Apple Silicon Maven build test to GitHub Action CI [spark]
viirya commented on PR #45126: URL: https://github.com/apache/spark/pull/45126#issuecomment-1947327388 Yea, the new pipeline looks like passed already.
Re: [PR] [SPARK-47053][INFRA] Bump python libraries (pandas, pyarrow) in Docker image for release script [spark]
xinrong-meng commented on PR #45110: URL: https://github.com/apache/spark/pull/45110#issuecomment-1947099775 Thank you @HeartSaVioR !
Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]
xinrong-meng commented on PR #45113: URL: https://github.com/apache/spark/pull/45113#issuecomment-1947085069 Oh I saw you upgrading mypy, great! Thanks
Re: [PR] [SPARK-47054][PYTHON][TESTS] Remove pinned version of torch for Python 3.12 support [spark]
xinrong-meng commented on PR #45113: URL: https://github.com/apache/spark/pull/45113#issuecomment-1947070089 MyPy seems to complain about an error code `method-assign` in torchdynamo: ``` /usr/local/lib/python3.9/dist-packages/torch/_dynamo/eval_frame.py:1: error: disable_error_code: Invalid error code(s): method-assign [misc] ``` Not sure how to fix it though..
[PR] [SPARK-47066][INFRA] Add Apple Silicon Maven build test to GitHub Action CI [spark]
dongjoon-hyun opened a new pull request, #45126: URL: https://github.com/apache/spark/pull/45126 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
chaoqin-li1123 commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1491464370

## sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala:

```scala
@@ -0,0 +1,229 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.spark.sql.execution.python

import org.apache.spark.SparkException
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.IntegratedUDFTestUtils.{createUserDefinedPythonDataSource, shouldTestPandasUDFs}
import org.apache.spark.sql.execution.datasources.v2.python.{PythonDataSourceV2, PythonMicroBatchStream, PythonStreamingSourceOffset}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase {

  protected def simpleDataStreamReaderScript: String =
    """
      |from pyspark.sql.datasource import DataSourceStreamReader, InputPartition
      |
      |class SimpleDataStreamReader(DataSourceStreamReader):
      |    def initialOffset(self):
      |        return {"offset": "0"}
      |    def latestOffset(self):
      |        return {"offset": "2"}
      |    def partitions(self, start: dict, end: dict):
      |        return [InputPartition(i) for i in range(int(start["offset"]))]
      |    def commit(self, end: dict):
      |        1 + 2
      |    def read(self, partition):
      |        yield (0, partition.value)
      |        yield (1, partition.value)
      |        yield (2, partition.value)
      |""".stripMargin

  protected def errorDataStreamReaderScript: String =
    """
      |from pyspark.sql.datasource import DataSourceStreamReader, InputPartition
      |
      |class ErrorDataStreamReader(DataSourceStreamReader):
      |    def initialOffset(self):
      |        raise Exception("error reading initial offset")
      |    def latestOffset(self):
      |        raise Exception("error reading latest offset")
      |    def partitions(self, start: dict, end: dict):
      |        raise Exception("error planning partitions")
      |    def commit(self, end: dict):
      |        raise Exception("error committing offset")
      |    def read(self, partition):
      |        yield (0, partition.value)
      |        yield (1, partition.value)
      |        yield (2, partition.value)
      |""".stripMargin

  private val errorDataSourceName = "ErrorDataSource"

  test("simple data stream source") {
    assume(shouldTestPandasUDFs)
    val dataSourceScript =
      s"""
         |from pyspark.sql.datasource import DataSource
         |$simpleDataStreamReaderScript
         |
         |class $dataSourceName(DataSource):
         |    def streamReader(self, schema):
         |        return SimpleDataStreamReader()
         |""".stripMargin
    val inputSchema = StructType.fromDDL("input BINARY")

    val dataSource = createUserDefinedPythonDataSource(dataSourceName, dataSourceScript)
    spark.dataSource.registerPython(dataSourceName, dataSource)
    val pythonDs = new PythonDataSourceV2
    pythonDs.setShortName("SimpleDataSource")
    val stream = new PythonMicroBatchStream(
      pythonDs, dataSourceName, inputSchema, CaseInsensitiveStringMap.empty())

    val initialOffset = stream.initialOffset()
    assert(initialOffset.json == "{\"offset\": \"0\"}")
    for (_ <- 1 to 50) {
      val offset = stream.latestOffset()
      assert(offset.json == "{\"offset\": \"2\"}")
      assert(stream.planInputPartitions(offset, offset).size == 2)
      stream.commit(offset)
    }
    stream.stop()
```

Review Comment: `read` will be implemented in the next PR; it is executed on the executor.
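The reader protocol the embedded script exercises can be tried outside Spark. Below is a pure-Python stand-in: the class names mirror the script, but `InputPartition` and the `[start, end)` partition planning here are simplified illustrations, not the real pyspark classes.

```python
class InputPartition:
    def __init__(self, value):
        self.value = value


class SimpleDataStreamReader:
    def initialOffset(self):
        return {"offset": "0"}

    def latestOffset(self):
        return {"offset": "2"}

    def partitions(self, start, end):
        # Plan one partition per offset in [start, end).
        return [InputPartition(i) for i in range(int(start["offset"]), int(end["offset"]))]

    def read(self, partition):
        # Two rows per partition, tagged with the partition value.
        yield (0, partition.value)
        yield (1, partition.value)


reader = SimpleDataStreamReader()
start, end = reader.initialOffset(), reader.latestOffset()
parts = reader.partitions(start, end)
rows = [row for p in parts for row in reader.read(p)]
print(len(parts), len(rows))  # 2 4
```

This mirrors the microbatch loop in the Scala test above: initial offset, latest offset, plan partitions for the offset range, then read each partition.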
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
chaoqin-li1123 commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1491464124

## python/pyspark/sql/streaming/python_streaming_source_runner.py:

```python
@@ -0,0 +1,159 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
import sys
import json
from typing import IO

from pyspark.accumulators import _accumulatorRegistry
from pyspark.errors import PySparkAssertionError, PySparkRuntimeError
from pyspark.java_gateway import local_connect_and_auth
from pyspark.serializers import (
    read_int,
    write_int,
    write_with_length,
    SpecialLengths,
)
from pyspark.sql.datasource import DataSource
from pyspark.sql.types import (
    _parse_datatype_json_string,
    StructType,
)
from pyspark.util import handle_worker_exception
from pyspark.worker_util import (
    check_python_version,
    read_command,
    pickleSer,
    send_accumulator_updates,
    setup_memory_limits,
    setup_spark_files,
    utf8_deserializer,
)

initial_offset_func_id = 884
latest_offset_func_id = 885
partitions_func_id = 886
commit_func_id = 887


def initial_offset_func(reader, outfile):
    offset = reader.initialOffset()
```

Review Comment: We throw an unimplemented error.
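The numeric function IDs above drive the runner's dispatch between the JVM and the Python worker. Here is a toy sketch of that dispatch idea; the socket/framing plumbing is omitted and `dispatch`/`FakeReader` are illustrative names, not the real runner API.

```python
initial_offset_func_id = 884
latest_offset_func_id = 885
partitions_func_id = 886
commit_func_id = 887


def dispatch(reader, func_id):
    # Simplified: the real runner reads func_id from the JVM over a socket and
    # writes the serialized result back with length-prefixed framing.
    if func_id == initial_offset_func_id:
        return reader.initialOffset()
    if func_id == latest_offset_func_id:
        return reader.latestOffset()
    raise ValueError(f"unknown streaming source function id: {func_id}")


class FakeReader:
    def initialOffset(self):
        return {"offset": "0"}

    def latestOffset(self):
        return {"offset": "2"}


print(dispatch(FakeReader(), initial_offset_func_id))  # {'offset': '0'}
```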
Re: [PR] [SPARK-46962][SS][PYTHON] Add interface for python streaming data source API and implement python worker to run python streaming data source [spark]
chaoqin-li1123 commented on code in PR #45023: URL: https://github.com/apache/spark/pull/45023#discussion_r1491463836

## python/pyspark/sql/datasource.py:

```python
@@ -298,6 +320,117 @@ def read(self, partition: InputPartition) -> Iterator[Union[Tuple, Row]]:
        ...


class DataSourceStreamReader(ABC):
    """
    A base class for streaming data source readers. Data source stream readers are responsible
    for outputting data from a streaming data source.

    .. versionadded: 4.0.0
    """

    def initialOffset(self) -> dict:
        """
        Return the initial offset of the streaming data source.
        A new streaming query starts reading data from the initial offset.
        If Spark is restarting an existing query, it will restart from the check-pointed offset
        rather than the initial one.

        Returns
        -------
        dict
            A dict whose key and values are str type.
        """
        ...
        raise PySparkNotImplementedError(
            error_class="NOT_IMPLEMENTED",
            message_parameters={"feature": "initialOffset"},
        )

    def latestOffset(self) -> dict:
        """
        Returns the most recent offset available.

        Returns
        -------
        dict
            A dict whose key and values are str type.
        """
        ...
        raise PySparkNotImplementedError(
            error_class="NOT_IMPLEMENTED",
            message_parameters={"feature": "latestOffset"},
        )

    def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]:
        """
        Returns a list of InputPartition given the start and end offsets. Each InputPartition
        represents a data split that can be processed by one Spark task.

        Parameters
        ----------
        start : dict
            The start offset of the microbatch to plan partitioning.
        end : dict
            The end offset of the microbatch to plan partitioning.

        Returns
        -------
        Sequence[InputPartition]
            A sequence of partitions for this data source. Each partition value
            must be an instance of `InputPartition` or a subclass of it.
        """
        ...
        raise PySparkNotImplementedError(
            error_class="NOT_IMPLEMENTED",
            message_parameters={"feature": "partitions"},
        )

    @abstractmethod
    def read(self, partition) -> Iterator[Union[Tuple, Row]]:
        """
        Generates data for a given partition and returns an iterator of tuples or rows.

        This method is invoked once per partition to read the data. Implementing
        this method is required for stream reader. You can initialize any
        non-serializable resources required for reading data from the data source
        within this method.
        This method is static and stateless. You shouldn't access mutable class member
        or keep in memory state between different invocations of read().

        Parameters
        ----------
        partition : object
            The partition to read. It must be one of the partition values returned by
            ``partitions()``.

        Returns
        -------
        Iterator[Tuple] or Iterator[Row]
            An iterator of tuples or rows. Each tuple or row will be converted to a row
            in the final DataFrame.
        """
        ...

    def commit(self, end: dict):
```

Review Comment: This function is not supposed to return anything; how do we express that in Python?
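On the question of a function that returns nothing: Python's convention is a `-> None` return annotation, which type checkers such as mypy will then enforce. A minimal sketch with a hypothetical class name (not the real pyspark class):

```python
from abc import ABC


class StreamReaderSketch(ABC):
    """Hypothetical stand-in for DataSourceStreamReader, not the real pyspark class."""

    def commit(self, end: dict) -> None:
        # The `-> None` annotation documents that commit() returns nothing,
        # and mypy will flag callers that try to use its "result".
        ...


reader = StreamReaderSketch()
result = reader.commit({"offset": "2"})
print(result)  # None
```

At runtime a function without a `return` statement yields `None` anyway; the annotation simply makes that contract explicit and statically checkable.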
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun closed pull request #45120: [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` URL: https://github.com/apache/spark/pull/45120
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun commented on PR #45120: URL: https://github.com/apache/spark/pull/45120#issuecomment-1946929639 Thank you so much, @huaxingao !
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
huaxingao commented on PR #45120: URL: https://github.com/apache/spark/pull/45120#issuecomment-1946858479 LGTM. Thanks for the PR!
[PR] Count bug in temp views [spark]
agubichev opened a new pull request, #45125: URL: https://github.com/apache/spark/pull/45125
Re: [PR] [SPARK-46400][CORE][SQL] When there are corrupted files in the local maven repo, skip this cache and try again [spark]
dongjoon-hyun commented on PR #44343: URL: https://github.com/apache/spark/pull/44343#issuecomment-1946772230 Thank you, @panbingkun and all!
Re: [PR] [SPARK-46400][CORE][SQL][3.5] When there are corrupted files in the local maven repo, skip this cache and try again [spark]
dongjoon-hyun commented on PR #45017: URL: https://github.com/apache/spark/pull/45017#issuecomment-1946768228 Please update the JIRA for the backporting PRs, @LuciferYang .
Re: [PR] [SPARK-46400][CORE][SQL][3.4] When there are corrupted files in the local maven repo, skip this cache and try again [spark]
dongjoon-hyun closed pull request #45018: [SPARK-46400][CORE][SQL][3.4] When there are corrupted files in the local maven repo, skip this cache and try again URL: https://github.com/apache/spark/pull/45018
Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun closed pull request #45124: [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` URL: https://github.com/apache/spark/pull/45124
Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun commented on PR #45124: URL: https://github.com/apache/spark/pull/45124#issuecomment-1946730394 Let me merge this because this affects only `HiveExternalCatalogVersionsSuite` and I verified it manually.
Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun commented on PR #45124: URL: https://github.com/apache/spark/pull/45124#issuecomment-1946726359 Thank you so much, @viirya .
Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun commented on code in PR #45124: URL: https://github.com/apache/spark/pull/45124#discussion_r1491387564

## sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala:

```scala
@@ -95,7 +95,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
     mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
     logInfo(s"Trying to download Spark $version from $sites")
     for (site <- sites) {
-      val filename = s"spark-$version-bin-hadoop3.tgz"
+      val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"
```

Review Comment: We can change it back to `*-hadoop3.tgz` at Apache Spark 4.1, after the official Apache Spark 4.0 release.
Re: [PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun commented on PR #45124: URL: https://github.com/apache/spark/pull/45124#issuecomment-1946711084 Could you review this test PR, @viirya ?
[PR] [SPARK-47064][SQL][TESTS] Use Scala 2.13 Spark distribution in `HiveExternalCatalogVersionsSuite` [spark]
dongjoon-hyun opened a new pull request, #45124: URL: https://github.com/apache/spark/pull/45124

### What changes were proposed in this pull request?
This PR aims to use the `Scala 2.13` Spark binary in `HiveExternalCatalogVersionsSuite`.

### Why are the changes needed?
SPARK-45314 made Scala 2.13 the default Scala version. As one of the migration paths, users may choose Apache Spark 3.5.0 (Scala 2.13) and Apache Spark 3.4.2 (Scala 2.13). We had better focus on Scala 2.13 testing.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun commented on PR #45120: URL: https://github.com/apache/spark/pull/45120#issuecomment-1946669694 All tests passed. ![Screenshot 2024-02-15 at 09 23 38](https://github.com/apache/spark/assets/9700541/5ad55819-8c29-49f4-a1c8-6464f8a2a84c)
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun commented on PR #45120: URL: https://github.com/apache/spark/pull/45120#issuecomment-1946623436 Could you review this PR, please, @huaxingao ?
Re: [PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun commented on PR #45120: URL: https://github.com/apache/spark/pull/45120#issuecomment-1946622765 All tests passed except one irrelevant pyspark test pipeline.
Re: [PR] [SPARK-47055][PYTHON] Upgrade MyPy 1.8.0 [spark]
dongjoon-hyun commented on PR #45115: URL: https://github.com/apache/spark/pull/45115#issuecomment-1946533127 The Python linter still seems to complain.
Re: [PR] [SPARK-47040][CONNECT][FOLLOWUP] Improve usability of the start-connect-server-script.sh [spark]
dongjoon-hyun commented on PR #45117: URL: https://github.com/apache/spark/pull/45117#issuecomment-1946527338 Thank you for the update, @grundprinzip .
[PR] [SPARK-47050][SQL] Collect and publish partition level metrics [spark]
snmvaughan opened a new pull request, #45123: URL: https://github.com/apache/spark/pull/45123

We currently capture metrics which include the number of files, bytes, and rows for a task, along with the updated partitions. This change captures metrics for each updated partition, reporting the partition sub-paths along with the number of files, bytes, and rows per partition for each task.

### What changes were proposed in this pull request?
- Update the `WriteTaskStatsTracker` implementation to associate a partition with new files during writing, and to track the number of rows written to each file. The final stats now include a map of partitions and the associated stats (number of committed files, bytes, and rows).
- Update the `WriteJobStatsTracker` implementation to capture the partition sub-paths and to publish a new event to the listener bus. The processed stats aggregate the statistics for each partition which are reported by the executors.
- Add a new `SparkListenerEvent` used to publish the task's collected partition metrics.

### Why are the changes needed?
This increases our understanding of written data by tracking the impact of each task on our datasets.

### Does this PR introduce _any_ user-facing change?
This adds an additional event which provides partition-level data to listeners.

### How was this patch tested?
In addition to the new unit tests, this was run in a Kubernetes environment, writing tables with differing partitioning strategies and validating the reported stats. In all cases where partitioning was enabled, we also verified that the aggregated partition metrics matched the existing metrics for number of files, bytes, and rows.

### Was this patch authored or co-authored using generative AI tooling?
No
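The per-partition aggregation the description sketches can be pictured as folding each task's partition-keyed stats into job-level totals. This is a toy Python illustration only; the partition sub-paths and tuple layout are made up, and the real trackers are Scala classes.

```python
from collections import defaultdict

# Hypothetical per-task stats: {partition_subpath: (files, bytes, rows)}.
task_stats = [
    {"dt=2024-02-15": (1, 1024, 10), "dt=2024-02-16": (1, 2048, 20)},
    {"dt=2024-02-15": (2, 4096, 40)},
]

# Aggregate each partition's (files, bytes, rows) across all tasks.
totals = defaultdict(lambda: [0, 0, 0])
for stats in task_stats:
    for partition, (files, nbytes, rows) in stats.items():
        totals[partition][0] += files
        totals[partition][1] += nbytes
        totals[partition][2] += rows

print(dict(totals))
# {'dt=2024-02-15': [3, 5120, 50], 'dt=2024-02-16': [1, 2048, 20]}
```

Summing the partition totals should reproduce the existing job-level file/byte/row metrics, which is the consistency check the PR description mentions.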
Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]
MaxGekk closed pull request #45121: [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command URL: https://github.com/apache/spark/pull/45121
Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]
MaxGekk commented on PR #45121: URL: https://github.com/apache/spark/pull/45121#issuecomment-1946340230 +1, LGTM. Merging to master. Thank you, @cloud-fan.
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
nchammas commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1491176098

## core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:

```scala
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)
```

Review Comment: Added just below.
Re: [PR] [SPARK-47062][CONNECT] Move Connect Plugins to Java for Compatibility [spark]
grundprinzip commented on PR #45114: URL: https://github.com/apache/spark/pull/45114#issuecomment-1946297630 Will update the PR description.
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
nchammas commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1491158277

## core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:

```scala
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }

+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
```

Review Comment: Yes, the differences are subtle:

```scala
scala> 0.0 == -0.0
val res0: Boolean = true

scala> 0.0 equals -0.0
val res1: Boolean = false

scala> Double.NaN == Double.NaN
val res2: Boolean = false

scala> Double.NaN equals Double.NaN
val res3: Boolean = true
```

There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes: [Can we get rid of cooperative equality?](https://contributors.scala-lang.org/t/can-we-get-rid-of-cooperative-equality/1131)
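The same tension between `==` and object equality shows up in CPython's built-in containers, which resolve it in yet another way. A quick illustration, assuming CPython semantics:

```python
# In CPython, 0.0 == -0.0 and they hash identically, so dict/set fold them into
# one key; the OpenHashSet fix discussed here makes the opposite choice
# (object equality), keeping 0.0 and -0.0 distinct.
d = {}
d[0.0] = "zero"
d[-0.0] = "negative zero"   # updates the value; does not add a second key

# NaN goes the other way: distinct NaN objects never compare equal, but
# containers short-circuit on identity, so the *same* NaN object can still
# be found again.
n = float("nan")
s = {n, float("nan")}       # two elements: the two NaN objects are not equal

print(len(d), len(s), n in s, float("nan") in s)  # 1 2 True False
```

Any hash container has to commit to one equality notion for both insertion and lookup; the bug fixed by SPARK-45599 came from mixing the two.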
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
nchammas commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1491125038

## core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
```scala
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)
```

Review Comment: Yes sir.
On `master`, this is the actual behavior of `OpenHashSet`:
```scala
// ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
assert(set.contains(Double.NaN))

// ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
assert(set.size == 1)
```
Every NaN will get its own entry in `OpenHashSet` on `master`. So if we add 1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And `.contains()` will _still_ return `false`. :D

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
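For contrast with the broken `master` behavior quoted above, any equals-based JVM hash set already collapses repeated NaNs into a single entry and keeps `contains` working — the behavior this PR restores for `OpenHashSet`. A quick illustration with plain `java.util.HashSet` (not Spark code):

```java
import java.util.HashSet;
import java.util.Set;

public class NaNDedup {
    public static void main(String[] args) {
        Set<Double> set = new HashSet<>();
        set.add(Double.NaN);
        set.add(Double.NaN);

        // HashSet keys on equals()/hashCode(), so the second NaN is a no-op
        // and lookups of NaN succeed.
        System.out.println(set.size());               // 1
        System.out.println(set.contains(Double.NaN)); // true
    }
}
```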
Re: [PR] [SPARK-45789][SQL] Support DESCRIBE TABLE for clustering columns [spark]
cloud-fan commented on code in PR #45077: URL: https://github.com/apache/spark/pull/45077#discussion_r1490980421

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/DescribeTableSuiteBase.scala:
```scala
@@ -178,4 +178,43 @@ trait DescribeTableSuiteBase extends QueryTest with DDLCommandTestUtils {
       assert(errMsg === "DESC TABLE COLUMN does not support nested column: col.x.")
     }
   }
+
+  test("describe a clustered table") {
+    withNamespaceAndTable("ns", "tbl") { tbl =>
+      sql(s"CREATE TABLE $tbl (col1 STRING, col2 INT) $defaultUsing CLUSTER BY (col1, col2)")
+      val descriptionDf = sql(s"DESC $tbl")
+      assert(descriptionDf.schema.map(field => (field.name, field.dataType)) === Seq(
+        ("col_name", StringType),
+        ("data_type", StringType),
+        ("comment", StringType)))
+      QueryTest.checkAnswer(
+        descriptionDf,
+        Seq(
+          Row("col1", "string", null),
+          Row("col2", "int", null),
+          Row("# Clustering Information", "", ""),
+          Row("# col_name", "data_type", "comment"),
+          Row("col1", "string", null),
+          Row("col2", "int", null)))
+    }
+  }
+
+  test("describe a clustered table with comments on clustering columns") {
+    withNamespaceAndTable("ns", "tbl") { tbl =>
+      sql(s"CREATE TABLE $tbl (col1 STRING) $defaultUsing CLUSTER BY (col1)")
+      sql(s"ALTER TABLE $tbl ALTER COLUMN col1 COMMENT 'this is comment';")
```

Review Comment: since we only want to test DESCRIBE TABLE, not ALTER COLUMN, shall we specify the column comment in CREATE TABLE? then we don't need to exclude this test case in the hive test.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
cloud-fan commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1490973900

## core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
```scala
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers {
       assert(pos1 == pos2)
     }
   }
+
+  test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
+    // Therefore, 0.0 and -0.0 should get separate entries in the hash set.
+    //
+    // Exactly these elements provided in roughly this order will trigger the following scenario:
+    // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0.
+    // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because
+    // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position
+    // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return
+    // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to
+    // return the wrong value for a key based on whether or not this bitset lookup collision
+    // happens.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+    val set = new OpenHashSet[Double]()
+    spark45599Repro.foreach(set.add)
+    assert(set.size == 6)
+    val zeroPos = set.getPos(0.0)
+    val negZeroPos = set.getPos(-0.0)
+    assert(zeroPos != negZeroPos)
+  }
+
+  test("SPARK-45599: NaN and NaN are the same but not equal") {
+    // Any mathematical comparison to NaN will return false, but when we place it in
+    // a hash set we want the lookup to work like a "normal" value.
+    val set = new OpenHashSet[Double]()
+    set.add(Double.NaN)
+    set.add(Double.NaN)
+    assert(set.contains(Double.NaN))
+    assert(set.size == 1)
```

Review Comment: do we actually waste space in `OpenHashSet` to store all the NaN values?

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
cloud-fan commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1490972974

## core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala:
```scala
@@ -249,4 +249,32 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers {
     map(null) = null
     assert(map.get(null) === Some(null))
   }
+
+  test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
+    // Exactly these elements provided in roughly this order trigger a condition where lookups of
+    // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
+    // and inconsistently if `==` is used to check for key equality.
+    val spark45599Repro = Seq(
+      Double.NaN,
+      2.0,
+      168.0,
+      Double.NaN,
+      Double.NaN,
+      -0.0,
+      153.0,
+      0.0
+    )
+
+    val map1 = new OpenHashMap[Double, Int]()
+    spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1}))
+    assert(map1(0.0) == 1)
+    assert(map1(-0.0) == 1)
+
+    val map2 = new OpenHashMap[Double, Int]()
+    // Simply changing the order in which the elements are added to the map should not change the
+    // counts for 0.0 and -0.0.
+    spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1}))
+    assert(map2(0.0) == 1)
+    assert(map2(-0.0) == 1)
```

Review Comment: shall we test NaN as well?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
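The counting semantics the test expects can be sketched against a stock equals-based map. Using `java.util.HashMap` (illustrative only, not Spark's `OpenHashMap`): 0.0 and -0.0 stay distinct keys with a count of one each, all three NaNs merge into one key, and insertion order does not matter:

```java
import java.util.HashMap;
import java.util.Map;

public class ZeroCounts {
    public static void main(String[] args) {
        // Same repro sequence as the test under review.
        double[] values = {Double.NaN, 2.0, 168.0, Double.NaN, Double.NaN, -0.0, 153.0, 0.0};

        Map<Double, Integer> counts = new HashMap<>();
        for (double v : values) {
            counts.merge(v, 1, Integer::sum); // merge keys on equals(), not ==
        }

        System.out.println(counts.get(0.0));        // 1
        System.out.println(counts.get(-0.0));       // 1
        System.out.println(counts.get(Double.NaN)); // 3
    }
}
```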
Re: [PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]
cloud-fan commented on code in PR #45036: URL: https://github.com/apache/spark/pull/45036#discussion_r1490967547

## core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
```scala
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }

+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
```

Review Comment: I was told that in scala `==` is the same as `equals`, but `eq` is a different operator. I need to refresh my knowledge now :)

## core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:
```scala
@@ -110,6 +110,18 @@ class OpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](
     this
   }

+  /**
+   * Check if a key exists at the provided position using object equality rather than
+   * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
+   * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
+   *
+   * See: https://issues.apache.org/jira/browse/SPARK-45599
+   */
+  @annotation.nowarn("cat=other-non-cooperative-equals")
+  private def keyExistsAtPos(k: T, pos: Int) =
+    // _data(pos) == k
```

Review Comment: we should remove it

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
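A hypothetical sketch of the equals-based probe under discussion — the name mirrors `keyExistsAtPos` from the diff, but this is illustrative Java, not the actual Spark implementation:

```java
public class KeyProbe {
    // Hypothetical stand-in for OpenHashSet's probe check; T is the boxed key type.
    static <T> boolean keyExistsAtPos(T[] data, T k, int pos) {
        // equals() gives object equality: NaN matches NaN, while -0.0 does not
        // match 0.0 -- the opposite of a primitive `==` comparison on doubles.
        return data[pos].equals(k);
    }

    public static void main(String[] args) {
        Double[] slots = {Double.NaN, 0.0};
        System.out.println(keyExistsAtPos(slots, Double.NaN, 0)); // true
        System.out.println(keyExistsAtPos(slots, -0.0, 1));       // false
    }
}
```

With an `==`-based probe both answers would flip, which is exactly the mishandling the doc comment in the diff describes.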
Re: [PR] [MINOR][SQL] Show only number of test blocks when there is a mismatch [spark]
cloud-fan commented on PR #45083: URL: https://github.com/apache/spark/pull/45083#issuecomment-1946030494 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47026][SQL][TESTS] Enable JSON sources in default value nested type tests [spark]
cloud-fan commented on PR #45086: URL: https://github.com/apache/spark/pull/45086#issuecomment-1946029398 late LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Update pom.xml [spark]
ArunDhamotharan opened a new pull request, #45122: URL: https://github.com/apache/spark/pull/45122 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]
cloud-fan commented on PR #45121: URL: https://github.com/apache/spark/pull/45121#issuecomment-1945785935 cc @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47059][SQL] Attach error context for ALTER COLUMN v1 command [spark]
cloud-fan opened a new pull request, #45121: URL: https://github.com/apache/spark/pull/45121 ### What changes were proposed in this pull request? This is a small fix to improve the error message for ALTER COLUMN. We attach the error context for v1 command as well, making it consistent with v2 command. ### Why are the changes needed? better error message ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36964][CORE][YARN] Share cached dnsToSwitchMapping for yarn locality container requests [spark]
vbmacher commented on PR #34231: URL: https://github.com/apache/spark/pull/34231#issuecomment-1945751728 Hey! I'm receiving this from time to time:
```
24/02/15 09:53:52 ERROR AsyncEventQueue: Dropping event from queue executorManagement. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
```
So can this be "revived"? cc @gaoyajun02

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47058][TESTS] Add `scalastyle` and `checkstyle` rules to ban `AtomicDoubleArray|CompoundOrdering` [spark]
dongjoon-hyun opened a new pull request, #45120: URL: https://github.com/apache/spark/pull/45120 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]
dongjoon-hyun closed pull request #45116: [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` URL: https://github.com/apache/spark/pull/45116 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]
dongjoon-hyun commented on PR #45116: URL: https://github.com/apache/spark/pull/45116#issuecomment-1945684945 Scala and Java linters passed. Merged to master. ![Screenshot 2024-02-15 at 01 26 10](https://github.com/apache/spark/assets/9700541/a55f8fbd-c4ab-4c77-afc5-d2e6f363af5f) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47056][TESTS] Add `scalastyle` and `checkstyle` rules to ban `FileBackedOutputStream` [spark]
dongjoon-hyun commented on PR #45116: URL: https://github.com/apache/spark/pull/45116#issuecomment-1945680565 Thank you, @HyukjinKwon ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org