[GitHub] [spark] ucas010 commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
ucas010 commented on PR #18748: URL: https://github.com/apache/spark/pull/18748#issuecomment-1247586071 java: cannot find symbol; symbol: method recommendForUserSubset(org.apache.spark.sql.Dataset,int); location: variable model of type org.apache.spark.ml.recommendation.ALSModel
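For reference, `recommendForUserSubset` is the API this PR (SPARK-20679, shipped in Spark 2.3.0) added to `ALSModel`, so a "cannot find symbol" compile error like the one above usually means an older spark-mllib artifact is on the compile classpath. Below is a minimal sketch of how the method is called; the toy ratings data and column names are made-up assumptions, not part of the original thread.

```scala
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: tiny in-memory ratings with user/item/rating columns.
val spark = SparkSession.builder().appName("als-subset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ratings = Seq(
  (0, 10, 4.0f), (0, 11, 2.0f),
  (1, 10, 5.0f), (1, 12, 3.0f)
).toDF("user", "item", "rating")

val model: ALSModel = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .fit(ratings)

// Top-3 item recommendations for a subset of users only (the feature this PR added).
val someUsers = Seq(0).toDF("user")
model.recommendForUserSubset(someUsers, 3).show(truncate = false)
```

The same PR also added the symmetric `recommendForItemSubset(itemsDF, numUsers)`.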
[GitHub] [spark] AmplabJenkins commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode
AmplabJenkins commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1247568597 Can one of the admins verify this patch?
[GitHub] [spark] HeartSaVioR opened a new pull request, #37894: [DRAFT][DO-NOT-MERGE][SPARK-40435][SS][PYTHON] Add test suites for applyInPandasWithState in PySpark
HeartSaVioR opened a new pull request, #37894: URL: https://github.com/apache/spark/pull/37894 ...TBD... ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] HeartSaVioR opened a new pull request, #37893: [DRAFT][DO-NOT-MERGE][SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HeartSaVioR opened a new pull request, #37893: URL: https://github.com/apache/spark/pull/37893 ...TBD... ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. We are exposing new public API in PySpark which performs arbitrary stateful processing. ### How was this patch tested?
[GitHub] [spark] mridulm commented on pull request #37885: [SPARK-40428][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend
mridulm commented on PR #37885: URL: https://github.com/apache/spark/pull/37885#issuecomment-1247553777 `SparkContext.stop` will take care of this - and that is invoked from a shutdown hook already. SparkContext.stop -> _dagScheduler.stop -> taskScheduler.stop -> backend.stop
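To illustrate the point being made, here is a purely schematic sketch with made-up class names (not Spark's actual code): because each layer's `stop()` calls the next layer down, one JVM shutdown hook that stops the top-level context already reaches the scheduler backend, so a separate hook in the backend would be redundant.

```scala
// Schematic sketch of the cascading stop() chain described above.
class Backend                { def stop(): Unit = println("backend stopped") }
class TaskSched(b: Backend)  { def stop(): Unit = { b.stop(); println("task scheduler stopped") } }
class DagSched(t: TaskSched) { def stop(): Unit = { t.stop(); println("DAG scheduler stopped") } }
class Context(d: DagSched)   { def stop(): Unit = { d.stop(); println("context stopped") } }

object ShutdownChainDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new Context(new DagSched(new TaskSched(new Backend)))
    // One hook at the top of the chain suffices; lower layers need none of their own.
    sys.addShutdownHook(ctx.stop())
  }
}
```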
[GitHub] [spark] zhengruifeng commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
zhengruifeng commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247546159 > Actually ExpandingGroupby and RollingGroupby aren't documented in pandas side too. That's true. When I checked the missing functions in https://issues.apache.org/jira/browse/SPARK-40327, it was quite confusing that there is no documentation and I had to manually check whether a function exists.
[GitHub] [spark] zhengruifeng commented on pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation
zhengruifeng commented on PR #37890: URL: https://github.com/apache/spark/pull/37890#issuecomment-1247544737 LGTM2
[GitHub] [spark] HyukjinKwon closed pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation
HyukjinKwon closed pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation URL: https://github.com/apache/spark/pull/37890
[GitHub] [spark] HyukjinKwon commented on pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation
HyukjinKwon commented on PR #37890: URL: https://github.com/apache/spark/pull/37890#issuecomment-1247534406 Merged to master.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HeartSaVioR commented on code in PR #37889: URL: https://github.com/apache/spark/pull/37889#discussion_r971486761 ## python/pyspark/sql/streaming/state.py: ## @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +import datetime +import json +from typing import Tuple, Optional + +from pyspark.sql.types import DateType, Row, StructType + +__all__ = ["GroupStateImpl", "GroupStateTimeout"] + + +class GroupStateTimeout: +NoTimeout: str = "NoTimeout" +ProcessingTimeTimeout: str = "ProcessingTimeTimeout" +EventTimeTimeout: str = "EventTimeTimeout" + + +class GroupStateImpl: +NO_TIMESTAMP: int = -1 + +def __init__( +self, +# JVM Constructor +optionalValue: Row, +batchProcessingTimeMs: int, +eventTimeWatermarkMs: int, +timeoutConf: str, +hasTimedOut: bool, +watermarkPresent: bool, +# JVM internal state. +defined: bool, +updated: bool, +removed: bool, +timeoutTimestamp: int, +# Python internal state. +keyAsUnsafe: bytes, +valueSchema: StructType, +) -> None: +self._keyAsUnsafe = keyAsUnsafe +self._value = optionalValue +self._batch_processing_time_ms = batchProcessingTimeMs +self._event_time_watermark_ms = eventTimeWatermarkMs + +assert timeoutConf in [ +GroupStateTimeout.NoTimeout, +GroupStateTimeout.ProcessingTimeTimeout, +GroupStateTimeout.EventTimeTimeout, +] +self._timeout_conf = timeoutConf + +self._has_timed_out = hasTimedOut +self._watermark_present = watermarkPresent + +self._defined = defined +self._updated = updated +self._removed = removed +self._timeout_timestamp = timeoutTimestamp +# Python internal state. +self._old_timeout_timestamp = timeoutTimestamp + +self._value_schema = valueSchema + +@property +def exists(self) -> bool: +return self._defined + +@property +def get(self) -> Tuple: +if self.exists: +return tuple(self._value) +else: +raise ValueError("State is either not defined or has already been removed") + +@property +def getOption(self) -> Optional[Tuple]: +if self.exists: +return tuple(self._value) +else: +return None + +@property +def hasTimedOut(self) -> bool: +return self._has_timed_out + +# NOTE: this function is only available to PySpark implementation due to underlying +# implementation, do not port to Scala implementation! +@property +def oldTimeoutTimestamp(self) -> int: +return self._old_timeout_timestamp + +def update(self, newValue: Tuple) -> None: +if newValue is None: +raise ValueError("'None' is not a valid state value") + +self._value = Row(*newValue) +self._defined = True +self._updated = True +self._removed = False + +def remove(self) -> None: +self._defined = False +self._updated = False +self._removed = True + +def setTimeoutDuration(self, durationMs: int) -> None: +if isinstance(durationMs, str): +# TODO(SPARK-X): Support string representation of durationMs. 
+raise ValueError("durationMs should be int but get :%s" % type(durationMs)) + +if self._timeout_conf != GroupStateTimeout.ProcessingTimeTimeout: +raise RuntimeError( +"Cannot set timeout duration without enabling processing time timeout in " +"applyInPandasWithState" +) + +if durationMs <= 0: +raise ValueError("Timeout duration must be positive") +self._timeout_timestamp = durationMs + self._batch_processing_time_ms + +# TODO(SPARK-X): Implement additionalDuration parameter. Review Comment: Just reflected it.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HyukjinKwon commented on code in PR #37889: URL: https://github.com/apache/spark/pull/37889#discussion_r971486330 ## python/pyspark/sql/streaming/state.py: ## @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +import datetime +import json +from typing import Tuple, Optional + +from pyspark.sql.types import DateType, Row, StructType + +__all__ = ["GroupStateImpl", "GroupStateTimeout"] + + +class GroupStateTimeout: +NoTimeout: str = "NoTimeout" +ProcessingTimeTimeout: str = "ProcessingTimeTimeout" +EventTimeTimeout: str = "EventTimeTimeout" + + +class GroupStateImpl: +NO_TIMESTAMP: int = -1 + +def __init__( +self, +# JVM Constructor +optionalValue: Row, +batchProcessingTimeMs: int, +eventTimeWatermarkMs: int, +timeoutConf: str, +hasTimedOut: bool, +watermarkPresent: bool, +# JVM internal state. +defined: bool, +updated: bool, +removed: bool, +timeoutTimestamp: int, +# Python internal state. +keyAsUnsafe: bytes, +valueSchema: StructType, +) -> None: +self._keyAsUnsafe = keyAsUnsafe +self._value = optionalValue +self._batch_processing_time_ms = batchProcessingTimeMs +self._event_time_watermark_ms = eventTimeWatermarkMs + +assert timeoutConf in [ +GroupStateTimeout.NoTimeout, +GroupStateTimeout.ProcessingTimeTimeout, +GroupStateTimeout.EventTimeTimeout, +] +self._timeout_conf = timeoutConf + +self._has_timed_out = hasTimedOut +self._watermark_present = watermarkPresent + +self._defined = defined +self._updated = updated +self._removed = removed +self._timeout_timestamp = timeoutTimestamp +# Python internal state. +self._old_timeout_timestamp = timeoutTimestamp + +self._value_schema = valueSchema + +@property +def exists(self) -> bool: +return self._defined + +@property +def get(self) -> Tuple: +if self.exists: +return tuple(self._value) +else: +raise ValueError("State is either not defined or has already been removed") + +@property +def getOption(self) -> Optional[Tuple]: +if self.exists: +return tuple(self._value) +else: +return None + +@property +def hasTimedOut(self) -> bool: +return self._has_timed_out + +# NOTE: this function is only available to PySpark implementation due to underlying +# implementation, do not port to Scala implementation! +@property +def oldTimeoutTimestamp(self) -> int: +return self._old_timeout_timestamp + +def update(self, newValue: Tuple) -> None: +if newValue is None: +raise ValueError("'None' is not a valid state value") + +self._value = Row(*newValue) +self._defined = True +self._updated = True +self._removed = False + +def remove(self) -> None: +self._defined = False +self._updated = False +self._removed = True + +def setTimeoutDuration(self, durationMs: int) -> None: +if isinstance(durationMs, str): +# TODO(SPARK-X): Support string representation of durationMs. 
+raise ValueError("durationMs should be int but get :%s" % type(durationMs)) + +if self._timeout_conf != GroupStateTimeout.ProcessingTimeTimeout: +raise RuntimeError( +"Cannot set timeout duration without enabling processing time timeout in " +"applyInPandasWithState" +) + +if durationMs <= 0: +raise ValueError("Timeout duration must be positive") +self._timeout_timestamp = durationMs + self._batch_processing_time_ms + +# TODO(SPARK-X): Implement additionalDuration parameter. Review Comment: Filed SPARK-40438
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HeartSaVioR commented on code in PR #37889: URL: https://github.com/apache/spark/pull/37889#discussion_r971485439 ## python/pyspark/sql/streaming/state.py: ## @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +import datetime +import json +from typing import Tuple, Optional + +from pyspark.sql.types import DateType, Row, StructType + +__all__ = ["GroupStateImpl", "GroupStateTimeout"] + + +class GroupStateTimeout: +NoTimeout: str = "NoTimeout" +ProcessingTimeTimeout: str = "ProcessingTimeTimeout" +EventTimeTimeout: str = "EventTimeTimeout" + + +class GroupStateImpl: +NO_TIMESTAMP: int = -1 + +def __init__( +self, +# JVM Constructor +optionalValue: Row, +batchProcessingTimeMs: int, +eventTimeWatermarkMs: int, +timeoutConf: str, +hasTimedOut: bool, +watermarkPresent: bool, +# JVM internal state. +defined: bool, +updated: bool, +removed: bool, +timeoutTimestamp: int, +# Python internal state. +keyAsUnsafe: bytes, +valueSchema: StructType, +) -> None: +self._keyAsUnsafe = keyAsUnsafe +self._value = optionalValue +self._batch_processing_time_ms = batchProcessingTimeMs +self._event_time_watermark_ms = eventTimeWatermarkMs + +assert timeoutConf in [ +GroupStateTimeout.NoTimeout, +GroupStateTimeout.ProcessingTimeTimeout, +GroupStateTimeout.EventTimeTimeout, +] +self._timeout_conf = timeoutConf + +self._has_timed_out = hasTimedOut +self._watermark_present = watermarkPresent + +self._defined = defined +self._updated = updated +self._removed = removed +self._timeout_timestamp = timeoutTimestamp +# Python internal state. +self._old_timeout_timestamp = timeoutTimestamp + +self._value_schema = valueSchema + +@property +def exists(self) -> bool: +return self._defined + +@property +def get(self) -> Tuple: +if self.exists: +return tuple(self._value) +else: +raise ValueError("State is either not defined or has already been removed") + +@property +def getOption(self) -> Optional[Tuple]: +if self.exists: +return tuple(self._value) +else: +return None + +@property +def hasTimedOut(self) -> bool: +return self._has_timed_out + +# NOTE: this function is only available to PySpark implementation due to underlying +# implementation, do not port to Scala implementation! +@property +def oldTimeoutTimestamp(self) -> int: +return self._old_timeout_timestamp + +def update(self, newValue: Tuple) -> None: +if newValue is None: +raise ValueError("'None' is not a valid state value") + +self._value = Row(*newValue) +self._defined = True +self._updated = True +self._removed = False + +def remove(self) -> None: +self._defined = False +self._updated = False +self._removed = True + +def setTimeoutDuration(self, durationMs: int) -> None: +if isinstance(durationMs, str): +# TODO(SPARK-X): Support string representation of durationMs. Review Comment: Nice! I'll update them. 
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HyukjinKwon commented on code in PR #37889: URL: https://github.com/apache/spark/pull/37889#discussion_r971485152 ## python/pyspark/sql/streaming/state.py: ## @@ -0,0 +1,192 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +import datetime +import json +from typing import Tuple, Optional + +from pyspark.sql.types import DateType, Row, StructType + +__all__ = ["GroupStateImpl", "GroupStateTimeout"] + + +class GroupStateTimeout: +NoTimeout: str = "NoTimeout" +ProcessingTimeTimeout: str = "ProcessingTimeTimeout" +EventTimeTimeout: str = "EventTimeTimeout" + + +class GroupStateImpl: +NO_TIMESTAMP: int = -1 + +def __init__( +self, +# JVM Constructor +optionalValue: Row, +batchProcessingTimeMs: int, +eventTimeWatermarkMs: int, +timeoutConf: str, +hasTimedOut: bool, +watermarkPresent: bool, +# JVM internal state. +defined: bool, +updated: bool, +removed: bool, +timeoutTimestamp: int, +# Python internal state. +keyAsUnsafe: bytes, +valueSchema: StructType, +) -> None: +self._keyAsUnsafe = keyAsUnsafe +self._value = optionalValue +self._batch_processing_time_ms = batchProcessingTimeMs +self._event_time_watermark_ms = eventTimeWatermarkMs + +assert timeoutConf in [ +GroupStateTimeout.NoTimeout, +GroupStateTimeout.ProcessingTimeTimeout, +GroupStateTimeout.EventTimeTimeout, +] +self._timeout_conf = timeoutConf + +self._has_timed_out = hasTimedOut +self._watermark_present = watermarkPresent + +self._defined = defined +self._updated = updated +self._removed = removed +self._timeout_timestamp = timeoutTimestamp +# Python internal state. +self._old_timeout_timestamp = timeoutTimestamp + +self._value_schema = valueSchema + +@property +def exists(self) -> bool: +return self._defined + +@property +def get(self) -> Tuple: +if self.exists: +return tuple(self._value) +else: +raise ValueError("State is either not defined or has already been removed") + +@property +def getOption(self) -> Optional[Tuple]: +if self.exists: +return tuple(self._value) +else: +return None + +@property +def hasTimedOut(self) -> bool: +return self._has_timed_out + +# NOTE: this function is only available to PySpark implementation due to underlying +# implementation, do not port to Scala implementation! +@property +def oldTimeoutTimestamp(self) -> int: +return self._old_timeout_timestamp + +def update(self, newValue: Tuple) -> None: +if newValue is None: +raise ValueError("'None' is not a valid state value") + +self._value = Row(*newValue) +self._defined = True +self._updated = True +self._removed = False + +def remove(self) -> None: +self._defined = False +self._updated = False +self._removed = True + +def setTimeoutDuration(self, durationMs: int) -> None: +if isinstance(durationMs, str): +# TODO(SPARK-X): Support string representation of durationMs. 
Review Comment: Filed SPARK-40437
[GitHub] [spark] HeartSaVioR commented on pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HeartSaVioR commented on PR #37889: URL: https://github.com/apache/spark/pull/37889#issuecomment-1247528542 @HyukjinKwon I just missed adding a py file - just added it. Could you please take a look again? Thanks!
[GitHub] [spark] LuciferYang commented on pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17
LuciferYang commented on PR #37892: URL: https://github.com/apache/spark/pull/37892#issuecomment-1247526234 Test first
[GitHub] [spark] LuciferYang opened a new pull request, #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17
LuciferYang opened a new pull request, #37892: URL: https://github.com/apache/spark/pull/37892 ### What changes were proposed in this pull request? This PR aims to upgrade Scala to 2.12.17 ### Why are the changes needed? Will add ### Does this PR introduce _any_ user-facing change? Yes, this is a Scala version change. ### How was this patch tested? Existing Test
[GitHub] [spark] zhengruifeng closed pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
zhengruifeng closed pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module URL: https://github.com/apache/spark/pull/37888
[GitHub] [spark] Yikun commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
Yikun commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247519104 @HyukjinKwon OK, thanks!
[GitHub] [spark] HeartSaVioR opened a new pull request, #37891: [SPARK-40433][SS][PYTHON] Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
HeartSaVioR opened a new pull request, #37891: URL: https://github.com/apache/spark/pull/37891 ### What changes were proposed in this pull request? This PR adds toJVMRow to PythonSQLUtils to convert a pickled PySpark Row to a JVM Row. Co-authored with @HyukjinKwon. This is a breakdown PR of https://github.com/apache/spark/pull/37863. ### Why are the changes needed? This change will be leveraged in [SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. We will make sure test suites are constructed in an E2E manner under [SPARK-40431](https://issues.apache.org/jira/browse/SPARK-40431).
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HyukjinKwon commented on code in PR #37889: URL: https://github.com/apache/spark/pull/37889#discussion_r971467767 ## sql/catalyst/src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java: ## @@ -32,6 +32,11 @@ @Experimental @Evolving public class GroupStateTimeout { + // scalastyle:off line.size.limit + // NOTE: if you're adding new type of timeout, you should also fix the places below: + // - Scala: org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString + // - Python: pyspark.sql.streaming.state.GroupStateTimeout + // scalastyle:on line.size.limit Review Comment: ```suggestion // NOTE: if you're adding new type of timeout, you should also fix the places below: // - Scala: // org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString // - Python: pyspark.sql.streaming.state.GroupStateTimeout ``` ## sql/catalyst/src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java: ## @@ -32,6 +32,11 @@ @Experimental @Evolving public class GroupStateTimeout { + // scalastyle:off line.size.limit + // NOTE: if you're adding new type of timeout, you should also fix the places below: + // - Scala: org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString + // - Python: pyspark.sql.streaming.state.GroupStateTimeout + // scalastyle:on line.size.limit Review Comment: no big deal :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
HyukjinKwon commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247508508 Oops, just made a follow-up at https://github.com/apache/spark/pull/37890. Actually `ExpandingGroupby` and `RollingGroupby` aren't documented on the pandas side either. I don't know their intention, though. Let's make a quick follow-up only for `Rolling.quantile` and `Expanding.quantile` for now
[GitHub] [spark] HyukjinKwon opened a new pull request, #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation
HyukjinKwon opened a new pull request, #37890: URL: https://github.com/apache/spark/pull/37890 ### What changes were proposed in this pull request? This PR adds `Rolling.quantile` and `Expanding.quantile` into documentation. ### Why are the changes needed? To show the documentation about the new features to end users. ### Does this PR introduce _any_ user-facing change? No to end users because the original PR is not released yet. ### How was this patch tested? CI in this PR should test it out.
[GitHub] [spark] LuciferYang commented on pull request #37878: [SPARK-40424][CORE][TESTS] Refactor `ChromeUIHistoryServerSuite` to add UTs for RocksDB
LuciferYang commented on PR #37878: URL: https://github.com/apache/spark/pull/37878#issuecomment-1247502653 rebased and updated pr description
[GitHub] [spark] HeartSaVioR opened a new pull request, #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark
HeartSaVioR opened a new pull request, #37889: URL: https://github.com/apache/spark/pull/37889 ### What changes were proposed in this pull request? This PR introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the Scala codebase to support convenient conversion between the PySpark implementation and the Scala implementation. Co-authored with @HyukjinKwon. This is a breakdown PR of #37863. ### Why are the changes needed? This change will be leveraged in SPARK-40434. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. We will make sure test suites are constructed in an E2E manner under SPARK-40431.
[GitHub] [spark] Yikun commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
Yikun commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247496635 @HyukjinKwon @zhengruifeng Thanks > I think we should also add quantile to window.rst Hmm, forgot again. I also noticed that the `ExpandingGroupby` and `RollingGroupby` functions aren't documented, so let me create a separate PR to cover the docs: - all functions for ExpandingGroupby - all functions for RollingGroupby - quantile for Expanding - quantile for Rolling
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971460711 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala: ## @@ -129,20 +131,19 @@ object ArrayBasedMapData { def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = { val keys = map.keyArray.asInstanceOf[GenericArrayData].array val values = map.valueArray.asInstanceOf[GenericArrayData].array -keys.zip(values).toMap +Utils.toMap(keys, values) } def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = { -keys.zip(values).toMap +Utils.toMap(keys, values) } def toScalaMap(keys: scala.collection.Seq[Any], values: scala.collection.Seq[Any]): Map[Any, Any] = { -keys.zip(values).toMap +Utils.toMap(keys, values) } def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] = { -import scala.collection.JavaConverters._ -keys.zip(values).toMap.asJava +Utils.toJavaMap(keys, values) } Review Comment: removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
zhengruifeng commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247482347 @Yikun I think we should also add `quantile` to `window.rst`
[GitHub] [spark] cloud-fan commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
cloud-fan commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971448585 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala: ## @@ -129,20 +131,19 @@ object ArrayBasedMapData { def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = { val keys = map.keyArray.asInstanceOf[GenericArrayData].array val values = map.valueArray.asInstanceOf[GenericArrayData].array -keys.zip(values).toMap +Utils.toMap(keys, values) } def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = { -keys.zip(values).toMap +Utils.toMap(keys, values) } def toScalaMap(keys: scala.collection.Seq[Any], values: scala.collection.Seq[Any]): Map[Any, Any] = { -keys.zip(values).toMap +Utils.toMap(keys, values) } def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] = { -import scala.collection.JavaConverters._ -keys.zip(values).toMap.asJava +Utils.toJavaMap(keys, values) } Review Comment: `ArrayBasedMapData` is not a public API and shouldn't be tracked by mima. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #37874: [SPARK-40421][PS] Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`
zhengruifeng commented on PR #37874: URL: https://github.com/apache/spark/pull/37874#issuecomment-1247477326 Merged into master, thanks @HyukjinKwon
[GitHub] [spark] zhengruifeng closed pull request #37874: [SPARK-40421][PS] Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`
zhengruifeng closed pull request #37874: [SPARK-40421][PS] Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods` URL: https://github.com/apache/spark/pull/37874
[GitHub] [spark] HyukjinKwon closed pull request #37871: [SPARK-40426][SQL] Return a map from SparkThrowable.getMessageParameters
HyukjinKwon closed pull request #37871: [SPARK-40426][SQL] Return a map from SparkThrowable.getMessageParameters URL: https://github.com/apache/spark/pull/37871
[GitHub] [spark] HyukjinKwon commented on pull request #37871: [SPARK-40426][SQL] Return a map from SparkThrowable.getMessageParameters
HyukjinKwon commented on PR #37871: URL: https://github.com/apache/spark/pull/37871#issuecomment-1247474181 Merged to master.
[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sadikovi commented on code in PR #37881: URL: https://github.com/apache/spark/pull/37881#discussion_r971441087 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging { // Partition keys are not available in the statistics of the files. // `dataColumns` might have partition columns, we need to filter them out. - val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains) + val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- partitionColumns Review Comment: I am not an expert in AttributeSet so it would be good if you could explain how it makes this work so I can reference it in the future.
[GitHub] [spark] HyukjinKwon closed pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
HyukjinKwon closed pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby URL: https://github.com/apache/spark/pull/37836
[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sadikovi commented on code in PR #37881: URL: https://github.com/apache/spark/pull/37881#discussion_r971441087 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging { // Partition keys are not available in the statistics of the files. // `dataColumns` might have partition columns, we need to filter them out. - val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains) + val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- partitionColumns Review Comment: I am not an expert in AttributeSet so it would be good if you could explain how it makes this work and pass the test so I can reference it in the future.
[GitHub] [spark] HyukjinKwon commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby
HyukjinKwon commented on PR #37836: URL: https://github.com/apache/spark/pull/37836#issuecomment-1247473695 Merged to master
[GitHub] [spark] huaxingao commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output
huaxingao commented on code in PR #37886: URL: https://github.com/apache/spark/pull/37886#discussion_r971441803 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala: ## @@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends Rule[LogicalPlan] with SQLConfHelpe private def partitioning(plan: LogicalPlan) = plan.transformDown { case d @ DataSourceV2ScanRelation(relation, scan: SupportsReportPartitioning, _, None, _) => val catalystPartitioning = scan.outputPartitioning() match { -case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map( - V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog))) +case kgp: KeyGroupedPartitioning => + val partitioning = sequenceToOption( +kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog))) + if (partitioning.isEmpty) { +None + } else { +val ref = AttributeSet.fromAttributeSets(partitioning.get.map(_.references)) Review Comment: Sounds good! Changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sadikovi commented on code in PR #37881: URL: https://github.com/apache/spark/pull/37881#discussion_r971441087 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging { // Partition keys are not available in the statistics of the files. // `dataColumns` might have partition columns, we need to filter them out. - val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains) + val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- partitionColumns Review Comment: I am not an expert in AttributeSet so it would be good if you could explain how it works so I can reference it in the future.
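On the `AttributeSet` question above, here is a short, self-contained sketch (hypothetical column names; requires spark-catalyst on the classpath) of the relevant behaviour: membership in an `AttributeSet` is keyed on the attribute's `exprId`, so `--` and `subsetOf` match attributes that refer to the same column even when the attribute instances are not `equals` (for example a renamed copy), which plain `Seq`-based `filterNot(contains)` would miss. This only illustrates the data structure, not the exact scenario the PR's test covers.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

// A partition column and a copy of it that shares the exprId but differs in name case.
val part = AttributeReference("p", IntegerType)()   // fresh exprId
val partUpper = part.withName("P")                  // same exprId, different name
val data = AttributeReference("c", IntegerType)()

val dataColumns = Seq(data, partUpper)
val partitionColumns = Seq(part)

// Seq-based filtering compares whole attributes, so the renamed partition column slips through.
println(dataColumns.filterNot(partitionColumns.contains))  // keeps both c and P

// AttributeSet subtraction compares by exprId, so the partition column is removed.
println(AttributeSet(dataColumns) -- partitionColumns)     // keeps only c
```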
[GitHub] [spark] sarutak closed pull request #37868: [SPARK-40397][BUILD] Upgrade `org.scalatestplus:selenium` to 3.12.13
sarutak closed pull request #37868: [SPARK-40397][BUILD] Upgrade `org.scalatestplus:selenium` to 3.12.13 URL: https://github.com/apache/spark/pull/37868
[GitHub] [spark] sarutak commented on pull request #37868: [SPARK-40397][BUILD] Upgrade `org.scalatestplus:selenium` to 3.12.13
sarutak commented on PR #37868: URL: https://github.com/apache/spark/pull/37868#issuecomment-1247470619 Merging to `master`.
[GitHub] [spark] HyukjinKwon commented on pull request #37878: [SPARK-40424][CORE][TESTS] Refactor `ChromeUIHistoryServerSuite` to add UTs for RocksDB
HyukjinKwon commented on PR #37878: URL: https://github.com/apache/spark/pull/37878#issuecomment-1247466952 cc @sarutak FYI
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971425920 ## core/src/main/scala/org/apache/spark/util/collection/Utils.scala: ## @@ -62,4 +63,30 @@ private[spark] object Utils { */ def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] = if (input.forall(_.isDefined)) Some(input.flatten) else None + + /** + * Same function as `keys.zip(values).toMap`, but has perf gain. + */ + def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = { +val builder = immutable.Map.newBuilder[K, V] +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)] +} +builder.result() + } + + /** + * Same function as `keys.zip(values).toMap.asJava`, but has perf gain. + */ + def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = { +val map = new java.util.HashMap[K, V]() +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + map.put(keyIter.next(), valueIter.next()) +} +map Review Comment: u are right, change to return `Collections.unmodifiableMap(map)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output
cloud-fan commented on code in PR #37886: URL: https://github.com/apache/spark/pull/37886#discussion_r971424921 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala: ## @@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends Rule[LogicalPlan] with SQLConfHelpe private def partitioning(plan: LogicalPlan) = plan.transformDown { case d @ DataSourceV2ScanRelation(relation, scan: SupportsReportPartitioning, _, None, _) => val catalystPartitioning = scan.outputPartitioning() match { -case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map( - V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog))) +case kgp: KeyGroupedPartitioning => + val partitioning = sequenceToOption( +kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog))) + if (partitioning.isEmpty) { +None + } else { +val ref = AttributeSet.fromAttributeSets(partitioning.get.map(_.references)) Review Comment: how about `partitioning.get.forall(p => p.references.subsetOf(...))` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output
cloud-fan commented on code in PR #37886: URL: https://github.com/apache/spark/pull/37886#discussion_r971424375

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala:
## @@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends Rule[LogicalPlan] with SQLConfHelpe

   private def partitioning(plan: LogicalPlan) = plan.transformDown {
     case d @ DataSourceV2ScanRelation(relation, scan: SupportsReportPartitioning, _, None, _) =>
       val catalystPartitioning = scan.outputPartitioning() match {
-        case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map(
-          V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
+        case kgp: KeyGroupedPartitioning =>
+          val partitioning = sequenceToOption(
+            kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
+          if (partitioning.isEmpty) {
+            None
+          } else {
+            val ref = AttributeSet.fromAttributeSets(partitioning.get.map(_.references))
+            if (ref.subsetOf(AttributeSet(d.output))) {

Review Comment:
```suggestion
            if (ref.subsetOf(d.outputSet)) {
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
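A small sketch of what the suggested check amounts to, using Catalyst's `AttributeSet` directly. It assumes `spark-catalyst` on the classpath and is only an illustration of the `subsetOf` semantics being discussed, not the code in the PR; the attribute names are invented.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

object SubsetCheckSketch {
  def main(args: Array[String]): Unit = {
    // Pretend these are the scan's output attributes.
    val a = AttributeReference("a", IntegerType)()
    val b = AttributeReference("b", IntegerType)()
    val outputSet = AttributeSet(Seq(a, b))

    // A partitioning expression that only references column `a` passes the check;
    // one that references an attribute missing from the output does not.
    val okRefs = AttributeSet(Seq(a))
    val badRefs = AttributeSet(Seq(AttributeReference("c", IntegerType)()))

    println(okRefs.subsetOf(outputSet))  // true
    println(badRefs.subsetOf(outputSet)) // false
  }
}
```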
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971411597

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
## @@ -129,20 +131,19 @@ object ArrayBasedMapData {

   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
     val keys = map.keyArray.asInstanceOf[GenericArrayData].array
     val values = map.valueArray.asInstanceOf[GenericArrayData].array
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: scala.collection.Seq[Any], values: scala.collection.Seq[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] = {
-    import scala.collection.JavaConverters._
-    keys.zip(values).toMap.asJava
+    Utils.toJavaMap(keys, values)
   }

Review Comment: let me check this later

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971408078 ## core/src/main/scala/org/apache/spark/util/collection/Utils.scala: ## @@ -62,4 +63,30 @@ private[spark] object Utils { */ def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] = if (input.forall(_.isDefined)) Some(input.flatten) else None + + /** + * Same function as `keys.zip(values).toMap`, but has perf gain. + */ + def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = { +val builder = immutable.Map.newBuilder[K, V] +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)] +} +builder.result() + } + + /** + * Same function as `keys.zip(values).toMap.asJava`, but has perf gain. + */ + def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = { +val map = new java.util.HashMap[K, V]() +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + map.put(keyIter.next(), valueIter.next()) +} +map Review Comment: ~~this method return a Java Map, how to make it immutable...~~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971411597

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
## @@ -129,20 +131,19 @@ object ArrayBasedMapData {

   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
     val keys = map.keyArray.asInstanceOf[GenericArrayData].array
     val values = map.valueArray.asInstanceOf[GenericArrayData].array
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: scala.collection.Seq[Any], values: scala.collection.Seq[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] = {
-    import scala.collection.JavaConverters._
-    keys.zip(values).toMap.asJava
+    Utils.toJavaMap(keys, values)
   }

Review Comment: `ArrayBasedMapData#toJavaMap` is already an unused method; I think we can delete it, but we need to confirm first whether the MiMa check passes.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zzccctv commented on pull request #31302: [SPARK-34210][SQL] After upgrading 3.0.1, Spark SQL access hive on HBase table access exception
zzccctv commented on PR #31302: URL: https://github.com/apache/spark/pull/31302#issuecomment-1247436633

> How should I apply this change

Change it to the code in the PR and recompile it.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
xinrong-meng commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1247433538 CC @ueshin @zhengruifeng @itholic -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971408657 ## core/src/main/scala/org/apache/spark/util/collection/Utils.scala: ## @@ -62,4 +63,30 @@ private[spark] object Utils { */ def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] = if (input.forall(_.isDefined)) Some(input.flatten) else None + + /** + * Same function as `keys.zip(values).toMap`, but has perf gain. + */ + def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = { +val builder = immutable.Map.newBuilder[K, V] +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)] +} +builder.result() + } + + /** + * Same function as `keys.zip(values).toMap.asJava`, but has perf gain. + */ + def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = { +val map = new java.util.HashMap[K, V]() +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + map.put(keyIter.next(), valueIter.next()) +} +map Review Comment: Wrap to Collections.unmodifiableMap? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc
HyukjinKwon closed pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc URL: https://github.com/apache/spark/pull/37882 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on a diff in pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
xinrong-meng commented on code in PR #37888: URL: https://github.com/apache/spark/pull/37888#discussion_r971408190 ## python/pyspark/pandas/tests/test_spark_functions.py: ## @@ -20,25 +20,12 @@ from pyspark.pandas.spark import functions as SF from pyspark.pandas.utils import spark_column_equals from pyspark.sql import functions as F -from pyspark.sql.types import ( -ByteType, -FloatType, -IntegerType, -LongType, -) from pyspark.testing.pandasutils import PandasOnSparkTestCase class SparkFunctionsTests(PandasOnSparkTestCase): -def test_lit(self): -self.assertTrue(spark_column_equals(SF.lit(np.int64(1)), F.lit(1).astype(LongType( -self.assertTrue(spark_column_equals(SF.lit(np.int32(1)), F.lit(1).astype(IntegerType( -self.assertTrue(spark_column_equals(SF.lit(np.int8(1)), F.lit(1).astype(ByteType( -self.assertTrue(spark_column_equals(SF.lit(np.byte(1)), F.lit(1).astype(ByteType( -self.assertTrue( -spark_column_equals(SF.lit(np.float32(1)), F.lit(float(1)).astype(FloatType())) -) -self.assertTrue(spark_column_equals(SF.lit(1), F.lit(1))) +def test_repeat(self): Review Comment: This change is irrelevant to the PR proposal but is made to keep the `test_spark_functions` file. We shall complete tests for `pyspark/pandas/spark/functions` here later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc
HyukjinKwon closed pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc URL: https://github.com/apache/spark/pull/37883 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
LuciferYang commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971408078 ## core/src/main/scala/org/apache/spark/util/collection/Utils.scala: ## @@ -62,4 +63,30 @@ private[spark] object Utils { */ def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] = if (input.forall(_.isDefined)) Some(input.flatten) else None + + /** + * Same function as `keys.zip(values).toMap`, but has perf gain. + */ + def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = { +val builder = immutable.Map.newBuilder[K, V] +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)] +} +builder.result() + } + + /** + * Same function as `keys.zip(values).toMap.asJava`, but has perf gain. + */ + def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = { +val map = new java.util.HashMap[K, V]() +val keyIter = keys.iterator +val valueIter = values.iterator +while (keyIter.hasNext && valueIter.hasNext) { + map.put(keyIter.next(), valueIter.next()) +} +map Review Comment: this method return a Java Map, how to make it immutable... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc
HyukjinKwon commented on PR #37882: URL: https://github.com/apache/spark/pull/37882#issuecomment-1247431419 Merged to branch-3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc
HyukjinKwon commented on PR #37883: URL: https://github.com/apache/spark/pull/37883#issuecomment-1247431164 Merged to branch-3.2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r971407420 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1378,28 +1378,84 @@ case class Pivot( * A constructor for creating an Unpivot, which will later be converted to an [[Expand]] * during the query analysis. * - * An empty values array will be replaced during analysis with all resolved outputs of child except + * Either ids or values array must be set. The ids array can be empty, + * the values array must not be empty if not None. + * + * A None ids array will be replaced during analysis with all resolved outputs of child except + * the values. This expansion allows to easily select all non-value columns as id columns. + * + * A None values array will be replaced during analysis with all resolved outputs of child except Review Comment: Ah I see, let's add an assert here to guarantee this assumption. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r971407082 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1378,28 +1378,84 @@ case class Pivot( * A constructor for creating an Unpivot, which will later be converted to an [[Expand]] * during the query analysis. * - * An empty values array will be replaced during analysis with all resolved outputs of child except + * Either ids or values array must be set. The ids array can be empty, + * the values array must not be empty if not None. + * + * A None ids array will be replaced during analysis with all resolved outputs of child except + * the values. This expansion allows to easily select all non-value columns as id columns. + * + * A None values array will be replaced during analysis with all resolved outputs of child except * the ids. This expansion allows to easily unpivot all non-id columns. * * @see `org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveUnpivot` * - * The type of the value column is derived from all value columns during analysis once all values - * are resolved. All values' types have to be compatible, otherwise the result value column cannot - * be assigned the individual values and an AnalysisException is thrown. + * Multiple columns can be unpivoted in one row by providing multiple value column names + * and the same number of unpivot value expressions: + * {{{ + * // one-dimensional value columns + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * Seq("val1"), + * Seq("val2") + * )), + * None, + * "var", + * Seq("val") + * ) + * + * // two-dimensional value columns + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * Seq("val1.1", "val1.2"), + * Seq("val2.1", "val2.2")) + * ), + * None, + * "var", + * Seq("val1", "val2") + * ) + * }}} + * + * The variable column will contain the name of the unpivot value while the value columns contain + * the unpivot values. Multi-dimensional unpivot values can be given `aliases`: + * }}} + * // two-dimensional value columns with aliases + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * Seq("val1.1", "val1.2"), + * Seq("val2.1", "val2.2")) + * ), + * Some(Seq( + * Some("val1"), + * Some("val2") + * )), + * "var", + * Seq("val1", "val2") + * ) + * }}} + * + * All "value" columns must share a least common data type. Unless they are the same data type, + * all "value" columns are cast to the nearest common data type. For instance, + * types `IntegerType` and `LongType` are cast to `LongType`, while `IntegerType` and `StringType` + * do not have a common data type and `unpivot` fails with an `AnalysisException`. * * @see `org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.UnpivotCoercion` * * @param idsId columns * @param values Value columns to unpivot + * @param aliasesOptional aliases for values * @param variableColumnName Name of the variable column - * @param valueColumnNameName of the value column + * @param valueColumnNames Names of the value columns * @param child Child operator */ case class Unpivot( -ids: Seq[NamedExpression], -values: Seq[NamedExpression], +ids: Option[Seq[NamedExpression]], +values: Option[Seq[Seq[NamedExpression]]], +aliases: Option[Seq[Option[String]]], Review Comment: It's in `QueryPlan.mapExpressions`, Spark does not accept expressions in `Tuple2`... 
Another way is to create a `case class UnpivotValueColumns(children: Seq[Expression], alias: Option[String]) extends Expression`, and then use `values: Option[Seq[UnpivotValueColumns]]` here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
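To make the alternative concrete, here is a rough sketch of the shape being proposed, using plain placeholder types instead of Catalyst's real `NamedExpression`/`Expression` hierarchy. Everything except the `UnpivotValueColumns` name and the idea of pairing each group of value expressions with its optional alias is invented for illustration.

```scala
// Stand-ins for Catalyst types, just to show the shape of the data structure.
sealed trait Expr
final case class Col(name: String) extends Expr

// One group of value columns plus its optional alias, held as a single node,
// instead of keeping a Seq[Seq[NamedExpression]] and a parallel
// Option[Seq[Option[String]]] that must stay in sync.
final case class UnpivotValueColumns(children: Seq[Expr], alias: Option[String])

final case class UnpivotSketch(
    ids: Option[Seq[Expr]],
    values: Option[Seq[UnpivotValueColumns]],
    variableColumnName: String,
    valueColumnNames: Seq[String])

object UnpivotSketchExample extends App {
  // Two-dimensional value columns with aliases, mirroring the doc example quoted above.
  val plan = UnpivotSketch(
    ids = Some(Seq(Col("id"))),
    values = Some(Seq(
      UnpivotValueColumns(Seq(Col("val1.1"), Col("val1.2")), Some("val1")),
      UnpivotValueColumns(Seq(Col("val2.1"), Col("val2.2")), Some("val2")))),
    variableColumnName = "var",
    valueColumnNames = Seq("val1", "val2"))
  println(plan)
}
```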
[GitHub] [spark] github-actions[bot] commented on pull request #36766: [SPARK-32184][SQL] Remove inferred predicate if it has InOrCorrelatedExistsSubquery
github-actions[bot] commented on PR #36766: URL: https://github.com/apache/spark/pull/36766#issuecomment-1247429345 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36626: [SPARK-39249][SQL] Improve subexpression elimination for conditional expressions
github-actions[bot] commented on PR #36626: URL: https://github.com/apache/spark/pull/36626#issuecomment-1247429366 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
xinrong-meng commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1247424459 It may be easier to review by commits :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng opened a new pull request, #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
xinrong-meng opened a new pull request, #37888: URL: https://github.com/apache/spark/pull/37888 ### What changes were proposed in this pull request? Consolidate `lit` function with NumPy scalar in sql and pandas module. Major changes: - More accurate types for `lit` in sql with NumPy scalar input - Replace `lit` in pandas with `lit` in sql ### Why are the changes needed? As part of [SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405) for NumPy support in SQL. That provides more precise and consistent Numpy support across PySpark moduels. ### Does this PR introduce _any_ user-facing change? Yes. More accurate types for `lit`. ### How was this patch tested? Unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971397869

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
## @@ -823,12 +834,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {

             // it must also be in the aggregate expressions to be rewritten in the optimization
             // phase.
             if (containsExpr(a.groupingExpressions) && !containsExpr(a.aggregateExpressions)) {
-              failAnalysis("Correlated scalar subqueries in the group by clause " +
-                s"must also be in the aggregate expressions:\n$a")
+              a.failAnalysis(
+                errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+                errorSubClass = "MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY",
+                treeNodes = Seq(a))

Review Comment: Good Q. The way I implemented it, `a` is a `TreeNode`. The overload of `failAnalysis` that I added takes a `Seq[TreeNode[_]]` and assigns their `toString`s to the `treeNode` parameter. I do this in the `failAnalysis` overload itself in order to normalize the expression IDs in the string and keep tests deterministic (the LogicalPlan `canonicalized` method aims to support this as well, but it is not available during analysis since it uses `transformUp`).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #37884: [WIP][SPARK-40427][SQL] Move LIMIT/OFFSET CheckAnalysis error messages to use the new error framework
AmplabJenkins commented on PR #37884: URL: https://github.com/apache/spark/pull/37884#issuecomment-1247415506 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng closed pull request #37777: [WIP][SPARK-40309][PYTHON][PS] Introduce `sql_conf` context manager for `pyspark.sql`
xinrong-meng closed pull request #37777: [WIP][SPARK-40309][PYTHON][PS] Introduce `sql_conf` context manager for `pyspark.sql` URL: https://github.com/apache/spark/pull/37777 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
gengliangwang commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971390448

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
## @@ -823,12 +834,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {

             // it must also be in the aggregate expressions to be rewritten in the optimization
             // phase.
             if (containsExpr(a.groupingExpressions) && !containsExpr(a.aggregateExpressions)) {
-              failAnalysis("Correlated scalar subqueries in the group by clause " +
-                s"must also be in the aggregate expressions:\n$a")
+              a.failAnalysis(
+                errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+                errorSubClass = "MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY",
+                treeNodes = Seq(a))

Review Comment: why do we need to pass `a` itself?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau opened a new pull request, #37887: [SPARK-40360] ALREADY_EXISTS and NOT_FOUND exceptions
srielau opened a new pull request, #37887: URL: https://github.com/apache/spark/pull/37887 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau closed pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error
srielau closed pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error URL: https://github.com/apache/spark/pull/37811 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark
Yikun commented on code in PR #37671: URL: https://github.com/apache/spark/pull/37671#discussion_r971381963

## python/pyspark/pandas/tests/test_dataframe_conversion.py:
## @@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):

             "expected": pd.read_excel(pandas_location, index_col=0),
         }

-    @unittest.skip("openpyxl")
     def test_to_excel(self):

Review Comment: Thanks, good to know!

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output
huaxingao commented on PR #37886: URL: https://github.com/apache/spark/pull/37886#issuecomment-1247393264 cc @cloud-fan @sunchao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rahulbhatia2702 commented on pull request #32397: [WIP][SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode
rahulbhatia2702 commented on PR #32397: URL: https://github.com/apache/spark/pull/32397#issuecomment-1247390380 is there a solution for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971345715 ## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out: ## @@ -317,11 +317,20 @@ SELECT * FROM t1, LATERAL (SELECT c1 + c2 + rand(0) AS c3) struct<> -- !query output org.apache.spark.sql.AnalysisException -Non-deterministic lateral subqueries are not supported when joining with outer relations that produce more than one row -SubqueryAlias __auto_generated_subquery_name -+- Project [(cast((outer(c1#x) + outer(c2#x)) as double) + rand(0)) AS c3#x] - +- OneRowRelation -; line 1 pos 9 +{ + "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY", + "errorSubClass" : "NON_DETERMINISTIC_LATERAL_SUBQUERIES", + "messageParameters" : { +"treeNode" : ": !LateralJoin lateral-subquery#5 [c1#6 && c2#7], Inner\n: +- SubqueryAlias __auto_generated_subquery_name\n: +- Project [(cast((outer(c1#2) + outer(c2#3)) as double) + rand(0)) AS c3#4]\n:+- OneRowRelation\n+- SubqueryAlias spark_catalog.default.t1\n +- View (`spark_catalog`.`default`.`t1`, [c1#2,c2#3])\n +- Project [cast(col1#0 as int) AS c1#2, cast(col2#1 as int) AS c2#3]\n +- LocalRelation [col1#0, col2#1]\n" Review Comment: Yes we did discuss this already in the PR threads :) I suggested eliding the plan strings given that we have the query context now, but some folks opined that the strings are helpful for debugging. In the end I struck a compromise by including the strings but covered by a new SQLConf to toggle their presence. This new config can remain enabled by default for now in Apache Spark so folks who are used to the strings for debugging may continue to use them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sadikovi commented on code in PR #37881: URL: https://github.com/apache/spark/pull/37881#discussion_r971343049

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
## @@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {

       // Partition keys are not available in the statistics of the files.
       // `dataColumns` might have partition columns, we need to filter them out.
-      val dataColumnsWithoutPartitionCols = dataColumns.filterNot(partitionColumns.contains)
+      val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- partitionColumns

Review Comment: I am not sure this is correct. Can you elaborate?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
gengliangwang commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971337689 ## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out: ## @@ -317,11 +317,20 @@ SELECT * FROM t1, LATERAL (SELECT c1 + c2 + rand(0) AS c3) struct<> -- !query output org.apache.spark.sql.AnalysisException -Non-deterministic lateral subqueries are not supported when joining with outer relations that produce more than one row -SubqueryAlias __auto_generated_subquery_name -+- Project [(cast((outer(c1#x) + outer(c2#x)) as double) + rand(0)) AS c3#x] - +- OneRowRelation -; line 1 pos 9 +{ + "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY", + "errorSubClass" : "NON_DETERMINISTIC_LATERAL_SUBQUERIES", + "messageParameters" : { +"treeNode" : ": !LateralJoin lateral-subquery#5 [c1#6 && c2#7], Inner\n: +- SubqueryAlias __auto_generated_subquery_name\n: +- Project [(cast((outer(c1#2) + outer(c2#3)) as double) + rand(0)) AS c3#4]\n:+- OneRowRelation\n+- SubqueryAlias spark_catalog.default.t1\n +- View (`spark_catalog`.`default`.`t1`, [c1#2,c2#3])\n +- Project [cast(col1#0 as int) AS c1#2, cast(col2#1 as int) AS c2#3]\n +- LocalRelation [col1#0, col2#1]\n" Review Comment: Given there is already query context in the error message, are we sure about showing the logical plan as well? (This PR contains a lot of threads, I am not sure if this is discussed.) cc @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37885: [SPARK-40428][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend
dongjoon-hyun commented on code in PR #37885: URL: https://github.com/apache/spark/pull/37885#discussion_r971295296

## core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
## @@ -127,6 +127,19 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp

     ThreadUtils.newDaemonSingleThreadScheduledExecutor("cleanup-decommission-execs")
   }

+  private var _shutdownHookRef: AnyRef = _
+
+  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
+    ShutdownHookManager.DEFAULT_SHUTDOWN_PRIORITY) { () =>

Review Comment: nit. indentation?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
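As a general illustration of the pattern under review, the sketch below registers and removes a JVM shutdown hook with Scala's standard `sys.addShutdownHook`. Spark's internal `ShutdownHookManager` used in the PR additionally supports priorities, which this simplified stand-in omits; the printed message and object name are invented.

```scala
object ShutdownHookSketch {
  def main(args: Array[String]): Unit = {
    // Register a hook that runs when the JVM shuts down; in the PR above the
    // analogous hook would stop the scheduler backend so cluster resources
    // are released even on an abnormal driver exit.
    val hookRef = sys.addShutdownHook {
      println("releasing cluster resources before JVM exit")
    }

    // A backend that shuts down cleanly through the normal stop() path can
    // remove the hook so the cleanup does not run twice.
    hookRef.remove()
  }
}
```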
[GitHub] [spark] huaxingao opened a new pull request, #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output
huaxingao opened a new pull request, #37886: URL: https://github.com/apache/spark/pull/37886 ### What changes were proposed in this pull request? Only set `KeyGroupedPartitioning` when the referenced column is in the output ### Why are the changes needed? bug fixing ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on a diff in pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error
srielau commented on code in PR #37811: URL: https://github.com/apache/spark/pull/37811#discussion_r971174244 ## core/src/main/resources/error/error-classes.json: ## @@ -400,6 +418,21 @@ ], "sqlState" : "42000" }, + "PARTITIONS_ALREADY_EXIST" : { +"message" : [ + "Cannot ADD or RENAME TO partition(s) in table because they already exist.", + "Choose a different name, drop the existing partition, or add the IF NOT EXISTS clause to tolerate a pre-existing partition." +], +"sqlState" : "42000" + }, + "PARTITIONS_NOT_FOUND" : { +"message" : [ + "The partition(s) cannot be found in table .", Review Comment: ```suggestion "The partition(s) cannot be found in table .", ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk opened a new pull request, #37885: [SPARK-40428][K8S][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend
holdenk opened a new pull request, #37885: URL: https://github.com/apache/spark/pull/37885 ### What changes were proposed in this pull request? Add a shutdown hook in the CoarseGrainedSchedulerBackend ### Why are the changes needed? Sometimes if the driver shuts down abnormally resources may be left dangling. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on a diff in pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error
srielau commented on code in PR #37811: URL: https://github.com/apache/spark/pull/37811#discussion_r971132688 ## docs/sql-performance-tuning.md: ## @@ -27,9 +27,9 @@ turning on some experimental options. ## Caching Data In Memory -Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("table_name")` or `dataFrame.cache()`. Then Spark SQL will scan only required columns and will automatically tune compression to minimize -memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` or `dataFrame.unpersist()` to remove the table from memory. +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("table_name")` or `dataFrame.unpersist()` to remove the table from memory. Review Comment: ```suggestion memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` or `dataFrame.unpersist()` to remove the table from memory. ``` ## R/pkg/tests/fulltests/test_sparkSQL.R: ## @@ -662,8 +662,8 @@ test_that("test tableNames and tables", { tables <- listTables() expect_equal(count(tables), count + 1) expect_equal(count(tables()), count(tables)) - expect_true("tableName" %in% colnames(tables())) - expect_true(all(c("tableName", "namespace", "isTemporary") %in% colnames(tables( + expect_true("table_name" %in% colnames(tables())) + expect_true(all(c("table_name", "namespace", "isTemporary") %in% colnames(tables( Review Comment: ```suggestion expect_true("tableName" %in% colnames(tables())) expect_true(all(c("tableName", "namespace", "isTemporary") %in% colnames(tables( ``` ## R/pkg/R/catalog.R: ## @@ -291,7 +291,7 @@ dropTempView <- function(viewName) { #' @note tables since 1.4.0 tables <- function(databaseName = NULL) { # rename column to match previous output schema - withColumnRenamed(listTables(databaseName), "name", "tableName") + withColumnRenamed(listTables(databaseName), "name", "table_name") Review Comment: ```suggestion withColumnRenamed(listTables(databaseName), "name", "tableName") ``` ## docs/sql-performance-tuning.md: ## @@ -27,9 +27,9 @@ turning on some experimental options. ## Caching Data In Memory -Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("table_name")` or `dataFrame.cache()`. Review Comment: ```suggestion Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. ``` ## sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala: ## @@ -136,7 +136,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext { "SELECT isTemporary, tableName from tables WHERE tableName = 'listtablessuitetable'") .collect().toSeq == Row(true, "listtablessuitetable") :: Nil) assert( - sqlContext.tables().filter("tableName = 'tables'").select("tableName", "isTemporary") + sqlContext.tables().filter("tableName = 'tables'").select("table_name", "isTemporary") Review Comment: ```suggestion sqlContext.tables().filter("tableName = 'tables'").select("tableName", "isTemporary") ``` ## docs/structured-streaming-programming-guide.md: ## @@ -2170,7 +2170,7 @@ Hence, use it with caution. 
{% highlight scala %} writeStream .format("memory") -.queryName("tableName") +.queryName("table_name") Review Comment: ```suggestion .queryName("tableName") ``` ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala: ## @@ -716,7 +716,7 @@ case class ShowTableExtended( object ShowTableExtended { def getOutputAttrs: Seq[Attribute] = Seq( AttributeReference("namespace", StringType, nullable = false)(), -AttributeReference("tableName", StringType, nullable = false)(), +AttributeReference("table_name", StringType, nullable = false)(), Review Comment: ```suggestion AttributeReference("tableName", StringType, nullable = false)(), ``` ## sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala: ## @@ -123,7 +123,7 @@ class SQLContextSuite extends SparkFunSuite with SharedSparkContext { val expectedSchema = StructType( StructField("namespace", StringType, false) :: -StructField("tableName", StringType, false) :: +StructField("table_name", StringType, false) :: Review Comment: ```suggestion StructField("tableName", StringType, false) :: ```
[GitHub] [spark] dtenedor opened a new pull request, #37884: [WIP][SPARK-40427][SQL] Move LIMIT/OFFSET CheckAnalysis error messages to use the new error framework
dtenedor opened a new pull request, #37884: URL: https://github.com/apache/spark/pull/37884 ### What changes were proposed in this pull request? Move LIMIT/OFFSET CheckAnalysis error messages to use the new error framework. This will help improve the usability of Apache Spark as a product, and link against documentation. ### Why are the changes needed? Error messages related to SQL query analysis in Apache Spark need some work to make them more descriptive and actionable. ### Does this PR introduce _any_ user-facing change? Yes, error messages change. ### How was this patch tested? Unit test and query test coverage shows the updates in error messages received. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
srowen commented on PR #18748: URL: https://github.com/apache/spark/pull/18748#issuecomment-1247113606 What do you mean? It's in the API: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALSModel.html#pyspark.ml.recommendation.ALSModel.recommendForUserSubset https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/recommendation/ALSModel.html#recommendForUserSubset-org.apache.spark.sql.Dataset-int- -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
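For reference, a small end-to-end sketch of calling the API in question. The data, column names, and session setup are invented, but `recommendForUserSubset(Dataset, Int)` is the method documented at the links above.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object RecommendSubsetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("als-subset").getOrCreate()
    import spark.implicits._

    // Toy ratings; real data would come from elsewhere.
    val ratings = Seq(
      (0, 0, 4.0f), (0, 1, 2.0f),
      (1, 0, 3.0f), (1, 2, 5.0f),
      (2, 1, 1.0f), (2, 2, 4.0f)
    ).toDF("user", "item", "rating")

    val model = new ALS()
      .setUserCol("user")
      .setItemCol("item")
      .setRatingCol("rating")
      .fit(ratings)

    // Recommend the top 2 items only for users 0 and 2.
    val users = Seq(0, 2).toDF("user")
    model.recommendForUserSubset(users, 2).show(truncate = false)

    spark.stop()
  }
}
```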
[GitHub] [spark] dtenedor commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on PR #37840: URL: https://github.com/apache/spark/pull/37840#issuecomment-1247100836 Hi @MaxGekk I responded to your comments, please take another look at this when you have time. @gengliangwang @allisonwang-db @srielau FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971116242

## sql/core/src/test/resources/sql-tests/results/subquery/negative-cases/invalid-correlation.sql.out:
## @@ -105,14 +131,20 @@ WHERE t1a IN (SELECT t2a

 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses:
-Aggregate [min(outer(t2a#x)) AS min(outer(t2.t2a))#x]
-+- SubqueryAlias t3
-   +- View (`t3`, [t3a#x,t3b#x,t3c#x])
-      +- Project [cast(t3a#x as int) AS t3a#x, cast(t3b#x as int) AS t3b#x, cast(t3c#x as int) AS t3c#x]
-         +- Project [t3a#x, t3b#x, t3c#x]
-            +- SubqueryAlias t3
-               +- LocalRelation [t3a#x, t3b#x, t3c#x]
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "UNSUPPORTED_CORRELATED_REFERENCE",
+  "messageParameters" : {
+    "planString" : ": Aggregate [min(outer(t2a#6)) AS min(outer(t2.t2a))#7]\n+- SubqueryAlias t3\n   +- View (`t3`, [t3a#3,t3b#4,t3c#5])\n      +- Project [cast(t3a#0 as int) AS t3a#3, cast(t3b#1 as int) AS t3b#4, cast(t3c#2 as int) AS t3c#5]\n         +- Project [t3a#0, t3b#1, t3c#2]\n            +- SubqueryAlias t3\n               +- LocalRelation [t3a#0, t3b#1, t3c#2]\n"

Review Comment: Good Q. I found they are not stable when running the tests outside of my local machine, therefore to prevent the tests from becoming flaky, I found it necessary to normalize the expression IDs. I added code in the `failAnalysis` overload that takes `TreeNode`s to do that, iterating through the trees to reset the expression IDs. This should be more robust than regex-replacing the string. We do have the existing `canonicalized` method on logical plans and expressions, but that uses `transformUp` which is not allowed in the analyzer, so I ended up doing it separately here for this purpose.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
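As a simplified illustration of what normalizing the expression IDs means for golden-file stability, the sketch below does it with a regex over the rendered plan string. The PR itself resets the IDs by walking the `TreeNode`s instead, as the comment above explains; the object and method names here are made up.

```scala
object NormalizeExprIdsSketch {
  // Replace expression IDs such as "#6" or "t2a#123" with a fixed "#x" so that
  // rendered plan strings compare deterministically across test runs.
  def normalize(planString: String): String =
    planString.replaceAll("#\\d+", "#x")

  def main(args: Array[String]): Unit = {
    val before = "Aggregate [min(outer(t2a#6)) AS min(outer(t2.t2a))#7]"
    println(normalize(before)) // Aggregate [min(outer(t2a#x)) AS min(outer(t2.t2a))#x]
  }
}
```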
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971113842

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ##

```diff
@@ -1563,10 +1563,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
     new AnalysisException(s"'$operation' does not support partitioning")
   }
 
-  def mixedRefsInAggFunc(funcStr: String): Throwable = {
-    val msg = "Found an aggregate function in a correlated predicate that has both " +
-      "outer and local references, which is not supported: " + funcStr
-    new AnalysisException(msg)
+  def mixedRefsInAggFunc(funcStr: String, origin: Origin): Throwable = {
+    new AnalysisException(
+      errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+      errorSubClass = "AGGREGATE_FUNCTION_MIXED_OUTER_LOCAL_REFERENCES",
+      origin = origin,
+      messageParameters = Map("function" -> funcStr))
```

Review Comment: Good question. It looks like this is a `sql` string generated from a predicate, which may contain multiple functions. It renders from the error-classes.json file like this:
```
Found an aggregate function in a correlated predicate that has both outer and local references, which is not supported:
```
It might be better to leave it unquoted, since some parts of the predicate may contain their own double quotes, e.g. literal string values. I will leave it alone for now; Serge, please comment if you want it to work differently and I can update it.
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971108777

## sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out: ##

```diff
@@ -100,12 +100,17 @@ WHERE udf(t1.v) >= (SELECT min(udf(t2.v))
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Correlated column is not allowed in predicate (CAST(udf(cast(k as string)) AS STRING) = CAST(udf(cast(outer(k#x) as string)) AS STRING)):
-Aggregate [cast(udf(cast(max(cast(udf(cast(v#x as string)) as int)) as string)) as int) AS udf(max(udf(v)))#x]
-+- Filter (cast(udf(cast(k#x as string)) as string) = cast(udf(cast(outer(k#x) as string)) as string))
- +- SubqueryAlias t2
- +- View (`t2`, [k#x,v#x])
- +- Project [cast(k#x as string) AS k#x, cast(v#x as int) AS v#x]
-+- Project [k#x, v#x]
- +- SubqueryAlias t2
- +- LocalRelation [k#x, v#x]
+{
+ "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+ "errorSubClass" : "CORRELATED_COLUMN_IS_NOT_ALLOWED_IN_PREDICATE",
+ "messageParameters" : {
+"planString" : ": (cast(udf(cast(k#0 as string)) as string) = cast(udf(cast(outer(k#1) as string)) as string))"
```

Review Comment: Good point, this comes from expressions rather than plans. Renamed all to `treeNode` instead of `planString`.
[GitHub] [spark] dtenedor commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on PR #37840: URL: https://github.com/apache/spark/pull/37840#issuecomment-1247084014

> Hi @dtenedor, I have just merged the bug fix https://github.com/apache/spark/pull/37861. The query context should be set correctly in the AnalysisException. I believe the test cases in this PR need to be updated, so please try updating the code base to the latest master.

@gengliangwang thanks, this is done. The new query contexts are visible in this PR as of the latest commit; they appear more accurate now.
[GitHub] [spark] sunchao commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sunchao commented on PR #37881: URL: https://github.com/apache/spark/pull/37881#issuecomment-1247076334
@viirya please see test cases added in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). cc @sadikovi @cloud-fan @HyukjinKwon too.
[GitHub] [spark] ucas010 commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
ucas010 commented on PR #18748: URL: https://github.com/apache/spark/pull/18748#issuecomment-1247071158
Why can `recommendForUserSubset` not be used?
[GitHub] [spark] dongjoon-hyun commented on pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc
dongjoon-hyun commented on PR #37882: URL: https://github.com/apache/spark/pull/37882#issuecomment-1247068156
cc @wangyum since he is the release manager for Apache Spark 3.3.1 and `v3.3.1-rc1` has already been created.
[GitHub] [spark] sarutak opened a new pull request, #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc
sarutak opened a new pull request, #37883: URL: https://github.com/apache/spark/pull/37883

### What changes were proposed in this pull request?

This PR removes `TimestampNTZ` from the doc about `TimeWindow` and `SessionWindow`.

### Why are the changes needed?

As we discussed, it's better to hide `TimestampNTZ` from the doc. https://github.com/apache/spark/pull/35313#issuecomment-1185192162

### Does this PR introduce _any_ user-facing change?

The document will be changed, but there is no compatibility problem.

### How was this patch tested?

Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in the `doc` directory, then confirmed the generated HTML.
[GitHub] [spark] sarutak opened a new pull request, #37882: [SPARK-38017][FOLLOWUP] Hide TimestampNTZ in the doc
sarutak opened a new pull request, #37882: URL: https://github.com/apache/spark/pull/37882

### What changes were proposed in this pull request?

This PR removes `TimestampNTZ` from the doc about `TimeWindow` and `SessionWindow`.

### Why are the changes needed?

As we discussed, it's better to hide `TimestampNTZ` from the doc. https://github.com/apache/spark/pull/35313#issuecomment-1185192162

### Does this PR introduce _any_ user-facing change?

The document will be changed, but there is no compatibility problem.

### How was this patch tested?

Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in the `doc` directory, then confirmed the generated HTML.
[GitHub] [spark] sunchao opened a new pull request, #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sunchao opened a new pull request, #37881: URL: https://github.com/apache/spark/pull/37881

### What changes were proposed in this pull request?

Currently, in the Parquet V1 read path, Spark will push down data filters even if they have no reference in the Parquet read schema. This can cause correctness issues as described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833).

The root cause, it seems, is that in the V1 path we first use `AttributeReference` equality to compute the set of data columns with the partition columns removed, and then use `AttributeSet` equality to select the filters that reference only data columns. These two steps are inconsistent. Take the following scenario as an example (a sketch of the mismatch follows below):

- data columns: `[COL, a]`
- partition column: `[col]`
- filter: `col > 10`

With `AttributeReference` equality, `COL` is not equal to `col`, so the resulting data column set is still `[COL, a]`. However, when calculating which filters reference only data columns, `COL` **is considered equal** to `col`, so the filter `col > 10`, when checked against `[COL, a]`, is treated as referencing a data column and is not removed.

### Why are the changes needed?

This fixes the correctness bug described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

There are existing test cases for this issue, so this PR just reuses them.
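Here is a minimal sketch of the two-step inconsistency described above, modeled with plain string matching; it deliberately avoids the real Catalyst classes, and the column names come from the scenario listed in the PR description:

```scala
// Columns present in the Parquet files vs. the column coming from the partition layout.
val dataColumns      = Seq("COL", "a")
val partitionColumns = Seq("col")
// Reference of the data filter `col > 10`.
val filterReferences = Seq("col")

// Step 1, modeled as a case-SENSITIVE comparison (the AttributeReference-style check):
// "COL" != "col", so the partition column is not subtracted from the data columns.
val dataOnlyColumns = dataColumns.filterNot(partitionColumns.contains) // Seq("COL", "a")

// Step 2, modeled as a case-INSENSITIVE lookup (the step where COL is considered equal to col):
// the filter appears to reference a data column, so it would be pushed down to Parquet
// even though the files contain no column literally named "col".
val wouldPushDown =
  filterReferences.forall(ref => dataOnlyColumns.exists(_.equalsIgnoreCase(ref))) // true
```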
[GitHub] [spark] MaxGekk commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
MaxGekk commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r971046318

## sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out: ##

```diff
@@ -100,12 +100,17 @@ WHERE udf(t1.v) >= (SELECT min(udf(t2.v))
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Correlated column is not allowed in predicate (CAST(udf(cast(k as string)) AS STRING) = CAST(udf(cast(outer(k#x) as string)) AS STRING)):
-Aggregate [cast(udf(cast(max(cast(udf(cast(v#x as string)) as int)) as string)) as int) AS udf(max(udf(v)))#x]
-+- Filter (cast(udf(cast(k#x as string)) as string) = cast(udf(cast(outer(k#x) as string)) as string))
- +- SubqueryAlias t2
- +- View (`t2`, [k#x,v#x])
- +- Project [cast(k#x as string) AS k#x, cast(v#x as int) AS v#x]
-+- Project [k#x, v#x]
- +- SubqueryAlias t2
- +- LocalRelation [k#x, v#x]
+{
+ "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+ "errorSubClass" : "CORRELATED_COLUMN_IS_NOT_ALLOWED_IN_PREDICATE",
+ "messageParameters" : {
+"planString" : ": (cast(udf(cast(k#0 as string)) as string) = cast(udf(cast(outer(k#1) as string)) as string))"
```

Review Comment: Is it a plan string? Let's rename it.

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ##

```diff
@@ -1563,10 +1563,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
     new AnalysisException(s"'$operation' does not support partitioning")
   }
 
-  def mixedRefsInAggFunc(funcStr: String): Throwable = {
-    val msg = "Found an aggregate function in a correlated predicate that has both " +
-      "outer and local references, which is not supported: " + funcStr
-    new AnalysisException(msg)
+  def mixedRefsInAggFunc(funcStr: String, origin: Origin): Throwable = {
+    new AnalysisException(
+      errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+      errorSubClass = "AGGREGATE_FUNCTION_MIXED_OUTER_LOCAL_REFERENCES",
+      origin = origin,
+      messageParameters = Map("function" -> funcStr))
```

Review Comment: Should we quote the function? By `toSQLExpr` if it is an expression, or maybe by `toSQLId`? cc @srielau

## sql/core/src/test/resources/sql-tests/results/subquery/negative-cases/invalid-correlation.sql.out: ##

```diff
@@ -105,14 +131,20 @@ WHERE t1a IN (SELECT t2a
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses:
-Aggregate [min(outer(t2a#x)) AS min(outer(t2.t2a))#x]
-+- SubqueryAlias t3
- +- View (`t3`, [t3a#x,t3b#x,t3c#x])
- +- Project [cast(t3a#x as int) AS t3a#x, cast(t3b#x as int) AS t3b#x, cast(t3c#x as int) AS t3c#x]
- +- Project [t3a#x, t3b#x, t3c#x]
-+- SubqueryAlias t3
- +- LocalRelation [t3a#x, t3b#x, t3c#x]
+{
+ "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+ "errorSubClass" : "UNSUPPORTED_CORRELATED_REFERENCE",
+ "messageParameters" : {
+"planString" : ": Aggregate [min(outer(t2a#6)) AS min(outer(t2.t2a))#7]\n+- SubqueryAlias t3\n +- View (`t3`, [t3a#3,t3b#4,t3c#5])\n +- Project [cast(t3a#0 as int) AS t3a#3, cast(t3b#1 as int) AS t3b#4, cast(t3c#2 as int) AS t3c#5]\n +- Project [t3a#0, t3b#1, t3c#2]\n+- SubqueryAlias t3\n +- LocalRelation [t3a#0, t3b#1, t3c#2]\n"
```

Review Comment:

> t2a#6

Are the IDs stable? Should we replace them like https://github.com/apache/spark/blob/db5aea60e4b22449d529f9e866366cfdc784b140/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala#L94
[GitHub] [spark] dongjoon-hyun closed pull request #37877: [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage
dongjoon-hyun closed pull request #37877: [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage URL: https://github.com/apache/spark/pull/37877
[GitHub] [spark] dongjoon-hyun commented on pull request #37877: [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage
dongjoon-hyun commented on PR #37877: URL: https://github.com/apache/spark/pull/37877#issuecomment-1247016463
Thank you, @viirya and @martin-g ! Merged to master/3.3.
[GitHub] [spark] mridulm commented on a diff in pull request #37853: [SPARK-40404][DOCS] Add precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`
mridulm commented on code in PR #37853: URL: https://github.com/apache/spark/pull/37853#discussion_r971042246

## docs/running-on-yarn.md: ##

```diff
@@ -852,8 +852,8 @@ The following extra configuration options are available when the shuffle service
   spark.shuffle.service.db.backend
   LEVELDB
-To specify the kind of disk-base store used in shuffle service state store, supports `LEVELDB` and `ROCKSDB` now
-and `LEVELDB` as default value.
+When work-preserving restart is enabled in YARN, this is used to specify the disk-base store used
+in shuffle service state store, supports `LEVELDB` and `ROCKSDB` with `LEVELDB` as default value.
 The original data store in `LevelDB/RocksDB` will not be automatically convert to another kind of storage now.
```

Review Comment: `convert` -> `converted` ?
[GitHub] [spark] mridulm commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api
mridulm commented on code in PR #37843: URL: https://github.com/apache/spark/pull/37843#discussion_r971037047

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2PredicateSuite.scala: ##

```diff
@@ -287,7 +287,7 @@ class V2PredicateSuite extends SparkFunSuite {
       new Predicate("=", Array[Expression](ref("a"), LiteralValue(1, IntegerType))),
       new Predicate("=", Array[Expression](ref("b"), LiteralValue(1, IntegerType))))
     assert(predicate1.equals(predicate2))
-    assert(predicate1.references.map(_.describe()).toSeq == Seq("a", "b"))
+    assert(predicate1.references.map(_.describe()).toSeq.sorted == Seq("a", "b"))
     assert(predicate1.describe.equals("(a = 1) OR (b = 1)"))
```

Review Comment: I might have missed @huaxingao's comment - if she is fine with it, looks good to me :-)
[GitHub] [spark] mridulm commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
mridulm commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971031243

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala: ##

```diff
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
     val keys = map.keyArray.asInstanceOf[GenericArrayData].array
     val values = map.valueArray.asInstanceOf[GenericArrayData].array
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toScalaMap(keys: scala.collection.Seq[Any], values: scala.collection.Seq[Any]): Map[Any, Any] = {
-    keys.zip(values).toMap
+    Utils.toMap(keys, values)
   }

   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] = {
-    import scala.collection.JavaConverters._
-    keys.zip(values).toMap.asJava
+    Utils.toJavaMap(keys, values)
   }
```

Review Comment: Do we need this method anymore? Why not replace it with `Utils.toJavaMap` entirely (in `JavaTypeInference`)? Any issues with that?
[GitHub] [spark] mridulm commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern
mridulm commented on code in PR #37876: URL: https://github.com/apache/spark/pull/37876#discussion_r971028877

## core/src/main/scala/org/apache/spark/util/collection/Utils.scala: ##

```diff
@@ -62,4 +63,30 @@ private[spark] object Utils {
    */
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
     if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+    val builder = immutable.Map.newBuilder[K, V]
+    val keyIter = keys.iterator
+    val valueIter = values.iterator
+    while (keyIter.hasNext && valueIter.hasNext) {
+      builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+    }
+    builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = {
+    val map = new java.util.HashMap[K, V]()
+    val keyIter = keys.iterator
+    val valueIter = values.iterator
+    while (keyIter.hasNext && valueIter.hasNext) {
+      map.put(keyIter.next(), valueIter.next())
+    }
+    map
```

Review Comment: Make this immutable? I believe the original code would have resulted in an immutable map?
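One way to address the immutability question above (a hedged sketch, not the change actually made in the PR) is to wrap the freshly built `HashMap` so callers cannot mutate it, matching the behaviour of the original `keys.zip(values).toMap.asJava`:

```scala
import java.util.Collections

def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = {
  val map = new java.util.HashMap[K, V]()
  val keyIter = keys.iterator
  val valueIter = values.iterator
  while (keyIter.hasNext && valueIter.hasNext) {
    map.put(keyIter.next(), valueIter.next())
  }
  // The single-pass fill is kept; the wrapper only rejects later mutation
  // (put/remove throw UnsupportedOperationException), and reads stay O(1).
  Collections.unmodifiableMap(map)
}
```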