[GitHub] [spark] ucas010 commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel

2022-09-14 Thread GitBox


ucas010 commented on PR #18748:
URL: https://github.com/apache/spark/pull/18748#issuecomment-1247586071

   java: cannot find symbol
     symbol:   method recommendForUserSubset(org.apache.spark.sql.Dataset,int)
     location: variable model of type org.apache.spark.ml.recommendation.ALSModel
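   
   For reference, `recommendForUserSubset` exists on `ALSModel` only from Spark 2.3.0 onwards 
(SPARK-20679, the JIRA this PR implements, added it), so this "cannot find symbol" error 
usually means an older spark-mllib artifact is on the compile classpath. A minimal sketch 
(not from this thread) of the intended call, assuming a Spark >= 2.3 dependency and an 
existing `ratings` DataFrame with `userId`/`movieId`/`rating` columns:
   
   ```scala
   import org.apache.spark.ml.recommendation.{ALS, ALSModel}
   
   val als = new ALS()
     .setUserCol("userId")
     .setItemCol("movieId")
     .setRatingCol("rating")
   val model: ALSModel = als.fit(ratings)
   
   // Recommend the top 10 items for a subset of users only (SPARK-20679).
   val users = ratings.select("userId").distinct().limit(100)
   val topItems = model.recommendForUserSubset(users, 10)
   topItems.show(false)
   ```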





[GitHub] [spark] AmplabJenkins commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2022-09-14 Thread GitBox


AmplabJenkins commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1247568597

   Can one of the admins verify this patch?





[GitHub] [spark] HeartSaVioR opened a new pull request, #37894: [DRAFT][DO-NOT-MERGE][SPARK-40435][SS][PYTHON] Add test suites for applyInPandasWithState in PySpark

2022-09-14 Thread GitBox


HeartSaVioR opened a new pull request, #37894:
URL: https://github.com/apache/spark/pull/37894

   ...TBD...
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] HeartSaVioR opened a new pull request, #37893: [DRAFT][DO-NOT-MERGE][SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-14 Thread GitBox


HeartSaVioR opened a new pull request, #37893:
URL: https://github.com/apache/spark/pull/37893

   ...TBD...
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. We are exposing new public API in PySpark which performs arbitrary 
stateful processing.
   
   ### How was this patch tested?
   
   





[GitHub] [spark] mridulm commented on pull request #37885: [SPARK-40428][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend

2022-09-14 Thread GitBox


mridulm commented on PR #37885:
URL: https://github.com/apache/spark/pull/37885#issuecomment-1247553777

   `SparkContext.stop` will take care of this - and that is invoked from a 
shutdown hook already.
   
   `SparkContext.stop` -> `_dagScheduler.stop` -> `taskScheduler.stop` -> `backend.stop`
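   
   For illustration only (not part of this PR), a minimal sketch of the equivalent 
user-level hook, assuming a local-mode context; `SparkContext` already registers such a 
hook internally, so the cascade above runs on JVM shutdown without extra code:
   
   ```scala
   import org.apache.spark.{SparkConf, SparkContext}
   
   val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
   
   // On JVM shutdown, stopping the context cascades through
   // DAGScheduler -> TaskScheduler -> SchedulerBackend, as described above.
   sys.addShutdownHook {
     sc.stop()
   }
   ```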





[GitHub] [spark] zhengruifeng commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


zhengruifeng commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247546159

   > Actually ExpandingGroupby and RollingGroupby aren't documented on the pandas side either.
   
   That's true. When I checked the missing functions in 
https://issues.apache.org/jira/browse/SPARK-40327, it was quite confusing that there is 
no documentation, and I had to manually check whether each function exists.





[GitHub] [spark] zhengruifeng commented on pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation

2022-09-14 Thread GitBox


zhengruifeng commented on PR #37890:
URL: https://github.com/apache/spark/pull/37890#issuecomment-1247544737

   LGTM2





[GitHub] [spark] HyukjinKwon closed pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation

2022-09-14 Thread GitBox


HyukjinKwon closed pull request #37890: 
[SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and 
Expanding.quantile into PySpark documentation
URL: https://github.com/apache/spark/pull/37890





[GitHub] [spark] HyukjinKwon commented on pull request #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37890:
URL: https://github.com/apache/spark/pull/37890#issuecomment-1247534406

   Merged to master.





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HeartSaVioR commented on code in PR #37889:
URL: https://github.com/apache/spark/pull/37889#discussion_r971486761


##
python/pyspark/sql/streaming/state.py:
##
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import datetime
+import json
+from typing import Tuple, Optional
+
+from pyspark.sql.types import DateType, Row, StructType
+
+__all__ = ["GroupStateImpl", "GroupStateTimeout"]
+
+
+class GroupStateTimeout:
+NoTimeout: str = "NoTimeout"
+ProcessingTimeTimeout: str = "ProcessingTimeTimeout"
+EventTimeTimeout: str = "EventTimeTimeout"
+
+
+class GroupStateImpl:
+NO_TIMESTAMP: int = -1
+
+def __init__(
+self,
+# JVM Constructor
+optionalValue: Row,
+batchProcessingTimeMs: int,
+eventTimeWatermarkMs: int,
+timeoutConf: str,
+hasTimedOut: bool,
+watermarkPresent: bool,
+# JVM internal state.
+defined: bool,
+updated: bool,
+removed: bool,
+timeoutTimestamp: int,
+# Python internal state.
+keyAsUnsafe: bytes,
+valueSchema: StructType,
+) -> None:
+self._keyAsUnsafe = keyAsUnsafe
+self._value = optionalValue
+self._batch_processing_time_ms = batchProcessingTimeMs
+self._event_time_watermark_ms = eventTimeWatermarkMs
+
+assert timeoutConf in [
+GroupStateTimeout.NoTimeout,
+GroupStateTimeout.ProcessingTimeTimeout,
+GroupStateTimeout.EventTimeTimeout,
+]
+self._timeout_conf = timeoutConf
+
+self._has_timed_out = hasTimedOut
+self._watermark_present = watermarkPresent
+
+self._defined = defined
+self._updated = updated
+self._removed = removed
+self._timeout_timestamp = timeoutTimestamp
+# Python internal state.
+self._old_timeout_timestamp = timeoutTimestamp
+
+self._value_schema = valueSchema
+
+@property
+def exists(self) -> bool:
+return self._defined
+
+@property
+def get(self) -> Tuple:
+if self.exists:
+return tuple(self._value)
+else:
+raise ValueError("State is either not defined or has already been 
removed")
+
+@property
+def getOption(self) -> Optional[Tuple]:
+if self.exists:
+return tuple(self._value)
+else:
+return None
+
+@property
+def hasTimedOut(self) -> bool:
+return self._has_timed_out
+
+# NOTE: this function is only available to PySpark implementation due to 
underlying
+# implementation, do not port to Scala implementation!
+@property
+def oldTimeoutTimestamp(self) -> int:
+return self._old_timeout_timestamp
+
+def update(self, newValue: Tuple) -> None:
+if newValue is None:
+raise ValueError("'None' is not a valid state value")
+
+self._value = Row(*newValue)
+self._defined = True
+self._updated = True
+self._removed = False
+
+def remove(self) -> None:
+self._defined = False
+self._updated = False
+self._removed = True
+
+def setTimeoutDuration(self, durationMs: int) -> None:
+if isinstance(durationMs, str):
+# TODO(SPARK-X): Support string representation of durationMs.
+raise ValueError("durationMs should be int but get :%s" % 
type(durationMs))
+
+if self._timeout_conf != GroupStateTimeout.ProcessingTimeTimeout:
+raise RuntimeError(
+"Cannot set timeout duration without enabling processing time 
timeout in "
+"applyInPandasWithState"
+)
+
+if durationMs <= 0:
+raise ValueError("Timeout duration must be positive")
+self._timeout_timestamp = durationMs + self._batch_processing_time_ms
+
+# TODO(SPARK-X): Implement additionalDuration parameter.

Review Comment:
   Just reflected it.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HyukjinKwon commented on code in PR #37889:
URL: https://github.com/apache/spark/pull/37889#discussion_r971486330


##
python/pyspark/sql/streaming/state.py:
##
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import datetime
+import json
+from typing import Tuple, Optional
+
+from pyspark.sql.types import DateType, Row, StructType
+
+__all__ = ["GroupStateImpl", "GroupStateTimeout"]
+
+
+class GroupStateTimeout:
+NoTimeout: str = "NoTimeout"
+ProcessingTimeTimeout: str = "ProcessingTimeTimeout"
+EventTimeTimeout: str = "EventTimeTimeout"
+
+
+class GroupStateImpl:
+NO_TIMESTAMP: int = -1
+
+def __init__(
+self,
+# JVM Constructor
+optionalValue: Row,
+batchProcessingTimeMs: int,
+eventTimeWatermarkMs: int,
+timeoutConf: str,
+hasTimedOut: bool,
+watermarkPresent: bool,
+# JVM internal state.
+defined: bool,
+updated: bool,
+removed: bool,
+timeoutTimestamp: int,
+# Python internal state.
+keyAsUnsafe: bytes,
+valueSchema: StructType,
+) -> None:
+self._keyAsUnsafe = keyAsUnsafe
+self._value = optionalValue
+self._batch_processing_time_ms = batchProcessingTimeMs
+self._event_time_watermark_ms = eventTimeWatermarkMs
+
+assert timeoutConf in [
+GroupStateTimeout.NoTimeout,
+GroupStateTimeout.ProcessingTimeTimeout,
+GroupStateTimeout.EventTimeTimeout,
+]
+self._timeout_conf = timeoutConf
+
+self._has_timed_out = hasTimedOut
+self._watermark_present = watermarkPresent
+
+self._defined = defined
+self._updated = updated
+self._removed = removed
+self._timeout_timestamp = timeoutTimestamp
+# Python internal state.
+self._old_timeout_timestamp = timeoutTimestamp
+
+self._value_schema = valueSchema
+
+@property
+def exists(self) -> bool:
+return self._defined
+
+@property
+def get(self) -> Tuple:
+if self.exists:
+return tuple(self._value)
+else:
+raise ValueError("State is either not defined or has already been 
removed")
+
+@property
+def getOption(self) -> Optional[Tuple]:
+if self.exists:
+return tuple(self._value)
+else:
+return None
+
+@property
+def hasTimedOut(self) -> bool:
+return self._has_timed_out
+
+# NOTE: this function is only available to PySpark implementation due to 
underlying
+# implementation, do not port to Scala implementation!
+@property
+def oldTimeoutTimestamp(self) -> int:
+return self._old_timeout_timestamp
+
+def update(self, newValue: Tuple) -> None:
+if newValue is None:
+raise ValueError("'None' is not a valid state value")
+
+self._value = Row(*newValue)
+self._defined = True
+self._updated = True
+self._removed = False
+
+def remove(self) -> None:
+self._defined = False
+self._updated = False
+self._removed = True
+
+def setTimeoutDuration(self, durationMs: int) -> None:
+if isinstance(durationMs, str):
+# TODO(SPARK-X): Support string representation of durationMs.
+raise ValueError("durationMs should be int but get :%s" % 
type(durationMs))
+
+if self._timeout_conf != GroupStateTimeout.ProcessingTimeTimeout:
+raise RuntimeError(
+"Cannot set timeout duration without enabling processing time 
timeout in "
+"applyInPandasWithState"
+)
+
+if durationMs <= 0:
+raise ValueError("Timeout duration must be positive")
+self._timeout_timestamp = durationMs + self._batch_processing_time_ms
+
+# TODO(SPARK-X): Implement additionalDuration parameter.

Review Comment:
   Filed SPARK-40438




[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HeartSaVioR commented on code in PR #37889:
URL: https://github.com/apache/spark/pull/37889#discussion_r971485439


##
python/pyspark/sql/streaming/state.py:
##
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import datetime
+import json
+from typing import Tuple, Optional
+
+from pyspark.sql.types import DateType, Row, StructType
+
+__all__ = ["GroupStateImpl", "GroupStateTimeout"]
+
+
+class GroupStateTimeout:
+NoTimeout: str = "NoTimeout"
+ProcessingTimeTimeout: str = "ProcessingTimeTimeout"
+EventTimeTimeout: str = "EventTimeTimeout"
+
+
+class GroupStateImpl:
+NO_TIMESTAMP: int = -1
+
+def __init__(
+self,
+# JVM Constructor
+optionalValue: Row,
+batchProcessingTimeMs: int,
+eventTimeWatermarkMs: int,
+timeoutConf: str,
+hasTimedOut: bool,
+watermarkPresent: bool,
+# JVM internal state.
+defined: bool,
+updated: bool,
+removed: bool,
+timeoutTimestamp: int,
+# Python internal state.
+keyAsUnsafe: bytes,
+valueSchema: StructType,
+) -> None:
+self._keyAsUnsafe = keyAsUnsafe
+self._value = optionalValue
+self._batch_processing_time_ms = batchProcessingTimeMs
+self._event_time_watermark_ms = eventTimeWatermarkMs
+
+assert timeoutConf in [
+GroupStateTimeout.NoTimeout,
+GroupStateTimeout.ProcessingTimeTimeout,
+GroupStateTimeout.EventTimeTimeout,
+]
+self._timeout_conf = timeoutConf
+
+self._has_timed_out = hasTimedOut
+self._watermark_present = watermarkPresent
+
+self._defined = defined
+self._updated = updated
+self._removed = removed
+self._timeout_timestamp = timeoutTimestamp
+# Python internal state.
+self._old_timeout_timestamp = timeoutTimestamp
+
+self._value_schema = valueSchema
+
+@property
+def exists(self) -> bool:
+return self._defined
+
+@property
+def get(self) -> Tuple:
+if self.exists:
+return tuple(self._value)
+else:
+raise ValueError("State is either not defined or has already been 
removed")
+
+@property
+def getOption(self) -> Optional[Tuple]:
+if self.exists:
+return tuple(self._value)
+else:
+return None
+
+@property
+def hasTimedOut(self) -> bool:
+return self._has_timed_out
+
+# NOTE: this function is only available to PySpark implementation due to 
underlying
+# implementation, do not port to Scala implementation!
+@property
+def oldTimeoutTimestamp(self) -> int:
+return self._old_timeout_timestamp
+
+def update(self, newValue: Tuple) -> None:
+if newValue is None:
+raise ValueError("'None' is not a valid state value")
+
+self._value = Row(*newValue)
+self._defined = True
+self._updated = True
+self._removed = False
+
+def remove(self) -> None:
+self._defined = False
+self._updated = False
+self._removed = True
+
+def setTimeoutDuration(self, durationMs: int) -> None:
+if isinstance(durationMs, str):
+# TODO(SPARK-X): Support string representation of durationMs.

Review Comment:
   Nice! I'll update them.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HyukjinKwon commented on code in PR #37889:
URL: https://github.com/apache/spark/pull/37889#discussion_r971485152


##
python/pyspark/sql/streaming/state.py:
##
@@ -0,0 +1,192 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import datetime
+import json
+from typing import Tuple, Optional
+
+from pyspark.sql.types import DateType, Row, StructType
+
+__all__ = ["GroupStateImpl", "GroupStateTimeout"]
+
+
+class GroupStateTimeout:
+NoTimeout: str = "NoTimeout"
+ProcessingTimeTimeout: str = "ProcessingTimeTimeout"
+EventTimeTimeout: str = "EventTimeTimeout"
+
+
+class GroupStateImpl:
+NO_TIMESTAMP: int = -1
+
+def __init__(
+self,
+# JVM Constructor
+optionalValue: Row,
+batchProcessingTimeMs: int,
+eventTimeWatermarkMs: int,
+timeoutConf: str,
+hasTimedOut: bool,
+watermarkPresent: bool,
+# JVM internal state.
+defined: bool,
+updated: bool,
+removed: bool,
+timeoutTimestamp: int,
+# Python internal state.
+keyAsUnsafe: bytes,
+valueSchema: StructType,
+) -> None:
+self._keyAsUnsafe = keyAsUnsafe
+self._value = optionalValue
+self._batch_processing_time_ms = batchProcessingTimeMs
+self._event_time_watermark_ms = eventTimeWatermarkMs
+
+assert timeoutConf in [
+GroupStateTimeout.NoTimeout,
+GroupStateTimeout.ProcessingTimeTimeout,
+GroupStateTimeout.EventTimeTimeout,
+]
+self._timeout_conf = timeoutConf
+
+self._has_timed_out = hasTimedOut
+self._watermark_present = watermarkPresent
+
+self._defined = defined
+self._updated = updated
+self._removed = removed
+self._timeout_timestamp = timeoutTimestamp
+# Python internal state.
+self._old_timeout_timestamp = timeoutTimestamp
+
+self._value_schema = valueSchema
+
+@property
+def exists(self) -> bool:
+return self._defined
+
+@property
+def get(self) -> Tuple:
+if self.exists:
+return tuple(self._value)
+else:
+raise ValueError("State is either not defined or has already been 
removed")
+
+@property
+def getOption(self) -> Optional[Tuple]:
+if self.exists:
+return tuple(self._value)
+else:
+return None
+
+@property
+def hasTimedOut(self) -> bool:
+return self._has_timed_out
+
+# NOTE: this function is only available to PySpark implementation due to 
underlying
+# implementation, do not port to Scala implementation!
+@property
+def oldTimeoutTimestamp(self) -> int:
+return self._old_timeout_timestamp
+
+def update(self, newValue: Tuple) -> None:
+if newValue is None:
+raise ValueError("'None' is not a valid state value")
+
+self._value = Row(*newValue)
+self._defined = True
+self._updated = True
+self._removed = False
+
+def remove(self) -> None:
+self._defined = False
+self._updated = False
+self._removed = True
+
+def setTimeoutDuration(self, durationMs: int) -> None:
+if isinstance(durationMs, str):
+# TODO(SPARK-X): Support string representation of durationMs.

Review Comment:
   Filed SPARK-40437






[GitHub] [spark] HeartSaVioR commented on pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HeartSaVioR commented on PR #37889:
URL: https://github.com/apache/spark/pull/37889#issuecomment-1247528542

   @HyukjinKwon 
   I just missed adding a py file - just added it. Could you please take a look 
again? Thanks!





[GitHub] [spark] LuciferYang commented on pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17

2022-09-14 Thread GitBox


LuciferYang commented on PR #37892:
URL: https://github.com/apache/spark/pull/37892#issuecomment-1247526234

   Test first





[GitHub] [spark] LuciferYang opened a new pull request, #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17

2022-09-14 Thread GitBox


LuciferYang opened a new pull request, #37892:
URL: https://github.com/apache/spark/pull/37892

   ### What changes were proposed in this pull request?
   This PR aims to upgrade Scala to 2.12.17
   
   
   ### Why are the changes needed?
   Will add
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this is a Scala version change.
   
   
   
   
   ### How was this patch tested?
   Existing Test





[GitHub] [spark] zhengruifeng closed pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread GitBox


zhengruifeng closed pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate 
`lit` function with NumPy scalar in sql and pandas module
URL: https://github.com/apache/spark/pull/37888





[GitHub] [spark] Yikun commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


Yikun commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247519104

   @HyukjinKwon OK, thanks!





[GitHub] [spark] HeartSaVioR opened a new pull request, #37891: [SPARK-40433][SS][PYTHON] Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row

2022-09-14 Thread GitBox


HeartSaVioR opened a new pull request, #37891:
URL: https://github.com/apache/spark/pull/37891

   ### What changes were proposed in this pull request?
   
   This PR adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to 
JVM Row.
   
   Co-authored with @HyukjinKwon .
   
   This is a breakdown PR of https://github.com/apache/spark/pull/37863.
   
   ### Why are the changes needed?
   
   This change will be leveraged in 
[SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   N/A. We will make sure test suites are constructed in an E2E manner under 
[SPARK-40431](https://issues.apache.org/jira/browse/SPARK-40431).





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HyukjinKwon commented on code in PR #37889:
URL: https://github.com/apache/spark/pull/37889#discussion_r971467767


##
sql/catalyst/src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java:
##
@@ -32,6 +32,11 @@
 @Experimental
 @Evolving
 public class GroupStateTimeout {
+  // scalastyle:off line.size.limit
+  // NOTE: if you're adding new type of timeout, you should also fix the 
places below:
+  // - Scala: 
org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString
+  // - Python: pyspark.sql.streaming.state.GroupStateTimeout
+  // scalastyle:on line.size.limit

Review Comment:
   ```suggestion
 // NOTE: if you're adding new type of timeout, you should also fix the 
places below:
 // - Scala:
 // 
org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString
 // - Python: pyspark.sql.streaming.state.GroupStateTimeout
   ```



##
sql/catalyst/src/main/java/org/apache/spark/sql/streaming/GroupStateTimeout.java:
##
@@ -32,6 +32,11 @@
 @Experimental
 @Evolving
 public class GroupStateTimeout {
+  // scalastyle:off line.size.limit
+  // NOTE: if you're adding new type of timeout, you should also fix the 
places below:
+  // - Scala: 
org.apache.spark.sql.execution.streaming.GroupStateImpl.getGroupStateTimeoutFromString
+  // - Python: pyspark.sql.streaming.state.GroupStateTimeout
+  // scalastyle:on line.size.limit

Review Comment:
   no big deal :-)






[GitHub] [spark] HyukjinKwon commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247508508

   Oops, I just made a followup at https://github.com/apache/spark/pull/37890.
   
   Actually, `ExpandingGroupby` and `RollingGroupby` aren't documented on the pandas 
side either. I don't know the intention behind that, though. Let's make a quick followup 
only for `Rolling.quantile` and `Expanding.quantile` for now.





[GitHub] [spark] HyukjinKwon opened a new pull request, #37890: [SPARK-40339][SPARK-40342][PS][DOCS][FOLLOW-UP] Add Rolling.quantile and Expanding.quantile into PySpark documentation

2022-09-14 Thread GitBox


HyukjinKwon opened a new pull request, #37890:
URL: https://github.com/apache/spark/pull/37890

   ### What changes were proposed in this pull request?
   
   This PR adds `Rolling.quantile` and `Expanding.quantile` into the documentation.
   
   ### Why are the changes needed?
   
   To show the documentation about the new features to end users.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No to end users because the original PR is not released yet.
   
   ### How was this patch tested?
   
   CI in this PR should test it out.





[GitHub] [spark] LuciferYang commented on pull request #37878: [SPARK-40424][CORE][TESTS] Refactor `ChromeUIHistoryServerSuite` to add UTs for RocksDB

2022-09-14 Thread GitBox


LuciferYang commented on PR #37878:
URL: https://github.com/apache/spark/pull/37878#issuecomment-1247502653

   Rebased and updated the PR description.





[GitHub] [spark] HeartSaVioR opened a new pull request, #37889: [SPARK-40432][SS][PYTHON] Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread GitBox


HeartSaVioR opened a new pull request, #37889:
URL: https://github.com/apache/spark/pull/37889

   ### What changes were proposed in this pull request?
   
   This PR introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
Scala codebase to support convenient conversion between the PySpark and Scala 
implementations.
   
   Co-authored with @HyukjinKwon .
   
   This is a breakdown PR of #37863.
   
   ### Why are the changes needed?
   
   This change will be leveraged in SPARK-40434.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   N/A. We will make sure test suites are constructed in an E2E manner under 
SPARK-40431.





[GitHub] [spark] Yikun commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


Yikun commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247496635

   @HyukjinKwon @zhengruifeng Thanks
   
   > I think we should also add quantile to window.rst
   
   Hmm, I forgot again. I also noticed that the `ExpandingGroupby` and 
`RollingGroupby` functions have no docs, so let me create a separate PR to cover the 
docs for:
   
   - all functions for ExpandingGroupby
   - all functions for RollingGroupby
   - quantile for Expanding
   - quantile for Rolling





[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971460711


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   removed






[GitHub] [spark] zhengruifeng commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


zhengruifeng commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247482347

   @Yikun I think we should also add `quantile` to `window.rst`





[GitHub] [spark] cloud-fan commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


cloud-fan commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971448585


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   `ArrayBasedMapData` is not a public API and shouldn't be tracked by mima.






[GitHub] [spark] zhengruifeng commented on pull request #37874: [SPARK-40421][PS] Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-14 Thread GitBox


zhengruifeng commented on PR #37874:
URL: https://github.com/apache/spark/pull/37874#issuecomment-1247477326

   Merged into master, thanks @HyukjinKwon 





[GitHub] [spark] zhengruifeng closed pull request #37874: [SPARK-40421][PS] Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-14 Thread GitBox


zhengruifeng closed pull request #37874: [SPARK-40421][PS] Make `spearman` 
correlation in `DataFrame.corr` support missing values and `min_periods`
URL: https://github.com/apache/spark/pull/37874





[GitHub] [spark] HyukjinKwon closed pull request #37871: [SPARK-40426][SQL] Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread GitBox


HyukjinKwon closed pull request #37871: [SPARK-40426][SQL] Return a map from 
SparkThrowable.getMessageParameters
URL: https://github.com/apache/spark/pull/37871





[GitHub] [spark] HyukjinKwon commented on pull request #37871: [SPARK-40426][SQL] Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37871:
URL: https://github.com/apache/spark/pull/37871#issuecomment-1247474181

   Merged to master.





[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sadikovi commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r971441087


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)
+  val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- 
partitionColumns

Review Comment:
   I am not an expert in `AttributeSet`, so it would be good if you could explain 
how it makes this work so I can reference it in the future.
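   
   A small illustration (not from this PR) of the property being relied on, assuming 
Catalyst's `AttributeSet` keys membership on each attribute's `exprId`: two references to 
the same column compare equal in an `AttributeSet` even when cosmetic details such as the 
name differ, whereas `Seq.contains` uses full object equality:
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
   import org.apache.spark.sql.types.IntegerType
   
   val part = AttributeReference("p", IntegerType)()   // fresh exprId
   val samePartRenamed = part.withName("P")            // same exprId, different name
   
   assert(!Seq(part).contains(samePartRenamed))              // plain equality misses it
   assert(AttributeSet(Seq(part)).contains(samePartRenamed)) // exprId-based lookup finds it
   
   // Hence `AttributeSet(dataColumns) -- partitionColumns` drops partition columns
   // even when the attribute instances are not strictly equal.
   ```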






[GitHub] [spark] HyukjinKwon closed pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


HyukjinKwon closed pull request #37836: 
[SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in 
Rolling/RollingGroupby/Expanding/ExpandingGroupby
URL: https://github.com/apache/spark/pull/37836





[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sadikovi commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r971441087


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)
+  val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- 
partitionColumns

Review Comment:
   I am not an expert in `AttributeSet`, so it would be good if you could explain 
how it makes this work and pass the test so I can reference it in the future.






[GitHub] [spark] HyukjinKwon commented on pull request #37836: [SPARK-40339][SPARK-40342][SPARK-40345][SPARK-40348][PS] Implement quantile in Rolling/RollingGroupby/Expanding/ExpandingGroupby

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37836:
URL: https://github.com/apache/spark/pull/37836#issuecomment-1247473695

   Merged to master





[GitHub] [spark] huaxingao commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread GitBox


huaxingao commented on code in PR #37886:
URL: https://github.com/apache/spark/pull/37886#discussion_r971441803


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala:
##
@@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends 
Rule[LogicalPlan] with SQLConfHelpe
   private def partitioning(plan: LogicalPlan) = plan.transformDown {
 case d @ DataSourceV2ScanRelation(relation, scan: 
SupportsReportPartitioning, _, None, _) =>
   val catalystPartitioning = scan.outputPartitioning() match {
-case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map(
-  V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
+case kgp: KeyGroupedPartitioning =>
+  val partitioning = sequenceToOption(
+kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, 
relation.funCatalog)))
+  if (partitioning.isEmpty) {
+None
+  } else {
+val ref = 
AttributeSet.fromAttributeSets(partitioning.get.map(_.references))

Review Comment:
   Sounds good! Changed.






[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sadikovi commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r971441087


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)
+  val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- 
partitionColumns

Review Comment:
   I am not an expert in `AttributeSet`, so it would be good if you could explain 
how it works so I can reference it in the future.






[GitHub] [spark] sarutak closed pull request #37868: [SPARK-40397][BUILD] Upgrade `org.scalatestplus:selenium` to 3.12.13

2022-09-14 Thread GitBox


sarutak closed pull request #37868: [SPARK-40397][BUILD] Upgrade 
`org.scalatestplus:selenium` to 3.12.13
URL: https://github.com/apache/spark/pull/37868





[GitHub] [spark] sarutak commented on pull request #37868: [SPARK-40397][BUILD] Upgrade `org.scalatestplus:selenium` to 3.12.13

2022-09-14 Thread GitBox


sarutak commented on PR #37868:
URL: https://github.com/apache/spark/pull/37868#issuecomment-1247470619

   Merging to `master`.





[GitHub] [spark] HyukjinKwon commented on pull request #37878: [SPARK-40424][CORE][TESTS] Refactor `ChromeUIHistoryServerSuite` to add UTs for RocksDB

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37878:
URL: https://github.com/apache/spark/pull/37878#issuecomment-1247466952

   cc @sarutak FYI





[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971425920


##
core/src/main/scala/org/apache/spark/util/collection/Utils.scala:
##
@@ -62,4 +63,30 @@ private[spark] object Utils {
*/
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
 if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+val builder = immutable.Map.newBuilder[K, V]
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+}
+builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): 
java.util.Map[K, V] = {
+val map = new java.util.HashMap[K, V]()
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  map.put(keyIter.next(), valueIter.next())
+}
+map

Review Comment:
   You're right; changed to return `Collections.unmodifiableMap(map)`.
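
   For reference, a minimal sketch of the adjusted helper, assuming it stays in the same `Utils` object as the diff above; only the return value changes:

   ```scala
   /** Same function as `keys.zip(values).toMap.asJava`, but has perf gain. */
   def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = {
     val map = new java.util.HashMap[K, V]()
     val keyIter = keys.iterator
     val valueIter = values.iterator
     while (keyIter.hasNext && valueIter.hasNext) {
       map.put(keyIter.next(), valueIter.next())
     }
     // Wrap the result so callers cannot mutate it, matching the immutability of
     // the original keys.zip(values).toMap.asJava.
     java.util.Collections.unmodifiableMap(map)
   }
   ```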



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread GitBox


cloud-fan commented on code in PR #37886:
URL: https://github.com/apache/spark/pull/37886#discussion_r971424921


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala:
##
@@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends 
Rule[LogicalPlan] with SQLConfHelpe
   private def partitioning(plan: LogicalPlan) = plan.transformDown {
 case d @ DataSourceV2ScanRelation(relation, scan: 
SupportsReportPartitioning, _, None, _) =>
   val catalystPartitioning = scan.outputPartitioning() match {
-case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map(
-  V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
+case kgp: KeyGroupedPartitioning =>
+  val partitioning = sequenceToOption(
+kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, 
relation.funCatalog)))
+  if (partitioning.isEmpty) {
+None
+  } else {
+val ref = 
AttributeSet.fromAttributeSets(partitioning.get.map(_.references))

Review Comment:
   how about `partitioning.get.forall(p => p.references.subsetOf(...))`
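
   A rough sketch of how the branch could read with that suggestion applied; it reuses the surrounding names (`scan`, `kgp`, `relation`, `d`) from the diff above and is illustrative only, not the PR's final code:

   ```scala
   val catalystPartitioning = scan.outputPartitioning() match {
     case kgp: KeyGroupedPartitioning =>
       sequenceToOption(
         kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
         // Keep the reported partitioning only when every key refers solely to output columns.
         .filter(_.forall(_.references.subsetOf(d.outputSet)))
     case _ => None
   }
   ```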



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread GitBox


cloud-fan commented on code in PR #37886:
URL: https://github.com/apache/spark/pull/37886#discussion_r971424375


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanPartitioningAndOrdering.scala:
##
@@ -41,8 +41,19 @@ object V2ScanPartitioningAndOrdering extends 
Rule[LogicalPlan] with SQLConfHelpe
   private def partitioning(plan: LogicalPlan) = plan.transformDown {
 case d @ DataSourceV2ScanRelation(relation, scan: 
SupportsReportPartitioning, _, None, _) =>
   val catalystPartitioning = scan.outputPartitioning() match {
-case kgp: KeyGroupedPartitioning => sequenceToOption(kgp.keys().map(
-  V2ExpressionUtils.toCatalystOpt(_, relation, relation.funCatalog)))
+case kgp: KeyGroupedPartitioning =>
+  val partitioning = sequenceToOption(
+kgp.keys().map(V2ExpressionUtils.toCatalystOpt(_, relation, 
relation.funCatalog)))
+  if (partitioning.isEmpty) {
+None
+  } else {
+val ref = 
AttributeSet.fromAttributeSets(partitioning.get.map(_.references))
+if (ref.subsetOf(AttributeSet(d.output))) {

Review Comment:
   ```suggestion
   if (ref.subsetOf(d.outputSet)) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971423880


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   let me check this later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971408078


##
core/src/main/scala/org/apache/spark/util/collection/Utils.scala:
##
@@ -62,4 +63,30 @@ private[spark] object Utils {
*/
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
 if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+val builder = immutable.Map.newBuilder[K, V]
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+}
+builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): 
java.util.Map[K, V] = {
+val map = new java.util.HashMap[K, V]()
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  map.put(keyIter.next(), valueIter.next())
+}
+map

Review Comment:
   ~~this method returns a Java Map; how do we make it immutable...~~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971411597


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   `ArrayBasedMapData#toJavaMap` is already an unused method; I think we can delete it, but we need to confirm that the MiMa check passes first.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zzccctv commented on pull request #31302: [SPARK-34210][SQL] After upgrading 3.0.1, Spark SQL access hive on HBase table access exception

2022-09-14 Thread GitBox


zzccctv commented on PR #31302:
URL: https://github.com/apache/spark/pull/31302#issuecomment-1247436633

   
   
   
   > How should I apply this change
   
   Change it to the code in the PR and recompile it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread GitBox


xinrong-meng commented on PR #37888:
URL: https://github.com/apache/spark/pull/37888#issuecomment-1247433538

   CC @ueshin @zhengruifeng @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971408657


##
core/src/main/scala/org/apache/spark/util/collection/Utils.scala:
##
@@ -62,4 +63,30 @@ private[spark] object Utils {
*/
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
 if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+val builder = immutable.Map.newBuilder[K, V]
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+}
+builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): 
java.util.Map[K, V] = {
+val map = new java.util.HashMap[K, V]()
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  map.put(keyIter.next(), valueIter.next())
+}
+map

Review Comment:
   Wrap it in `Collections.unmodifiableMap`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


HyukjinKwon closed pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide 
TimestampNTZ in the doc
URL: https://github.com/apache/spark/pull/37882


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on a diff in pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread GitBox


xinrong-meng commented on code in PR #37888:
URL: https://github.com/apache/spark/pull/37888#discussion_r971408190


##
python/pyspark/pandas/tests/test_spark_functions.py:
##
@@ -20,25 +20,12 @@
 from pyspark.pandas.spark import functions as SF
 from pyspark.pandas.utils import spark_column_equals
 from pyspark.sql import functions as F
-from pyspark.sql.types import (
-ByteType,
-FloatType,
-IntegerType,
-LongType,
-)
 from pyspark.testing.pandasutils import PandasOnSparkTestCase
 
 
 class SparkFunctionsTests(PandasOnSparkTestCase):
-def test_lit(self):
-self.assertTrue(spark_column_equals(SF.lit(np.int64(1)), 
F.lit(1).astype(LongType(
-self.assertTrue(spark_column_equals(SF.lit(np.int32(1)), 
F.lit(1).astype(IntegerType(
-self.assertTrue(spark_column_equals(SF.lit(np.int8(1)), 
F.lit(1).astype(ByteType(
-self.assertTrue(spark_column_equals(SF.lit(np.byte(1)), 
F.lit(1).astype(ByteType(
-self.assertTrue(
-spark_column_equals(SF.lit(np.float32(1)), 
F.lit(float(1)).astype(FloatType()))
-)
-self.assertTrue(spark_column_equals(SF.lit(1), F.lit(1)))
+def test_repeat(self):

Review Comment:
   This change is unrelated to the PR proposal but is made to keep the `test_spark_functions` file.
   
   We shall complete the tests for `pyspark/pandas/spark/functions` here later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


HyukjinKwon closed pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide 
TimestampNTZ in the doc
URL: https://github.com/apache/spark/pull/37883


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


LuciferYang commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971408078


##
core/src/main/scala/org/apache/spark/util/collection/Utils.scala:
##
@@ -62,4 +63,30 @@ private[spark] object Utils {
*/
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
 if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+val builder = immutable.Map.newBuilder[K, V]
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+}
+builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): 
java.util.Map[K, V] = {
+val map = new java.util.HashMap[K, V]()
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  map.put(keyIter.next(), valueIter.next())
+}
+map

Review Comment:
   this method returns a Java Map; how do we make it immutable...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37882:
URL: https://github.com/apache/spark/pull/37882#issuecomment-1247431419

   Merged to branch-3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


HyukjinKwon commented on PR #37883:
URL: https://github.com/apache/spark/pull/37883#issuecomment-1247431164

   Merged to branch-3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-14 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r971407420


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -1378,28 +1378,84 @@ case class Pivot(
  * A constructor for creating an Unpivot, which will later be converted to an 
[[Expand]]
  * during the query analysis.
  *
- * An empty values array will be replaced during analysis with all resolved 
outputs of child except
+ * Either ids or values array must be set. The ids array can be empty,
+ * the values array must not be empty if not None.
+ *
+ * A None ids array will be replaced during analysis with all resolved outputs 
of child except
+ * the values. This expansion allows to easily select all non-value columns as 
id columns.
+ *
+ * A None values array will be replaced during analysis with all resolved 
outputs of child except

Review Comment:
   Ah I see, let's add an assert here to guarantee this assumption.
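
   A self-contained toy illustrating the invariant such an assert would guarantee (the real `Unpivot` node has more fields; the names here are placeholders, not the PR's actual code):

   ```scala
   final case class UnpivotSpec(
       ids: Option[Seq[String]],
       values: Option[Seq[Seq[String]]]) {
     // Either ids or values must be set; a values array, if given, must not be empty.
     assert(ids.isDefined || values.isDefined,
       "Unpivot requires at least one of ids or values to be specified")
     assert(values.forall(_.nonEmpty),
       "If values are given, the list must not be empty")
   }
   ```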



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-14 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r971407082


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -1378,28 +1378,84 @@ case class Pivot(
  * A constructor for creating an Unpivot, which will later be converted to an 
[[Expand]]
  * during the query analysis.
  *
- * An empty values array will be replaced during analysis with all resolved 
outputs of child except
+ * Either ids or values array must be set. The ids array can be empty,
+ * the values array must not be empty if not None.
+ *
+ * A None ids array will be replaced during analysis with all resolved outputs 
of child except
+ * the values. This expansion allows to easily select all non-value columns as 
id columns.
+ *
+ * A None values array will be replaced during analysis with all resolved 
outputs of child except
  * the ids. This expansion allows to easily unpivot all non-id columns.
  *
  * @see `org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveUnpivot`
  *
- * The type of the value column is derived from all value columns during 
analysis once all values
- * are resolved. All values' types have to be compatible, otherwise the result 
value column cannot
- * be assigned the individual values and an AnalysisException is thrown.
+ * Multiple columns can be unpivoted in one row by providing multiple value 
column names
+ * and the same number of unpivot value expressions:
+ * {{{
+ *   // one-dimensional value columns
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   Seq("val1"),
+ *   Seq("val2")
+ * )),
+ * None,
+ * "var",
+ * Seq("val")
+ *   )
+ *
+ *   // two-dimensional value columns
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   Seq("val1.1", "val1.2"),
+ *   Seq("val2.1", "val2.2"))
+ * ),
+ * None,
+ * "var",
+ * Seq("val1", "val2")
+ *   )
+ * }}}
+ *
+ * The variable column will contain the name of the unpivot value while the 
value columns contain
+ * the unpivot values. Multi-dimensional unpivot values can be given `aliases`:
+ * }}}
+ *   // two-dimensional value columns with aliases
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   Seq("val1.1", "val1.2"),
+ *   Seq("val2.1", "val2.2"))
+ * ),
+ * Some(Seq(
+ *   Some("val1"),
+ *   Some("val2")
+ * )),
+ * "var",
+ * Seq("val1", "val2")
+ *   )
+ * }}}
+ *
+ * All "value" columns must share a least common data type. Unless they are 
the same data type,
+ * all "value" columns are cast to the nearest common data type. For instance,
+ * types `IntegerType` and `LongType` are cast to `LongType`, while 
`IntegerType` and `StringType`
+ * do not have a common data type and `unpivot` fails with an 
`AnalysisException`.
  *
  * @see 
`org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.UnpivotCoercion`
  *
  * @param idsId columns
  * @param values Value columns to unpivot
+ * @param aliasesOptional aliases for values
  * @param variableColumnName Name of the variable column
- * @param valueColumnNameName of the value column
+ * @param valueColumnNames   Names of the value columns
  * @param child  Child operator
  */
 case class Unpivot(
-ids: Seq[NamedExpression],
-values: Seq[NamedExpression],
+ids: Option[Seq[NamedExpression]],
+values: Option[Seq[Seq[NamedExpression]]],
+aliases: Option[Seq[Option[String]]],

Review Comment:
   It's in `QueryPlan.mapExpressions`, Spark does not accept expressions in 
`Tuple2`...
   
   Another way is to create a `case class UnpivotValueColumns(children: 
Seq[Expression], alias: Option[String]) extends Expression`, and then use 
`values: Option[Seq[UnpivotValueColumns]]` here.
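
   A hedged sketch of that alternative, fleshed out just enough to be concrete; the `Unevaluable` mixin, the `dataType`/`nullable` choices, and the `withNewChildrenInternal` override are assumptions about how such a helper node would typically be written, not this PR's actual code:

   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}
   import org.apache.spark.sql.types.DataType

   case class UnpivotValueColumns(
       children: Seq[Expression],
       alias: Option[String]) extends Expression with Unevaluable {
     // One unpivot "value" group; assumes non-empty children, the analyzer would
     // coerce all groups to a common type later.
     override def dataType: DataType = children.head.dataType
     override def nullable: Boolean = children.exists(_.nullable)
     override protected def withNewChildrenInternal(
         newChildren: IndexedSeq[Expression]): Expression = copy(children = newChildren)
   }
   ```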



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #36766: [SPARK-32184][SQL] Remove inferred predicate if it has InOrCorrelatedExistsSubquery

2022-09-14 Thread GitBox


github-actions[bot] commented on PR #36766:
URL: https://github.com/apache/spark/pull/36766#issuecomment-1247429345

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #36626: [SPARK-39249][SQL] Improve subexpression elimination for conditional expressions

2022-09-14 Thread GitBox


github-actions[bot] commented on PR #36626:
URL: https://github.com/apache/spark/pull/36626#issuecomment-1247429366

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread GitBox


xinrong-meng commented on PR #37888:
URL: https://github.com/apache/spark/pull/37888#issuecomment-1247424459

   It may be easier to review by commits :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng opened a new pull request, #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread GitBox


xinrong-meng opened a new pull request, #37888:
URL: https://github.com/apache/spark/pull/37888

   ### What changes were proposed in this pull request?
   Consolidate `lit` function with NumPy scalar in sql and pandas module.
   
   Major changes:
   - More accurate types for `lit` in sql with NumPy scalar input
   - Replace `lit` in pandas with `lit` in sql
   
   ### Why are the changes needed?
   As part of [SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405) 
for NumPy support in SQL.
   
   That provides more precise and consistent NumPy support across PySpark modules.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. More accurate types for `lit`.
   
   
   ### How was this patch tested?
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971397869


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -823,12 +834,16 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog {
   // it must also be in the aggregate expressions to be rewritten 
in the optimization
   // phase.
   if (containsExpr(a.groupingExpressions) && 
!containsExpr(a.aggregateExpressions)) {
-failAnalysis("Correlated scalar subqueries in the group by 
clause " +
-  s"must also be in the aggregate expressions:\n$a")
+a.failAnalysis(
+  errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  errorSubClass = "MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY",
+  treeNodes = Seq(a))

Review Comment:
   Good Q. The way I implemented it, `a` is a `TreeNode`. This overload of 
`failAnalysis` I added takes a `Seq[TreeNode[_]]` and then assigns their 
`toString`s to the `treeNode` parameter. I do this in the `failAnalysis` 
overload itself in order to normalize the expression IDs in the string to keep 
tests deterministic (the LogicalPlan `canonicalized` method aims to support 
this as well, but is not available during analysis since it uses `transformUp`).
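
   A minimal, self-contained sketch of the ID-normalization idea described above (the real overload lives in `CheckAnalysis` and works on `TreeNode` strings; the helper name and regex here are illustrative assumptions):

   ```scala
   object PlanStringNormalizer {
     // Rewrite expression IDs like "c1#23" to "c1#x" so error text stays deterministic in tests.
     def normalizeExprIds(planString: String): String =
       planString.replaceAll("#\\d+", "#x")
   }
   ```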



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #37884: [WIP][SPARK-40427][SQL] Move LIMIT/OFFSET CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


AmplabJenkins commented on PR #37884:
URL: https://github.com/apache/spark/pull/37884#issuecomment-1247415506

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng closed pull request #37777: [WIP][SPARK-40309][PYTHON][PS] Introduce `sql_conf` context manager for `pyspark.sql`

2022-09-14 Thread GitBox


xinrong-meng closed pull request #37777: [WIP][SPARK-40309][PYTHON][PS] 
Introduce `sql_conf` context manager for `pyspark.sql`
URL: https://github.com/apache/spark/pull/37777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


gengliangwang commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971390448


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -823,12 +834,16 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog {
   // it must also be in the aggregate expressions to be rewritten 
in the optimization
   // phase.
   if (containsExpr(a.groupingExpressions) && 
!containsExpr(a.aggregateExpressions)) {
-failAnalysis("Correlated scalar subqueries in the group by 
clause " +
-  s"must also be in the aggregate expressions:\n$a")
+a.failAnalysis(
+  errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  errorSubClass = "MUST_AGGREGATE_CORRELATED_SCALAR_SUBQUERY",
+  treeNodes = Seq(a))

Review Comment:
   why do we need to pass `a` itself?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau opened a new pull request, #37887: [SPARK-40360] ALREADY_EXISTS and NOT_FOUND exceptions

2022-09-14 Thread GitBox


srielau opened a new pull request, #37887:
URL: https://github.com/apache/spark/pull/37887

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau closed pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error

2022-09-14 Thread GitBox


srielau closed pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and 
*_NOT_FOUND error
URL: https://github.com/apache/spark/pull/37811


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikun commented on a diff in pull request #37671: [SPARK-40229][PS][TEST] Re-enable excel I/O test for pandas API on Spark

2022-09-14 Thread GitBox


Yikun commented on code in PR #37671:
URL: https://github.com/apache/spark/pull/37671#discussion_r971381963


##
python/pyspark/pandas/tests/test_dataframe_conversion.py:
##
@@ -90,7 +90,6 @@ def get_excel_dfs(pandas_on_spark_location, pandas_location):
 "expected": pd.read_excel(pandas_location, index_col=0),
 }
 
-@unittest.skip("openpyxl")
 def test_to_excel(self):

Review Comment:
   Thanks, good to know!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread GitBox


huaxingao commented on PR #37886:
URL: https://github.com/apache/spark/pull/37886#issuecomment-1247393264

   cc @cloud-fan @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rahulbhatia2702 commented on pull request #32397: [WIP][SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2022-09-14 Thread GitBox


rahulbhatia2702 commented on PR #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-1247390380

   Is there a solution for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971345715


##
sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:
##
@@ -317,11 +317,20 @@ SELECT * FROM t1, LATERAL (SELECT c1 + c2 + rand(0) AS c3)
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Non-deterministic lateral subqueries are not supported when joining with outer 
relations that produce more than one row
-SubqueryAlias __auto_generated_subquery_name
-+- Project [(cast((outer(c1#x) + outer(c2#x)) as double) + rand(0)) AS c3#x]
-   +- OneRowRelation
-; line 1 pos 9
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "NON_DETERMINISTIC_LATERAL_SUBQUERIES",
+  "messageParameters" : {
+"treeNode" : ": !LateralJoin lateral-subquery#5 [c1#6 && c2#7], Inner\n:  
+- SubqueryAlias __auto_generated_subquery_name\n: +- Project 
[(cast((outer(c1#2) + outer(c2#3)) as double) + rand(0)) AS c3#4]\n:+- 
OneRowRelation\n+- SubqueryAlias spark_catalog.default.t1\n   +- View 
(`spark_catalog`.`default`.`t1`, [c1#2,c2#3])\n  +- Project [cast(col1#0 as 
int) AS c1#2, cast(col2#1 as int) AS c2#3]\n +- LocalRelation [col1#0, 
col2#1]\n"

Review Comment:
   Yes we did discuss this already in the PR threads :) I suggested eliding the 
plan strings given that we have the query context now, but some folks opined 
that the strings are helpful for debugging. In the end I struck a compromise by 
including the strings but covered by a new SQLConf to toggle their presence. 
This new config can remain enabled by default for now in Apache Spark so folks 
who are used to the strings for debugging may continue to use them.
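
   For illustration, such a flag would typically be declared with `SQLConf.buildConf`; the config name, doc text, version, and default below are assumptions, not necessarily what this PR adds:

   ```scala
   // Assumes it sits inside org.apache.spark.sql.internal.SQLConf, next to the other flags.
   val INCLUDE_PLAN_IN_SUBQUERY_ERRORS =
     buildConf("spark.sql.analyzer.includePlanInSubqueryErrors")
       .doc("When true, unsupported-subquery error messages also include the offending " +
         "logical plan text in addition to the query context.")
       .version("3.4.0")
       .booleanConf
       .createWithDefault(true)
   ```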



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971345715


##
sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:
##
@@ -317,11 +317,20 @@ SELECT * FROM t1, LATERAL (SELECT c1 + c2 + rand(0) AS c3)
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Non-deterministic lateral subqueries are not supported when joining with outer 
relations that produce more than one row
-SubqueryAlias __auto_generated_subquery_name
-+- Project [(cast((outer(c1#x) + outer(c2#x)) as double) + rand(0)) AS c3#x]
-   +- OneRowRelation
-; line 1 pos 9
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "NON_DETERMINISTIC_LATERAL_SUBQUERIES",
+  "messageParameters" : {
+"treeNode" : ": !LateralJoin lateral-subquery#5 [c1#6 && c2#7], Inner\n:  
+- SubqueryAlias __auto_generated_subquery_name\n: +- Project 
[(cast((outer(c1#2) + outer(c2#3)) as double) + rand(0)) AS c3#4]\n:+- 
OneRowRelation\n+- SubqueryAlias spark_catalog.default.t1\n   +- View 
(`spark_catalog`.`default`.`t1`, [c1#2,c2#3])\n  +- Project [cast(col1#0 as 
int) AS c1#2, cast(col2#1 as int) AS c2#3]\n +- LocalRelation [col1#0, 
col2#1]\n"

Review Comment:
   Yes we did discuss this already in the PR threads :) I suggested eliding the 
plan strings given that we have the query context now, but some folks opined 
that the strings are helpful for debugging. In the end I struck a compromise by 
including the strings but covered by a new SQLConf to toggle their presence. 
This new config can remain enabled by default for now in Apache Spark so folks 
who are used to the strings for debugging may continue to use them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sadikovi commented on code in PR #37881:
URL: https://github.com/apache/spark/pull/37881#discussion_r971343049


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##
@@ -186,10 +186,10 @@ object FileSourceStrategy extends Strategy with 
PredicateHelper with Logging {
 
   // Partition keys are not available in the statistics of the files.
   // `dataColumns` might have partition columns, we need to filter them 
out.
-  val dataColumnsWithoutPartitionCols = 
dataColumns.filterNot(partitionColumns.contains)
+  val dataColumnsWithoutPartitionCols = AttributeSet(dataColumns) -- 
partitionColumns

Review Comment:
   I am not sure this is correct. Can you elaborate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


gengliangwang commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971337689


##
sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:
##
@@ -317,11 +317,20 @@ SELECT * FROM t1, LATERAL (SELECT c1 + c2 + rand(0) AS c3)
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Non-deterministic lateral subqueries are not supported when joining with outer 
relations that produce more than one row
-SubqueryAlias __auto_generated_subquery_name
-+- Project [(cast((outer(c1#x) + outer(c2#x)) as double) + rand(0)) AS c3#x]
-   +- OneRowRelation
-; line 1 pos 9
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "NON_DETERMINISTIC_LATERAL_SUBQUERIES",
+  "messageParameters" : {
+"treeNode" : ": !LateralJoin lateral-subquery#5 [c1#6 && c2#7], Inner\n:  
+- SubqueryAlias __auto_generated_subquery_name\n: +- Project 
[(cast((outer(c1#2) + outer(c2#3)) as double) + rand(0)) AS c3#4]\n:+- 
OneRowRelation\n+- SubqueryAlias spark_catalog.default.t1\n   +- View 
(`spark_catalog`.`default`.`t1`, [c1#2,c2#3])\n  +- Project [cast(col1#0 as 
int) AS c1#2, cast(col2#1 as int) AS c2#3]\n +- LocalRelation [col1#0, 
col2#1]\n"

Review Comment:
   Given there is already a query context in the error message, are we sure about showing the logical plan as well? (This PR contains a lot of threads; I am not sure if this has been discussed.)
   cc @srielau 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37885: [SPARK-40428][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend

2022-09-14 Thread GitBox


dongjoon-hyun commented on code in PR #37885:
URL: https://github.com/apache/spark/pull/37885#discussion_r971295296


##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -127,6 +127,19 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   
ThreadUtils.newDaemonSingleThreadScheduledExecutor("cleanup-decommission-execs")
 }
 
+  private var _shutdownHookRef: AnyRef = _
+
+  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
+ShutdownHookManager.DEFAULT_SHUTDOWN_PRIORITY) { () =>

Review Comment:
   nit. indentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao opened a new pull request, #37886: [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread GitBox


huaxingao opened a new pull request, #37886:
URL: https://github.com/apache/spark/pull/37886

   
   ### What changes were proposed in this pull request?
   Only set `KeyGroupedPartitioning` when the referenced column is in the output
   
   ### Why are the changes needed?
   bug fixing
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   new test
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error

2022-09-14 Thread GitBox


srielau commented on code in PR #37811:
URL: https://github.com/apache/spark/pull/37811#discussion_r971174244


##
core/src/main/resources/error/error-classes.json:
##
@@ -400,6 +418,21 @@
 ],
 "sqlState" : "42000"
   },
+  "PARTITIONS_ALREADY_EXIST" : {
+"message" : [
+  "Cannot ADD or RENAME TO partition(s)  in table 
 because they already exist.",
+  "Choose a different name, drop the existing partition, or add the IF NOT 
EXISTS clause to tolerate a pre-existing partition."
+],
+"sqlState" : "42000"
+  },
+  "PARTITIONS_NOT_FOUND" : {
+"message" : [
+  "The partition(s)  cannot be found in table 
.",

Review Comment:
   ```suggestion
 "The partition(s)  cannot be found in table 
.",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk opened a new pull request, #37885: [SPARK-40428][K8S][CORE][WIP] Add a shutdown hook in the CoarseGrainedSchedulerBackend

2022-09-14 Thread GitBox


holdenk opened a new pull request, #37885:
URL: https://github.com/apache/spark/pull/37885

   ### What changes were proposed in this pull request?
   
   Add a shutdown hook in the CoarseGrainedSchedulerBackend
   
   ### Why are the changes needed?
   
   Sometimes, if the driver shuts down abnormally, resources may be left dangling.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing tests.
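
   A rough sketch of the pattern this PR describes (the diff in the review thread above shows the registration itself; the priority and the cleanup body below are placeholders, not the PR's exact code):

   ```scala
   import org.apache.spark.util.ShutdownHookManager

   // Registered once when the backend starts; best-effort cleanup if the driver exits abnormally.
   private val shutdownHookRef: AnyRef =
     ShutdownHookManager.addShutdownHook(ShutdownHookManager.DEFAULT_SHUTDOWN_PRIORITY) { () =>
       // Placeholder body: release executors / stop the backend before the JVM exits.
       stop()
     }
   ```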


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on a diff in pull request #37811: [SPARK-40360] *_ALREADY_EXISTS and *_NOT_FOUND error

2022-09-14 Thread GitBox


srielau commented on code in PR #37811:
URL: https://github.com/apache/spark/pull/37811#discussion_r971132688


##
docs/sql-performance-tuning.md:
##
@@ -27,9 +27,9 @@ turning on some experimental options.
 
 ## Caching Data In Memory
 
-Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
+Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("table_name")` or `dataFrame.cache()`.
 Then Spark SQL will scan only required columns and will automatically tune 
compression to minimize
-memory usage and GC pressure. You can call 
`spark.catalog.uncacheTable("tableName")` or `dataFrame.unpersist()` to remove 
the table from memory.
+memory usage and GC pressure. You can call 
`spark.catalog.uncacheTable("table_name")` or `dataFrame.unpersist()` to remove 
the table from memory.

Review Comment:
   ```suggestion
   memory usage and GC pressure. You can call 
`spark.catalog.uncacheTable("tableName")` or `dataFrame.unpersist()` to remove 
the table from memory.
   ```



##
R/pkg/tests/fulltests/test_sparkSQL.R:
##
@@ -662,8 +662,8 @@ test_that("test tableNames and tables", {
   tables <- listTables()
   expect_equal(count(tables), count + 1)
   expect_equal(count(tables()), count(tables))
-  expect_true("tableName" %in% colnames(tables()))
-  expect_true(all(c("tableName", "namespace", "isTemporary") %in% 
colnames(tables(
+  expect_true("table_name" %in% colnames(tables()))
+  expect_true(all(c("table_name", "namespace", "isTemporary") %in% 
colnames(tables(

Review Comment:
   ```suggestion
 expect_true("tableName" %in% colnames(tables()))
 expect_true(all(c("tableName", "namespace", "isTemporary") %in% 
colnames(tables(
   ```



##
R/pkg/R/catalog.R:
##
@@ -291,7 +291,7 @@ dropTempView <- function(viewName) {
 #' @note tables since 1.4.0
 tables <- function(databaseName = NULL) {
   # rename column to match previous output schema
-  withColumnRenamed(listTables(databaseName), "name", "tableName")
+  withColumnRenamed(listTables(databaseName), "name", "table_name")

Review Comment:
   ```suggestion
 withColumnRenamed(listTables(databaseName), "name", "tableName")
   ```



##
docs/sql-performance-tuning.md:
##
@@ -27,9 +27,9 @@ turning on some experimental options.
 
 ## Caching Data In Memory
 
-Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
+Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("table_name")` or `dataFrame.cache()`.

Review Comment:
   ```suggestion
   Spark SQL can cache tables using an in-memory columnar format by calling 
`spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
   ```



##
sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala:
##
@@ -136,7 +136,7 @@ class SQLContextSuite extends SparkFunSuite with 
SharedSparkContext {
 "SELECT isTemporary, tableName from tables WHERE tableName = 
'listtablessuitetable'")
 .collect().toSeq == Row(true, "listtablessuitetable") :: Nil)
 assert(
-  sqlContext.tables().filter("tableName = 
'tables'").select("tableName", "isTemporary")
+  sqlContext.tables().filter("tableName = 
'tables'").select("table_name", "isTemporary")

Review Comment:
   ```suggestion
 sqlContext.tables().filter("tableName = 
'tables'").select("tableName", "isTemporary")
   ```



##
docs/structured-streaming-programming-guide.md:
##
@@ -2170,7 +2170,7 @@ Hence, use it with caution.
 {% highlight scala %}
 writeStream
 .format("memory")
-.queryName("tableName")
+.queryName("table_name")

Review Comment:
   ```suggestion
   .queryName("tableName")
   ```



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##
@@ -716,7 +716,7 @@ case class ShowTableExtended(
 object ShowTableExtended {
   def getOutputAttrs: Seq[Attribute] = Seq(
 AttributeReference("namespace", StringType, nullable = false)(),
-AttributeReference("tableName", StringType, nullable = false)(),
+AttributeReference("table_name", StringType, nullable = false)(),

Review Comment:
   ```suggestion
   AttributeReference("tableName", StringType, nullable = false)(),
   ```



##
sql/core/src/test/scala/org/apache/spark/sql/SQLContextSuite.scala:
##
@@ -123,7 +123,7 @@ class SQLContextSuite extends SparkFunSuite with 
SharedSparkContext {
 
 val expectedSchema = StructType(
   StructField("namespace", StringType, false) ::
-StructField("tableName", StringType, false) ::
+StructField("table_name", StringType, false) ::

Review Comment:
   ```suggestion
   StructField("tableName", StringType, false) ::
   ```




[GitHub] [spark] dtenedor opened a new pull request, #37884: [WIP][SPARK-40427][SQL] Move LIMIT/OFFSET CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor opened a new pull request, #37884:
URL: https://github.com/apache/spark/pull/37884

   ### What changes were proposed in this pull request?
   
   Move LIMIT/OFFSET CheckAnalysis error messages to use the new error 
framework.
   
   This will help improve the usability of Apache Spark as a product and allow 
error messages to link against documentation.
   
   ### Why are the changes needed?
   
   Error messages related to SQL query analysis in Apache Spark need some work 
to make them more descriptive and actionable.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, error messages change.
   
   ### How was this patch tested?
   
   Unit test and query test coverage shows the updates in error messages 
received.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel

2022-09-14 Thread GitBox


srowen commented on PR #18748:
URL: https://github.com/apache/spark/pull/18748#issuecomment-1247113606

   What do you mean? It's in the API:
   
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALSModel.html#pyspark.ml.recommendation.ALSModel.recommendForUserSubset
   
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/recommendation/ALSModel.html#recommendForUserSubset-org.apache.spark.sql.Dataset-int-
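
   For context, a minimal usage sketch -- assuming a `SparkSession` named 
`spark`, an already fitted `model: ALSModel` on Spark 2.3.0 or later (where the 
method exists), and that ALS was trained with `userCol = "user"`:

   ```scala
   import spark.implicits._

   // Subset of user ids we want recommendations for; the column name must
   // match the model's userCol.
   val userSubset = Seq(1, 2, 3).toDF("user")

   // Top 10 item recommendations for just these users.
   val recs = model.recommendForUserSubset(userSubset, 10)
   recs.show(truncate = false)
   ```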


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on PR #37840:
URL: https://github.com/apache/spark/pull/37840#issuecomment-1247100836

   Hi @MaxGekk, I responded to your comments; please take another look at this 
when you have time. @gengliangwang @allisonwang-db @srielau FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971116242


##
sql/core/src/test/resources/sql-tests/results/subquery/negative-cases/invalid-correlation.sql.out:
##
@@ -105,14 +131,20 @@ WHERE  t1a IN (SELECT t2a
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
-Aggregate [min(outer(t2a#x)) AS min(outer(t2.t2a))#x]
-+- SubqueryAlias t3
-   +- View (`t3`, [t3a#x,t3b#x,t3c#x])
-  +- Project [cast(t3a#x as int) AS t3a#x, cast(t3b#x as int) AS t3b#x, 
cast(t3c#x as int) AS t3c#x]
- +- Project [t3a#x, t3b#x, t3c#x]
-+- SubqueryAlias t3
-   +- LocalRelation [t3a#x, t3b#x, t3c#x]
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "UNSUPPORTED_CORRELATED_REFERENCE",
+  "messageParameters" : {
+"planString" : ": Aggregate [min(outer(t2a#6)) AS 
min(outer(t2.t2a))#7]\n+- SubqueryAlias t3\n   +- View (`t3`, 
[t3a#3,t3b#4,t3c#5])\n  +- Project [cast(t3a#0 as int) AS t3a#3, cast(t3b#1 
as int) AS t3b#4, cast(t3c#2 as int) AS t3c#5]\n +- Project [t3a#0, 
t3b#1, t3c#2]\n+- SubqueryAlias t3\n   +- LocalRelation 
[t3a#0, t3b#1, t3c#2]\n"

Review Comment:
   Good Q. They are not stable when the tests run outside of my local machine, 
so to prevent the tests from becoming flaky I found it necessary to normalize 
the expression IDs. I added code in the `failAnalysis` overload that takes 
`TreeNode`s to do that, iterating through the trees to reset the expression 
IDs. This should be more robust than regex-replacing the string. We do have 
the existing `canonicalized` method on logical plans and expressions, but that 
uses `transformUp`, which is not allowed in the analyzer, so I ended up doing 
it separately here for this purpose.
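
   For reference, a minimal sketch of the regex-based alternative discussed 
here (roughly what the existing `SQLQueryTestHelper` does for golden files). 
This is illustrative only and is not the TreeNode-based code added in this PR:

   ```scala
   // Illustrative sketch: replace expression IDs such as `t2a#6` with a stable
   // `t2a#x` marker so the rendered plan no longer depends on the test run.
   object ExprIdNormalizer {
     def normalize(planString: String): String =
       planString.replaceAll("#\\d+", "#x")
   }

   // ExprIdNormalizer.normalize("min(outer(t2a#6)) AS min(outer(t2.t2a))#7")
   // returns "min(outer(t2a#x)) AS min(outer(t2.t2a))#x"
   ```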



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971113842


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -1563,10 +1563,12 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
 new AnalysisException(s"'$operation' does not support partitioning")
   }
 
-  def mixedRefsInAggFunc(funcStr: String): Throwable = {
-val msg = "Found an aggregate function in a correlated predicate that has 
both " +
-  "outer and local references, which is not supported: " + funcStr
-new AnalysisException(msg)
+  def mixedRefsInAggFunc(funcStr: String, origin: Origin): Throwable = {
+new AnalysisException(
+  errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  errorSubClass = "AGGREGATE_FUNCTION_MIXED_OUTER_LOCAL_REFERENCES",
+  origin = origin,
+  messageParameters = Map("function" -> funcStr))

Review Comment:
   Good Q.
   Looks like this is a `sql` string generated from a predicate which may 
contain multiple functions. It renders from the error-classes.json file like 
this:
   
   ```
   Found an aggregate function in a correlated predicate that has both outer 
and local references, which is not supported: 
   ```
   
   It might be better to leave it unquoted since some parts of the predicate 
may contain their own double-quotes, e.g. literal string values. I will leave 
it alone for now; Serge, please comment if you want it to work differently and 
I can update it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971108777


##
sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out:
##
@@ -100,12 +100,17 @@ WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Correlated column is not allowed in predicate (CAST(udf(cast(k as string)) AS 
STRING) = CAST(udf(cast(outer(k#x) as string)) AS STRING)):
-Aggregate [cast(udf(cast(max(cast(udf(cast(v#x as string)) as int)) as 
string)) as int) AS udf(max(udf(v)))#x]
-+- Filter (cast(udf(cast(k#x as string)) as string) = cast(udf(cast(outer(k#x) 
as string)) as string))
-   +- SubqueryAlias t2
-  +- View (`t2`, [k#x,v#x])
- +- Project [cast(k#x as string) AS k#x, cast(v#x as int) AS v#x]
-+- Project [k#x, v#x]
-   +- SubqueryAlias t2
-  +- LocalRelation [k#x, v#x]
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "CORRELATED_COLUMN_IS_NOT_ALLOWED_IN_PREDICATE",
+  "messageParameters" : {
+"planString" : ": (cast(udf(cast(k#0 as string)) as string) = 
cast(udf(cast(outer(k#1) as string)) as string))"

Review Comment:
   Good point, this comes from expressions rather than plans. Renamed all to 
`treeNode` instead of `planString`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


dtenedor commented on PR #37840:
URL: https://github.com/apache/spark/pull/37840#issuecomment-1247084014

   > Hi @dtenedor, I have just merged the bug fix 
https://github.com/apache/spark/pull/37861. The query context should be set 
correctly in the AnlysisException. I believe the test cases in this PR need to 
be updated so please try updating the code base to the latest master.
   
   @gengliangwang thanks, this is done. We can see the new query contexts in 
this PR at the latest commit; they appear more accurate now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sunchao commented on PR #37881:
URL: https://github.com/apache/spark/pull/37881#issuecomment-1247076334

   @viirya please see test cases added in 
[SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833).
   
   cc @sadikovi @cloud-fan @HyukjinKwon too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ucas010 commented on pull request #18748: [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel

2022-09-14 Thread GitBox


ucas010 commented on PR #18748:
URL: https://github.com/apache/spark/pull/18748#issuecomment-1247071158

   recommendForUserSubset could not be used
   why ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #37882: [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


dongjoon-hyun commented on PR #37882:
URL: https://github.com/apache/spark/pull/37882#issuecomment-1247068156

   cc @wangyum since he is a release manager for Apache Spark 3.3.1 and 
`v3.3.1-rc1` is already made.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sarutak opened a new pull request, #37883: [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


sarutak opened a new pull request, #37883:
URL: https://github.com/apache/spark/pull/37883

   ### What changes were proposed in this pull request?
   
   This PR removes `TimestampNTZ` from the doc about `TimeWindow` and 
`SessionWindow`.
   
   ### Why are the changes needed?
   
   As we discussed, it's better to hide `TimestampNTZ` from the doc.
   https://github.com/apache/spark/pull/35313#issuecomment-1185192162
   
   ### Does this PR introduce _any_ user-facing change?
   
   The document will be changed, but there is no compatibility problem.
   
   ### How was this patch tested?
   
   Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in 
the `doc` directory, then confirmed the generated HTML.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sarutak opened a new pull request, #37882: [SPARK-38017][FOLLOWUP] Hide TimestampNTZ in the doc

2022-09-14 Thread GitBox


sarutak opened a new pull request, #37882:
URL: https://github.com/apache/spark/pull/37882

   ### What changes were proposed in this pull request?
   
   This PR removes `TimestampNTZ` from the doc about `TimeWindow` and 
`SessionWindow`.
   
   ### Why are the changes needed?
   
   As we discussed, it's better to hide `TimestampNTZ` from the doc.
   https://github.com/apache/spark/pull/35313#issuecomment-1185192162
   
   ### Does this PR introduce _any_ user-facing change?
   
   The document will be changed, but there is no compatibility problem.
   
   ### How was this patch tested?
   
   Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in 
the `doc` directory, then confirmed the generated HTML.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao opened a new pull request, #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-14 Thread GitBox


sunchao opened a new pull request, #37881:
URL: https://github.com/apache/spark/pull/37881

   
   
   ### What changes were proposed in this pull request?
   
   
   Currently, in the Parquet V1 read path, Spark will push down data filters 
even if they have no reference in the Parquet read schema. This can cause 
correctness issues as described in 
[SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833).
   
   The root cause, it seems, is that in the V1 path we first use 
`AttributeReference` equality to separate the data columns from the partition 
columns, and then use `AttributeSet` equality to identify the filters that 
only reference data columns.
   These two steps are inconsistent.
   
   Take the following scenario as example:
   - data column: `[COL, a]`
   - partition column: `[col]`
   - filter: `col > 10`
   
   With `AttributeReference` equality, `COL` is not equal to `col`, so the 
resulting data column set is still `[COL, a]`. However, when determining which 
filters only reference data columns, `COL` **is considered equal** to `col`, 
so the filter `col > 10`, when checked against `[COL, a]`, is treated as 
referencing data columns and is therefore not removed.
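
   For illustration, a rough sketch of the invariant this describes, with 
placeholder names (`relationOutput`, `partitionColumns`, `dataFilters` are 
assumptions for the example, not the identifiers touched by this patch): 
derive the data-only columns and the pushable filters with the same 
attribute-set semantics, and only push down filters that actually reference 
the data schema.

   ```scala
   // Illustrative sketch only -- not the actual change in this PR.
   import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression}

   def pushableDataFilters(
       relationOutput: Seq[Attribute],
       partitionColumns: Seq[Attribute],
       dataFilters: Seq[Expression]): Seq[Expression] = {
     val partitionSet = AttributeSet(partitionColumns)
     // Data-only columns, computed with exprId-based set membership.
     val dataSet = AttributeSet(relationOutput.filterNot(partitionSet.contains))
     // Only keep filters whose references fall entirely inside the data schema.
     dataFilters.filter(f => f.references.nonEmpty && f.references.subsetOf(dataSet))
   }
   ```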
   
   ### Why are the changes needed?
   
   
   This fixes the correctness bug described in 
[SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833).
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   There are existing test cases for this issue, so this PR just reuses them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-14 Thread GitBox


MaxGekk commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r971046318


##
sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out:
##
@@ -100,12 +100,17 @@ WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Correlated column is not allowed in predicate (CAST(udf(cast(k as string)) AS 
STRING) = CAST(udf(cast(outer(k#x) as string)) AS STRING)):
-Aggregate [cast(udf(cast(max(cast(udf(cast(v#x as string)) as int)) as 
string)) as int) AS udf(max(udf(v)))#x]
-+- Filter (cast(udf(cast(k#x as string)) as string) = cast(udf(cast(outer(k#x) 
as string)) as string))
-   +- SubqueryAlias t2
-  +- View (`t2`, [k#x,v#x])
- +- Project [cast(k#x as string) AS k#x, cast(v#x as int) AS v#x]
-+- Project [k#x, v#x]
-   +- SubqueryAlias t2
-  +- LocalRelation [k#x, v#x]
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "CORRELATED_COLUMN_IS_NOT_ALLOWED_IN_PREDICATE",
+  "messageParameters" : {
+"planString" : ": (cast(udf(cast(k#0 as string)) as string) = 
cast(udf(cast(outer(k#1) as string)) as string))"

Review Comment:
   Is it a plan string? Let's rename it.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -1563,10 +1563,12 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
 new AnalysisException(s"'$operation' does not support partitioning")
   }
 
-  def mixedRefsInAggFunc(funcStr: String): Throwable = {
-val msg = "Found an aggregate function in a correlated predicate that has 
both " +
-  "outer and local references, which is not supported: " + funcStr
-new AnalysisException(msg)
+  def mixedRefsInAggFunc(funcStr: String, origin: Origin): Throwable = {
+new AnalysisException(
+  errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  errorSubClass = "AGGREGATE_FUNCTION_MIXED_OUTER_LOCAL_REFERENCES",
+  origin = origin,
+  messageParameters = Map("function" -> funcStr))

Review Comment:
   Should we quote the function? by `toSQLExpr` if it is an expression or maybe 
by `toSQLId`? cc @srielau 



##
sql/core/src/test/resources/sql-tests/results/subquery/negative-cases/invalid-correlation.sql.out:
##
@@ -105,14 +131,20 @@ WHERE  t1a IN (SELECT t2a
 struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
-Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
-Aggregate [min(outer(t2a#x)) AS min(outer(t2.t2a))#x]
-+- SubqueryAlias t3
-   +- View (`t3`, [t3a#x,t3b#x,t3c#x])
-  +- Project [cast(t3a#x as int) AS t3a#x, cast(t3b#x as int) AS t3b#x, 
cast(t3c#x as int) AS t3c#x]
- +- Project [t3a#x, t3b#x, t3c#x]
-+- SubqueryAlias t3
-   +- LocalRelation [t3a#x, t3b#x, t3c#x]
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY",
+  "errorSubClass" : "UNSUPPORTED_CORRELATED_REFERENCE",
+  "messageParameters" : {
+"planString" : ": Aggregate [min(outer(t2a#6)) AS 
min(outer(t2.t2a))#7]\n+- SubqueryAlias t3\n   +- View (`t3`, 
[t3a#3,t3b#4,t3c#5])\n  +- Project [cast(t3a#0 as int) AS t3a#3, cast(t3b#1 
as int) AS t3b#4, cast(t3c#2 as int) AS t3c#5]\n +- Project [t3a#0, 
t3b#1, t3c#2]\n+- SubqueryAlias t3\n   +- LocalRelation 
[t3a#0, t3b#1, t3c#2]\n"

Review Comment:
   > t2a#6
   
   Are the IDs stable? Should we replace them like
   
https://github.com/apache/spark/blob/db5aea60e4b22449d529f9e866366cfdc784b140/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala#L94
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #37877: [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage

2022-09-14 Thread GitBox


dongjoon-hyun closed pull request #37877: [SPARK-40423][K8S][TESTS] Add 
explicit YuniKorn queue submission test coverage
URL: https://github.com/apache/spark/pull/37877


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #37877: [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage

2022-09-14 Thread GitBox


dongjoon-hyun commented on PR #37877:
URL: https://github.com/apache/spark/pull/37877#issuecomment-1247016463

   Thank you, @viirya and @martin-g ! Merged to master/3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #37853: [SPARK-40404][DOCS] Add precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`

2022-09-14 Thread GitBox


mridulm commented on code in PR #37853:
URL: https://github.com/apache/spark/pull/37853#discussion_r971042246


##
docs/running-on-yarn.md:
##
@@ -852,8 +852,8 @@ The following extra configuration options are available 
when the shuffle service
   spark.shuffle.service.db.backend
   LEVELDB
   
-To specify the kind of disk-base store used in shuffle service state 
store, supports `LEVELDB` and `ROCKSDB` now 
-and `LEVELDB` as default value. 
+When work-preserving restart is enabled in YARN, this is used to specify 
the disk-base store used 
+in shuffle service state store, supports `LEVELDB` and `ROCKSDB` with 
`LEVELDB` as default value. 
 The original data store in `LevelDB/RocksDB` will not be automatically 
convert to another kind of storage now.

Review Comment:
   `convert` -> `converted` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api

2022-09-14 Thread GitBox


mridulm commented on code in PR #37843:
URL: https://github.com/apache/spark/pull/37843#discussion_r971037047


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2PredicateSuite.scala:
##
@@ -287,7 +287,7 @@ class V2PredicateSuite extends SparkFunSuite {
   new Predicate("=", Array[Expression](ref("a"), LiteralValue(1, 
IntegerType))),
   new Predicate("=", Array[Expression](ref("b"), LiteralValue(1, 
IntegerType
 assert(predicate1.equals(predicate2))
-assert(predicate1.references.map(_.describe()).toSeq == Seq("a", "b"))
+assert(predicate1.references.map(_.describe()).toSeq.sorted == Seq("a", 
"b"))
 assert(predicate1.describe.equals("(a = 1) OR (b = 1)"))

Review Comment:
   I might have missed @huaxingao's comment - if she is fine with it, looks 
good to me :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


mridulm commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971031243


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   Do we need this method anymore ? Why not replace with `Utils.toJavaMap` 
entirely (in `JavaTypeInference`) ?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala:
##
@@ -129,20 +131,19 @@ object ArrayBasedMapData {
   def toScalaMap(map: ArrayBasedMapData): Map[Any, Any] = {
 val keys = map.keyArray.asInstanceOf[GenericArrayData].array
 val values = map.valueArray.asInstanceOf[GenericArrayData].array
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toScalaMap(keys: scala.collection.Seq[Any],
   values: scala.collection.Seq[Any]): Map[Any, Any] = {
-keys.zip(values).toMap
+Utils.toMap(keys, values)
   }
 
   def toJavaMap(keys: Array[Any], values: Array[Any]): java.util.Map[Any, Any] 
= {
-import scala.collection.JavaConverters._
-keys.zip(values).toMap.asJava
+Utils.toJavaMap(keys, values)
   }

Review Comment:
   Do we need this method anymore ? Why not replace with `Utils.toJavaMap` 
entirely (in `JavaTypeInference`) ? Any issues with that ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #37876: [SPARK-40175][CORE][SQL][MLLIB][STREAMING] Optimize the performance of `keys.zip(values).toMap` code pattern

2022-09-14 Thread GitBox


mridulm commented on code in PR #37876:
URL: https://github.com/apache/spark/pull/37876#discussion_r971028877


##
core/src/main/scala/org/apache/spark/util/collection/Utils.scala:
##
@@ -62,4 +63,30 @@ private[spark] object Utils {
*/
   def sequenceToOption[T](input: Seq[Option[T]]): Option[Seq[T]] =
 if (input.forall(_.isDefined)) Some(input.flatten) else None
+
+  /**
+   * Same function as `keys.zip(values).toMap`, but has perf gain.
+   */
+  def toMap[K, V](keys: Iterable[K], values: Iterable[V]): Map[K, V] = {
+val builder = immutable.Map.newBuilder[K, V]
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  builder += (keyIter.next(), valueIter.next()).asInstanceOf[(K, V)]
+}
+builder.result()
+  }
+
+  /**
+   * Same function as `keys.zip(values).toMap.asJava`, but has perf gain.
+   */
+  def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): 
java.util.Map[K, V] = {
+val map = new java.util.HashMap[K, V]()
+val keyIter = keys.iterator
+val valueIter = values.iterator
+while (keyIter.hasNext && valueIter.hasNext) {
+  map.put(keyIter.next(), valueIter.next())
+}
+map

Review Comment:
   Make this immutable ?
   I believe the original code would have resulted in an immutable map ?
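
   For illustration, a minimal sketch of one way to keep that guarantee while 
still avoiding the intermediate zip: build the `HashMap` as in the diff above 
and hand back an unmodifiable view (a sketch, not a required implementation):

   ```scala
   import java.util.Collections

   def toJavaMap[K, V](keys: Iterable[K], values: Iterable[V]): java.util.Map[K, V] = {
     val map = new java.util.HashMap[K, V]()
     val keyIter = keys.iterator
     val valueIter = values.iterator
     while (keyIter.hasNext && valueIter.hasNext) {
       map.put(keyIter.next(), valueIter.next())
     }
     // Callers that try to mutate the result get UnsupportedOperationException,
     // matching the behaviour of the old keys.zip(values).toMap.asJava result.
     Collections.unmodifiableMap(map)
   }
   ```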



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


