[GitHub] [spark] nija-at commented on a diff in pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-19 Thread via GitHub


nija-at commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r573178


##
python/pyspark/sql/connect/client.py:
##
@@ -772,6 +772,7 @@ def __exit__(
 if isinstance(exc_val, BaseException):
 # Swallow the exception.
 if self._can_retry(exc_val):
+logger.debug(f"Server responded with retryable error: 
{exc_val}")

Review Comment:
   If the service is unreachable for a moment and if the retry is going to fix 
it, it's not an "error".
   It could be a warn but that'll pollute real warnings. I think debug makes 
the most sense.
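
   A minimal Scala sketch of that level-selection rule (hypothetical names, not
   the actual Python client code): a transient failure that will be retried is
   logged at debug, and only a failure that exhausts the retry budget is
   surfaced as an error.

       import java.util.logging.{Level, Logger}

       object RetryLogging {
         private val logger = Logger.getLogger("spark.connect.retries")

         // Hypothetical predicate; the real client inspects the gRPC status instead.
         private def canRetry(e: Throwable): Boolean = e.isInstanceOf[java.io.IOException]

         def logAttemptFailure(e: Throwable, attempt: Int, maxRetries: Int): Unit =
           if (canRetry(e) && attempt < maxRetries) {
             // Expected, transient, and about to be retried: debug keeps warnings clean.
             logger.log(Level.FINE, s"Server responded with retryable error: $e")
           } else {
             // Out of retries (or not retryable): a real, actionable error.
             logger.log(Level.SEVERE, s"Request failed after $attempt attempts: $e")
           }
       }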






[GitHub] [spark] nija-at commented on a diff in pull request #40066: [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-19 Thread via GitHub


nija-at commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r576506


##
python/pyspark/sql/connect/client.py:
##
@@ -632,7 +632,7 @@ def _execute(self, req: pb2.ExecutePlanRequest) -> None:
 Proto representation of the plan.
 
 """
-logger.info("Execute")
+logger.debug(f"Execute request: » {os.linesep}{req}{os.linesep} «")

Review Comment:
   Can you suggest a better alternative?






[GitHub] [spark] LuciferYang commented on a diff in pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


LuciferYang commented on code in PR #40091:
URL: https://github.com/apache/spark/pull/40091#discussion_r576062


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java:
##
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.codec.ZstandardCodec;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * This class implements a codec factory that is used when reading from Parquet. It adds a
+ * workaround for memory issues encountered when reading from zstd-compressed files. For
+ * details, see <a href="https://issues.apache.org/jira/browse/PARQUET-2160">PARQUET-2160</a>
+ */
+public class ParquetCodecFactory extends CodecFactory {

Review Comment:
   I think we should add a `TODO` to remind us to revert this file when updating the Parquet version.
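
   As context for why the file exists at all, the essence of the workaround is
   that zstd decompressors hold native (off-heap) buffers, so every instance
   borrowed from Hadoop's CodecPool must be returned or that memory is never
   reclaimed. A rough Scala sketch of the pattern (illustrative only, not the
   actual ParquetCodecFactory code):

       import org.apache.hadoop.io.compress.{CodecPool, CompressionCodec, Decompressor}

       // Borrow a decompressor, run the body, and always return the instance to the
       // pool so its native buffers can be reused instead of leaking off-heap memory.
       def decompressWith[T](codec: CompressionCodec)(body: Decompressor => T): T = {
         val decompressor = CodecPool.getDecompressor(codec)
         try {
           body(decompressor)
         } finally {
           CodecPool.returnDecompressor(decompressor)
         }
       }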
   
   






[GitHub] [spark] nija-at commented on a diff in pull request #40066: [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-19 Thread via GitHub


nija-at commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r573491


##
python/pyspark/sql/connect/client.py:
##
@@ -409,7 +409,7 @@ def __init__(
 self._builder = ChannelBuilder(connectionString, channelOptions)
 self._user_id = None
 self._retry_policy = {
-"max_retries": 15,
+"max_retries": 4,

Review Comment:
   See PR description.
   
   4 retries keeps the retry loop to 4 seconds which feels more reasonable than 
going down to 3.
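
   As a rough sanity check of that number, under a simple exponential backoff
   policy (the parameters below are illustrative assumptions, not the client's
   actual defaults):

       // Worst-case total wait across max_retries=4 attempts.
       val initialBackoffMs = 250.0
       val multiplier = 2.0
       val totalMs = (0 until 4).map(i => initialBackoffMs * math.pow(multiplier, i)).sum
       println(f"worst-case retry wait: ${totalMs / 1000}%.2f s") // ~3.75 s, on the order of 4 s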






[GitHub] [spark] nija-at commented on a diff in pull request #40066: [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-19 Thread via GitHub


nija-at commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r573771


##
python/pyspark/sql/connect/client.py:
##
@@ -632,7 +632,7 @@ def _execute(self, req: pb2.ExecutePlanRequest) -> None:
 Proto representation of the plan.
 
 """
-logger.info("Execute")
+logger.debug(f"Execute request: » {os.linesep}{req}{os.linesep} «")

Review Comment:
   curious: why?












[GitHub] [spark] pan3793 commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


pan3793 commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1436414874

   I verified this patch by scanning a large parquet/zstd table and updated the 
PR description.





[GitHub] [spark] pan3793 commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


pan3793 commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1436411798

   > branch-3.3/3.2 use parquet 1.12.2; if this fix is accepted, would you mind submitting a PR for these two branches? @pan3793
   
   sure.





[GitHub] [spark] HyukjinKwon commented on pull request #39991: [SPARK-42419][CONNECT][PYTHON] Migrate into error framework for Spark Connect Column API.

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #39991:
URL: https://github.com/apache/spark/pull/39991#issuecomment-1436398640

   Still fails. Mind fixing them please?





[GitHub] [spark] LuciferYang commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1436389833

   branch-3.3/3.2 use parquet 1.12.2; if this fix is accepted, would you mind submitting a PR for these two branches? @pan3793
   
   





[GitHub] [spark] ninebigbig commented on pull request #39967: [SPARK-42395][K8S]The code logic of the configmap max size validation lacks extra content

2023-02-19 Thread via GitHub


ninebigbig commented on PR #39967:
URL: https://github.com/apache/spark/pull/39967#issuecomment-1436386864

   Can anyone else help review this PR? @viirya 
   Thanks!





[GitHub] [spark-docker] Yikun commented on pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-19 Thread via GitHub


Yikun commented on PR #30:
URL: https://github.com/apache/spark-docker/pull/30#issuecomment-1436385782

   cc @viirya @HyukjinKwon 





[GitHub] [spark] itholic opened a new pull request, #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-19 Thread via GitHub


itholic opened a new pull request, #40092:
URL: https://github.com/apache/spark/pull/40092

   ### What changes were proposed in this pull request?
   
   This PR proposes to add "Live Notebook: DataFrame with Spark Connect" to the
[Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)
documentation, as shown below:
   
   https://user-images.githubusercontent.com/44108233/220023031-e3e6bc66-cda9-422d-89bd-cf4b8b322d17.png
   
   Basically, the notebook copied the contents of 'Live Notebook: DataFrame', 
and updated the contents related to Spark Connect.
   
   The Notebook looks like the below:
   
   https://user-images.githubusercontent.com/44108233/220023726-dac3e0cc-766a-4a4e-b850-099e07dfda98.png
   
   ### Why are the changes needed?
   
   To help those who are new to Spark Connect quickly get started using DataFrame with Spark Connect.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's documentation.
   
   
   ### How was this patch tested?
   
   
   Manually built the docs, and ran the CI.
   





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40090: [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40090:
URL: https://github.com/apache/spark/pull/40090#discussion_r488834


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:
##
@@ -2133,6 +2134,30 @@ abstract class ParquetFilterSuite extends QueryTest with 
ParquetTest with Shared
   }
 }
   }
+
+  test("SPARK-41741: StringStartsWith should encode the string using the UTF_8 
charset") {
+// A hacky way to set the default Java character encoding.
+def setDefaultEncoding(charset: Charset): Unit = {
+  System.setProperty("file.encoding", charset.name())
+  val defaultCharsetField = classOf[Charset].getDeclaredField("defaultCharset")
+  defaultCharsetField.setAccessible(true)

Review Comment:
   I think it's fine without this test - It's too hacky.






[GitHub] [spark] itholic commented on pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


itholic commented on PR #40067:
URL: https://github.com/apache/spark/pull/40067#issuecomment-1436379916

   Thanks, @HyukjinKwon !





[GitHub] [spark] LuciferYang commented on pull request #40082: [SPARK-42488][BUILD] Upgrade commons-crypto to 1.2.0

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40082:
URL: https://github.com/apache/spark/pull/40082#issuecomment-1436378739

   From the release notes, nothing is lost.
   
   `OpenSsl20XNativeJna`, which is not mentioned in the release notes, is newly supported in version 1.2.0 (because jna is excluded, it cannot be used in Spark).
   
   





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40089:
URL: https://github.com/apache/spark/pull/40089#discussion_r486490


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -2238,6 +2238,1359 @@ object functions {
*/
   def radians(columnName: String): Column = radians(Column(columnName))
 
+  //////////////////////////////////////////////////////////////////////////////////////////////
+  // Misc functions
+  //////////////////////////////////////////////////////////////////////////////////////////////
+
+  /**
+   * Calculates the MD5 digest of a binary column and returns the value as a 
32 character hex
+   * string.
+   *
+   * @group misc_funcs
+   * @since 1.5.0

Review Comment:
   yeah let's change `since`
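
   For concreteness, the `@since` tags were copied from sql/core's
   functions.scala and still carry the Dataset API's versions; the doc above
   would become something like (assuming 3.4.0, the release this client
   targets, is the right value):

       /**
        * Calculates the MD5 digest of a binary column and returns the value as a 32 character hex
        * string.
        *
        * @group misc_funcs
        * @since 3.4.0
        */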






[GitHub] [spark] pan3793 commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


pan3793 commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1436374948

   Based on 
https://github.com/apache/parquet-mr/pull/982#issuecomment-1376750703, I guess 
that the Parquet community may think it's not a critical issue, but in my case, 
it's critical.





[GitHub] [spark] pan3793 opened a new pull request, #40091: [SPARK-41952] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-19 Thread via GitHub


pan3793 opened a new pull request, #40091:
URL: https://github.com/apache/spark/pull/40091

   
   
   ### What changes were proposed in this pull request?
   
   SPARK-41952 was raised a while ago, but unfortunately the Parquet community has not published the patched version yet; as a workaround, we can fix the issue on the Spark side first.
   
   We encountered this memory issue when migrating data from parquet/snappy to 
parquet/zstd, Spark executors always occupy unreasonable off-heap memory and 
have a high risk of being killed by NM.
   
   See more discussions at https://github.com/apache/parquet-mr/pull/982 and 
https://github.com/apache/iceberg/pull/5681
   
   ### Why are the changes needed?
   
   The issue is fixed in the parquet community 
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but the 
patched version is not available yet.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it's a bug fix.
   
   ### How was this patch tested?
   
   The existing UTs should cover the correctness check; I'm going to run it w/ a large table scan to validate that the memory leak is fixed.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-19 Thread via GitHub


dongjoon-hyun commented on code in PR #40089:
URL: https://github.com/apache/spark/pull/40089#discussion_r478919


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -2238,6 +2238,1359 @@ object functions {
*/
   def radians(columnName: String): Column = radians(Column(columnName))
 
+  //////////////////////////////////////////////////////////////////////////////////////////////
+  // Misc functions
+  //////////////////////////////////////////////////////////////////////////////////////////////
+
+  /**
+   * Calculates the MD5 digest of a binary column and returns the value as a 32 character hex
+   * string.
+   *
+   * @group misc_funcs
+   * @since 1.5.0

Review Comment:
   Ur, I'm a little confused. Is it correct, @hvanhovell and @zhengruifeng ?
   
   cc @HyukjinKwon 






[GitHub] [spark] dongjoon-hyun commented on pull request #40080: [SPARK-42406]Fix check for missing required fields.

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40080:
URL: https://github.com/apache/spark/pull/40080#issuecomment-1436363361

   Thank you for updates, @rangadi .
   
   cc @hvanhovell 





[GitHub] [spark] rangadi commented on pull request #40080: [SPARK-42406]Fix check for missing required fields.

2023-02-19 Thread via GitHub


rangadi commented on PR #40080:
URL: https://github.com/apache/spark/pull/40080#issuecomment-1436361799

   @dongjoon-hyun, done. Removed 3.4 in the summary.
   
   > @rangadi with this change, we can conclude that spark-protobuf supports 
proto2 and proto3 versions right?
   
   Yes. And we always supported Proto2. 
   





[GitHub] [spark] yaooqinn commented on pull request #40036: [SPARK-42448][SQL] Fix spark sql shell prompt for current db

2023-02-19 Thread via GitHub


yaooqinn commented on PR #40036:
URL: https://github.com/apache/spark/pull/40036#issuecomment-1436361416

   kindly ping @HyukjinKwon 





[GitHub] [spark] wangyum commented on pull request #39855: [SPARK-42286][SQL] Fallback to previous codegen code path for complex expr with CAST

2023-02-19 Thread via GitHub


wangyum commented on PR #39855:
URL: https://github.com/apache/spark/pull/39855#issuecomment-1436356362

   @RunyaoChen Could you backport this fix to branch-3.3 to fix SPARK-42473?





[GitHub] [spark] wangyum opened a new pull request, #40090: [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters

2023-02-19 Thread via GitHub


wangyum opened a new pull request, #40090:
URL: https://github.com/apache/spark/pull/40090

   ### What changes were proposed in this pull request?
   
   This PR makes `ParquetFilters` encode strings using the `UTF_8` charset.
   
   ### Why are the changes needed?
   
   Fix a data issue where the default charset is not `UTF_8`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
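
   A self-contained sketch of the pitfall (illustrative, not the actual
   ParquetFilters code): `String.getBytes` with no argument uses the JVM's
   default charset, so pushed-down filter bytes only match Parquet's UTF-8
   binary values when that default happens to be UTF-8.

       import java.nio.charset.StandardCharsets

       val prefix = "café" // a non-ASCII filter value, e.g. a StringStartsWith prefix
       val platformBytes = prefix.getBytes // depends on file.encoding
       val utf8Bytes = prefix.getBytes(StandardCharsets.UTF_8) // always matches Parquet
       // On a non-UTF-8 JVM (e.g. -Dfile.encoding=ISO-8859-1) the two differ:
       println(java.util.Arrays.equals(platformBytes, utf8Bytes))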





[GitHub] [spark] dongjoon-hyun commented on pull request #40082: [SPARK-42488][BUILD] Upgrade commons-crypto to 1.2.0

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40082:
URL: https://github.com/apache/spark/pull/40082#issuecomment-1436336072

   BTW, I have one question, @LuciferYang. We explicitly exclude `jna` from `commons-crypto`. Do you know what we are losing from the enumerated items in the `commons-crypto` release notes?





[GitHub] [spark] LuciferYang commented on pull request #40082: [SPARK-42488][BUILD] Upgrade commons-crypto to 1.2.0

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40082:
URL: https://github.com/apache/spark/pull/40082#issuecomment-1436333698

   Thanks @dongjoon-hyun @srowen 





[GitHub] [spark] dongjoon-hyun closed pull request #40082: [SPARK-42488][BUILD] Upgrade commons-crypto to 1.2.0

2023-02-19 Thread via GitHub


dongjoon-hyun closed pull request #40082: [SPARK-42488][BUILD] Upgrade 
commons-crypto to 1.2.0
URL: https://github.com/apache/spark/pull/40082





[GitHub] [spark] zhengruifeng commented on pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals

2023-02-19 Thread via GitHub


zhengruifeng commented on PR #40076:
URL: https://github.com/apache/spark/pull/40076#issuecomment-1436327185

   Late LGTM





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r444363


##
python/pyspark/sql/window.py:
##
@@ -207,6 +210,9 @@ def rowsBetween(start: int, end: int) -> "WindowSpec":
 
 .. versionadded:: 2.1.0
 
+.. versionchanged:: 3.4.0

Review Comment:
   ditto I believe this is already mentioned above in the class doc.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r444023


##
python/pyspark/sql/session.py:
##
@@ -303,6 +306,9 @@ def master(self, master: str) -> "SparkSession.Builder":
 
 .. versionadded:: 2.0.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.

Review Comment:
   Let's remove this. Spark Connect doesn't need a master, and they don't work together.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r443902


##
python/pyspark/sql/readwriter.py:
##
@@ -1934,6 +1934,9 @@ def using(self, provider: str) -> "DataFrameWriterV2":
 """
 Specifies a provider for the underlying output data source.
 Spark's default catalog supports "parquet", "json", etc.
+
+.. versionchanged:: 3.4.0
+Support Spark Connect.

Review Comment:
   ditto `.. versionchanged:: 3.4.0` is already mentioned above in the class






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r443581


##
python/pyspark/sql/connect/session.py:
##
@@ -106,9 +106,13 @@ def config(
 def master(self, master: str) -> "SparkSession.Builder":
 return self
 
+master.__doc__ = PySparkSession.builder.master.__doc__
+

Review Comment:
   let's remove these too. We should actually remove these methods later.



##
python/pyspark/sql/connect/session.py:
##
@@ -106,9 +106,13 @@ def config(
 def master(self, master: str) -> "SparkSession.Builder":
 return self
 
+master.__doc__ = PySparkSession.builder.master.__doc__
+
 def appName(self, name: str) -> "SparkSession.Builder":
 return self.config("spark.app.name", name)
 
+appName.__doc__ = PySparkSession.builder.appName.__doc__

Review Comment:
   ditto






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r443135


##
python/pyspark/sql/connect/client.py:
##
@@ -203,6 +203,8 @@ def metadata(self) -> Iterable[Tuple[str, str]]:
 parameters will be converted to metadata except ones that are 
explicitly used
 by the channel.
 
+.. versionadded:: 3.4.0

Review Comment:
   `versionadded:: 3.4.0` is in the class already, so it's fine to remove this one.



##
python/pyspark/sql/connect/client.py:
##
@@ -440,8 +450,12 @@ def register_udf(
 eval_type: int = PythonEvalType.SQL_BATCHED_UDF,
 deterministic: bool = True,
 ) -> str:
-"""Create a temporary UDF in the session catalog on the other side. We 
generate a
-temporary name for it."""
+"""
+Create a temporary UDF in the session catalog on the other side. We 
generate a
+temporary name for it.
+
+.. versionadded:: 3.4.0

Review Comment:
   ditto let's remove them






[GitHub] [spark] HyukjinKwon closed pull request #40088: [SPARK-42427][SQL][TESTS][FOLLOW-UP] Remove duplicate overflow test for conv

2023-02-19 Thread via GitHub


HyukjinKwon closed pull request #40088: [SPARK-42427][SQL][TESTS][FOLLOW-UP] 
Remove duplicate overflow test for conv
URL: https://github.com/apache/spark/pull/40088





[GitHub] [spark] HyukjinKwon commented on pull request #40088: [SPARK-42427][SQL][TESTS][FOLLOW-UP] Remove duplicate overflow test for conv

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40088:
URL: https://github.com/apache/spark/pull/40088#issuecomment-1436312194

   Merged to master and branch-3.4.





[GitHub] [spark] itholic commented on pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-19 Thread via GitHub


itholic commented on PR #40067:
URL: https://github.com/apache/spark/pull/40067#issuecomment-1436282392

   cc @HyukjinKwon mind taking a look when you find some time?





[GitHub] [spark] hvanhovell opened a new pull request, #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-19 Thread via GitHub


hvanhovell opened a new pull request, #40089:
URL: https://github.com/apache/spark/pull/40089

   ### What changes were proposed in this pull request?
   This PR adds the following functions to the scala client:
   - Misc functions.
   - String functions.
   - Date/Time functions.
   
   ### Why are the changes needed?
   We want to be able to use the same APIs in the Scala client as in the original Dataset API.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds new functions to the Spark Connect Scala client.
   
   
   ### How was this patch tested?
   Added tests to `PlanGenerationTestSuite` (and indirectly to `ProtoToPlanTestSuite`). Overloads are tested in `FunctionTestSuite`.





[GitHub] [spark] LuciferYang commented on pull request #40084: [SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 3.22.0

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40084:
URL: https://github.com/apache/spark/pull/40084#issuecomment-1436279676

   Updated the PR description, listing bug fixes and possibly useful improvements.





[GitHub] [spark-docker] Yikun closed pull request #29: Test on 3.3.2-rc1

2023-02-19 Thread via GitHub


Yikun closed pull request #29: Test on 3.3.2-rc1
URL: https://github.com/apache/spark-docker/pull/29





[GitHub] [spark-docker] Yikun opened a new pull request, #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-19 Thread via GitHub


Yikun opened a new pull request, #30:
URL: https://github.com/apache/spark-docker/pull/30

   ### What changes were proposed in this pull request?
   Add Apache Spark 3.3.2 Dockerfiles.
   - Add 3.3.2 GPG key
   - Add .github/workflows/build_3.3.2.yaml
   - ./add-dockerfiles.sh 3.3.2
   
   ### Why are the changes needed?
   Apache Spark 3.3.2 released.
   
   https://lists.apache.org/thread/k8skf16wyn6rg9n0vd0t6l3bhw7c9svq
   
   ### Does this PR introduce _any_ user-facing change?
   Yes: a new image will be published in the future (after the DOI review).
   
   ### How was this patch tested?
   Added the workflow, and CI passed.
   





[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r415309


##
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala:
##
@@ -2266,6 +2270,150 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with PrivateMethodTe
 }
   }
 
+  test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache 
visibility statue") {
+val store = makeBlockManager(8000, "executor1")
+val blockId = RDDBlockId(rddId = 1, splitIndex = 1)
+var computed: Boolean = false
+val data = Seq(1, 2, 3)
+val makeIterator = () => {
+  computed = true
+  data.iterator
+}
+
+// Cache doesn't exist and is not visible.
+assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId))
+val res1 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Put cache successfully and reported block task info.
+assert(res1.isLeft && computed)
+verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1)
+
+// Cache exists but not visible.
+computed = false
+assert(store.getStatus(blockId).nonEmpty && !store.isRDDBlockVisible(blockId))
+val res2 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Load cache successfully and reported block task info.
+assert(res2.isLeft && computed)
+verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1)
+
+// Cache exists and visible.
+store.blockInfoManager.tryAddVisibleBlock(blockId)
+computed = false
+assert(store.getStatus(blockId).nonEmpty && store.isRDDBlockVisible(blockId))
+val res3 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Load cache successfully but not report block task info.
+assert(res3.isLeft && !computed)
+verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1)
+  }
+
+  test("add block rdd visibility status") {

Review Comment:
   Please add the JIRA number for the new tests.






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r414367


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging {
*/
   private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, 
BlockInfoWrapper]
 
+  /**
+   * Record visible rdd blocks stored in the block manager, entries will be removed
+   * by [[removeBlock()]]
+   */
+  private[spark] val visibleRDDBlocks = ConcurrentHashMap.newKeySet[RDDBlockId]

Review Comment:
   It's weird to me that we use the opposite visibility tracking logic between 
the driver and executor. Couldn't we use the same "invisible" logic at the 
executor too?






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r412825


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+
+case UpdateRDDBlockTaskInfo(blockId, taskId) =>
+  // This is to report the information that a rdd block(with `blockId`) is 
computed
+  // and cached by task(with `taskId`). The happens right after the task 
finished
+  // computing/caching the block only when the block is not visible yet. 
And the rdd
+  // block will be marked as visible only when the corresponding task 
finished successfully.
+  context.reply(updateRDDBlockTaskInfo(blockId, taskId))
+
+case GetRDDBlockVisibility(blockId) =>
+  // Get the visibility status of a specific rdd block.
+  if (!trackingCacheVisibility) {

Review Comment:
   Could you push down the flag check into `isRDDBlockVisible` as well?
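
   A sketch of the suggested push-down (condensed, with hypothetical types;
   the field names follow the diff and review thread, not the actual change):
   with the flag check inside the method, callers never consult the flag
   themselves.

       import java.util.concurrent.ConcurrentHashMap

       class VisibilityTracker(trackingCacheVisibility: Boolean) {
         private val blockLocations = new ConcurrentHashMap[String, AnyRef]()
         private val invisibleRDDBlocks = ConcurrentHashMap.newKeySet[String]()

         def isRDDBlockVisible(blockId: String): Boolean =
           if (!trackingCacheVisibility) {
             true // feature disabled: every cached block counts as visible
           } else {
             blockLocations.containsKey(blockId) && !invisibleRDDBlocks.contains(blockId)
           }
       }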






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r402068


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+
+case UpdateRDDBlockTaskInfo(blockId, taskId) =>
+  // This is to report the information that a rdd block(with `blockId`) is computed
+  // and cached by task(with `taskId`). The happens right after the task finished

Review Comment:
   "The" -> "This"?






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r400497


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -77,6 +78,11 @@ class BlockManagerMasterEndpoint(
   // Mapping from block id to the set of block managers that have the block.
  private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]

+  // Mapping from task id to the set of rdd blocks which are generated from the task.
+  private val tidToRddBlockIds = new mutable.HashMap[Long, mutable.HashSet[RDDBlockId]]
+  // Record the visible RDD blocks which have been generated at least from one successful task.

Review Comment:
   visible? but the variable name is invisible.









[GitHub] [spark] ulysses-you commented on pull request #39475: [SPARK-41959][SQL] Improve v1 writes with empty2null

2023-02-19 Thread via GitHub


ulysses-you commented on PR #39475:
URL: https://github.com/apache/spark/pull/39475#issuecomment-1436207357

   @cloud-fan any comments ? thank you





[GitHub] [spark] allanf-db commented on pull request #40087: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


allanf-db commented on PR #40087:
URL: https://github.com/apache/spark/pull/40087#issuecomment-1436200153

   Attaching screenshot illustrating the change.
   Current page on the left - Scala code example on the first tab.
   New page on the right - Python code example on the first tab.
   
   https://user-images.githubusercontent.com/112507318/219992905-c4c079d8-ae3f-4462-b41e-2de38d83e6d8.png
   





[GitHub] [spark] LuciferYang commented on pull request #40081: [SPARK-42487][BUILD] Upgrade Netty to 4.1.89

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40081:
URL: https://github.com/apache/spark/pull/40081#issuecomment-1436186506

   Thanks @srowen @dongjoon-hyun 





[GitHub] [spark] HyukjinKwon commented on pull request #40088: [SPARK-42427][SQL][TESTS] Remove duplicate overflow test for conv

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40088:
URL: https://github.com/apache/spark/pull/40088#issuecomment-1436181584

   cc @gengliangwang 





[GitHub] [spark] HyukjinKwon opened a new pull request, #40088: [SPARK-42427][SQL][TESTS] Remove duplicate overflow test for conv

2023-02-19 Thread via GitHub


HyukjinKwon opened a new pull request, #40088:
URL: https://github.com/apache/spark/pull/40088

   ### What changes were proposed in this pull request?
   
   This PR proposes to remove a duplicate test (see https://github.com/apache/spark/commit/cb463fb40e8f663b7e3019c8d8560a3490c241d0).
   
   ### Why are the changes needed?
   
   This test fails with ANSI mode on https://github.com/apache/spark/actions/runs/4213931226/jobs/7314033662, and it's a duplicate, so we should remove it.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually tested.





[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r377539


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1424,6 +1470,28 @@ private[spark] class BlockManager(
 blockStoreUpdater.save()
   }
 
+  // Check whether a rdd block is visible or not.
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+// Cached blocks are always visible if the feature flag is disabled.
+if (!trackingCacheVisibility) {
+  return true
+}
+
+// If the rdd block visibility information not available in the block manager,
+// asking master for the information.
+if (blockInfoManager.isRDDBlockVisible(blockId)) {
+  return true
+}
+
+if(master.isRDDBlockVisible(blockId)) {
+  // Cache the visibility status if block exists.
+  blockInfoManager.tryAddVisibleBlock(blockId)
+  true

Review Comment:
   Should we still return `true` if `blockInfoManager` doesn't contain the 
block?






[GitHub] [spark] zhengruifeng commented on pull request #40013: [WIP][SPARK-42367][CONNECT][PYTHON] `DataFrame.drop` should handle duplicated columns properly

2023-02-19 Thread via GitHub


zhengruifeng commented on PR #40013:
URL: https://github.com/apache/spark/pull/40013#issuecomment-1436177852

   > @zhengruifeng what is the current problem with drop?
   
   The existing implementation in the Python client follows PySpark's behavior of always converting a column name to a column, which leads to this problem: https://issues.apache.org/jira/browse/SPARK-42444
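
   For reference, a minimal sketch of the symptom (assuming a local SparkSession; in this reading of SPARK-42444, dropping by name handles duplicated columns, while converting the name to a `Column` first makes the reference ambiguous):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.utils import AnalysisException

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.range(3).toDF("a")
    df2 = spark.range(3).toDF("a")
    joined = df1.join(df2, df1["a"] == df2["a"])  # two columns named "a"

    print(joined.drop("a").columns)  # dropping by name removes every matching column
    try:
        joined.drop(col("a"))        # name converted to a Column: ambiguous reference
    except AnalysisException as e:
        print(e)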






[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1436165602

   Oh, I missed that you said `branch-3.4`. Are you sure?
   > I think we can have this small update now to master and branch 3.4. When 
3.4 is released, we can try to update to a newer version.
   
   If you want to claim this PR as a blocker, you need to do that properly. In general, you should do at least two things:
   - Ping @xinrong-meng, in order to inform her of the existence of the blocker.
   - Update SPARK-42486 properly. Currently, the JIRA Affected Version is quite the opposite of your comment (or intention) here.
   
   ![Screenshot 2023-02-19 at 5 15 05 
PM](https://user-images.githubusercontent.com/9700541/219988088-7bc0c346-4ef3-41ce-91cf-0f83b57868f5.png)
   





[GitHub] [spark] dongjoon-hyun commented on pull request #40087: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40087:
URL: https://github.com/apache/spark/pull/40087#issuecomment-1436162671

   BTW, I changed the affected version of this JIRA from 3.4.0 to 3.5.0.
   I don't think Apache Spark 3.4 needs this in any case.





[GitHub] [spark] dongjoon-hyun commented on pull request #40087: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40087:
URL: https://github.com/apache/spark/pull/40087#issuecomment-1436161884

   Thank you, @allanf-db and @HyukjinKwon .





[GitHub] [spark] allanf-db commented on pull request #40087: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


allanf-db commented on PR #40087:
URL: https://github.com/apache/spark/pull/40087#issuecomment-1436161575

   Absolutely! Will start a thread @dongjoon-hyun.





[GitHub] [spark] HyukjinKwon commented on pull request #40087: [SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40087:
URL: https://github.com/apache/spark/pull/40087#issuecomment-1436158404

   +1 to raising a discussion thread on the mailing list





[GitHub] [spark] dongjoon-hyun closed pull request #40081: [SPARK-42487][BUILD] Upgrade Netty to 4.1.89

2023-02-19 Thread via GitHub


dongjoon-hyun closed pull request #40081: [SPARK-42487][BUILD] Upgrade Netty to 
4.1.89 
URL: https://github.com/apache/spark/pull/40081





[GitHub] [spark] allanf-db opened a new pull request, #40087: [WIP][SPARK-42493][DOCS][PYTHON] Make Python the first tab for code examples - Spark SQL, DataFrames and Datasets Guide

2023-02-19 Thread via GitHub


allanf-db opened a new pull request, #40087:
URL: https://github.com/apache/spark/pull/40087

   
   
   ### What changes were proposed in this pull request?
   Making Python the first tab for code examples in the Spark SQL, DataFrames 
and Datasets Guide.
   
   
   
   ### Why are the changes needed?
   Python is the most approachable and most popular language, so this change moves it to the first tab (shown by default) for code examples.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the user facing Spark documentation is updated.
   
   
   
   ### How was this patch tested?
   I built the website locally and manually tested the pages.
   
   





[GitHub] [spark] HyukjinKwon commented on pull request #40066: [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40066:
URL: https://github.com/apache/spark/pull/40066#issuecomment-1436149452

   I think we should better file a JIRA.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40066: [CONNECT] [PYTHON] reduce spark connect service retries

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r356716


##
python/pyspark/sql/connect/client.py:
##
@@ -409,7 +409,7 @@ def __init__(
 self._builder = ChannelBuilder(connectionString, channelOptions)
 self._user_id = None
 self._retry_policy = {
-"max_retries": 15,
+"max_retries": 4,

Review Comment:
   I don't mind this number, but I'm just curious: is there a reason to pick 4? I was thinking 3 is sort of more common :-).









[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40066: [CONNECT] [PYTHON] reduce spark connect service retries

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r356494


##
python/pyspark/sql/connect/client.py:
##
@@ -772,6 +772,7 @@ def __exit__(
 if isinstance(exc_val, BaseException):
 # Swallow the exception.
 if self._can_retry(exc_val):
+logger.debug(f"Server responded with retryable error: 
{exc_val}")

Review Comment:
   Shall we also mention that we're retrying after this error? The message itself reads as if it has to be either an error or a warning.
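
   For illustration, a minimal sketch (hypothetical names, not this PR's actual code) of a bounded retry loop whose message says explicitly that a retry will follow, logged at DEBUG so transient failures stay out of WARN/ERROR output:

    import logging
    import random
    import time

    logger = logging.getLogger(__name__)

    def call_with_retries(call, can_retry, max_retries=4, initial_backoff=0.05):
        # `call` is a zero-argument thunk; `can_retry` is a predicate over exceptions.
        for attempt in range(max_retries + 1):
            try:
                return call()
            except Exception as e:
                if attempt == max_retries or not can_retry(e):
                    raise
                logger.debug("Retryable error (attempt %d/%d), retrying: %s",
                             attempt + 1, max_retries, e)
                # Exponential backoff with a little jitter between attempts.
                time.sleep(initial_backoff * (2 ** attempt) + random.uniform(0, 0.01))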






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40066: [CONNECT] [PYTHON] reduce spark connect service retries

2023-02-19 Thread via GitHub


HyukjinKwon commented on code in PR #40066:
URL: https://github.com/apache/spark/pull/40066#discussion_r356271


##
python/pyspark/sql/connect/client.py:
##
@@ -632,7 +632,7 @@ def _execute(self, req: pb2.ExecutePlanRequest) -> None:
 Proto representation of the plan.
 
 """
-logger.info("Execute")
+logger.debug(f"Execute request: » {os.linesep}{req}{os.linesep} «")

Review Comment:
   I would avoid characters such as `»` that are not used in the current codebase.
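
   One plain-ASCII alternative, as a sketch (`logger` and `req` stand in for the names used in the diff above):

    import logging
    import os

    logger = logging.getLogger(__name__)
    req = "<ExecutePlanRequest placeholder>"  # stand-in for the proto message

    # Plain delimiters instead of `»`/`«`:
    logger.debug("Execute request:%s%s", os.linesep, req)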






[GitHub] [spark] HyukjinKwon closed pull request #40086: [MINOR][SQL] Fix typo and whitespaces in SQLConf

2023-02-19 Thread via GitHub


HyukjinKwon closed pull request #40086: [MINOR][SQL] Fix typo and whitespaces 
in SQLConf
URL: https://github.com/apache/spark/pull/40086





[GitHub] [spark] HyukjinKwon commented on pull request #40086: [MINOR][SQL] Fix typo and whitespaces in SQLConf

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40086:
URL: https://github.com/apache/spark/pull/40086#issuecomment-1436148025

   Merged to master.





[GitHub] [spark] github-actions[bot] closed pull request #38592: [SPARK-41088][SQL] Add PartialAggregate and FinalAggregate logic operators

2023-02-19 Thread via GitHub


github-actions[bot] closed pull request #38592: [SPARK-41088][SQL] Add 
PartialAggregate and FinalAggregate logic operators
URL: https://github.com/apache/spark/pull/38592





[GitHub] [spark] HyukjinKwon closed pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals

2023-02-19 Thread via GitHub


HyukjinKwon closed pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the 
alias name for numpy literals
URL: https://github.com/apache/spark/pull/40076





[GitHub] [spark] HyukjinKwon commented on pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40076:
URL: https://github.com/apache/spark/pull/40076#issuecomment-1436140411

   Merged to master and branch-3.4.





[GitHub] [spark] HyukjinKwon closed pull request #40071: [SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for DataFrame.write

2023-02-19 Thread via GitHub


HyukjinKwon closed pull request #40071: 
[SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for 
DataFrame.write
URL: https://github.com/apache/spark/pull/40071





[GitHub] [spark] HyukjinKwon commented on pull request #40071: [SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for DataFrame.write

2023-02-19 Thread via GitHub


HyukjinKwon commented on PR #40071:
URL: https://github.com/apache/spark/pull/40071#issuecomment-1436139937

   Merged to master and branch-3.4.





[GitHub] [spark] AndreyBozhko opened a new pull request, #40086: [MINOR][SQL] Fix typo and whitespaces in SQLConf

2023-02-19 Thread via GitHub


AndreyBozhko opened a new pull request, #40086:
URL: https://github.com/apache/spark/pull/40086

   
   
   ### What changes were proposed in this pull request?
   
   - Fix `SIMPLIFIEID` -> `SIMPLIFIED`
   - Fix indentation and whitespaces around a few `val` definitions
   
   ### Why are the changes needed?
   
   Fix typo and code formatting
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing tests pass





[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-19 Thread via GitHub


dongjoon-hyun commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1436104835

   Got it. Thank you for the clarification. Sounds safe.





[GitHub] [spark] grundprinzip commented on a diff in pull request #40063: [CONNECT] Eager Execution of DF.sql()

2023-02-19 Thread via GitHub


grundprinzip commented on code in PR #40063:
URL: https://github.com/apache/spark/pull/40063#discussion_r325958


##
python/setup.py:
##
@@ -297,7 +297,15 @@ def run(self):
 license="http://www.apache.org/licenses/LICENSE-2.0";,
 # Don't forget to update python/docs/source/getting_started/install.rst
 # if you're updating the versions or dependencies.
-install_requires=["py4j==0.10.9.7"],
+install_requires=[

Review Comment:
   Oops, this should not be here.






[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r297209


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1325,31 +1328,71 @@ private[spark] class BlockManager(
 blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call the provided
+   * `makeIterator` method to compute the block, persist it, and return its values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
+   * could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+  taskId: Long,
+  blockId: RDDBlockId,
+  level: StorageLevel,
+  classTag: ClassTag[T],
+  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+val isCacheVisible = isRDDBlockVisible(blockId)
+val res = getOrElseUpdate(blockId, level, classTag, makeIterator, isCacheVisible)
+if (res.isLeft && !isCacheVisible) {
+  // Block exists but not visible, report taskId -> blockId info to master.
+  master.updateRDDBlockTaskInfo(blockId, taskId)
+}
+
+res
+  }
+
   /**
* Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an iterator if the block
* could not be cached.
*/
-  def getOrElseUpdate[T](
+  private def getOrElseUpdate[T](
   blockId: BlockId,
   level: StorageLevel,
   classTag: ClassTag[T],
-  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
-// Attempt to read the block from local or remote storage. If it's present, then we don't need
-// to go through the local-get-or-put path.
-get[T](blockId)(classTag) match {
-  case Some(block) =>
-return Left(block)
-  case _ =>
-// Need to compute the block.
+  makeIterator: () => Iterator[T],
+  isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = {
+// Track whether the data is computed or not, force to do the computation later if need to.
+// The reason we push the force computing later is that once the executor is decommissioned we
+// will have a better chance to replicate the cache block because of the `checkShouldStore`
+// validation when putting a new block.
+var computed: Boolean = false
+val iterator = () => {
+  computed = true
+  makeIterator()
+}
+if (isCacheVisible) {
+  // Attempt to read the block from local or remote storage. If it's present, then we don't need
+  // to go through the local-get-or-put path.
+  get[T](blockId)(classTag) match {
+case Some(block) =>
+  return Left(block)
+case _ =>
+  // Need to compute the block.
+  }
 }
+
 // Initially we hold no locks on this block.
-doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
+doPutIterator(blockId, iterator, level, classTag, keepReadLock = true) match {
   case None =>
 // doPut() didn't hand work back to us, so the block already existed or was successfully
 // stored. Therefore, we now hold a read lock on the block.
+if (!isCacheVisible && !computed) {
+  // Force compute to report accumulator updates.

Review Comment:
   This is the same issue as [here](https://github.com/apache/spark/pull/39459#discussion_r1071815096) @Ngone51 ... I thought we were doing it as a follow-up?
   
   Note that replacing a cached block would also require replication (if relevant), a master update, etc.






[GitHub] [spark] khalidmammadov commented on a diff in pull request #40073: [SPARK-42484] [SQL] UnsafeRowUtils better error message

2023-02-19 Thread via GitHub


khalidmammadov commented on code in PR #40073:
URL: https://github.com/apache/spark/pull/40073#discussion_r295831


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala:
##
@@ -225,12 +225,12 @@ case class StateStoreCustomTimingMetric(name: String, desc: String) extends Stat
 /**
  * An exception thrown when an invalid UnsafeRow is detected in state store.
  */
-class InvalidUnsafeRowException
-  extends RuntimeException("The streaming query failed by state format invalidation. " +
-"The following reasons may cause this: 1. An old Spark version wrote the checkpoint that is " +
-"incompatible with the current one; 2. Broken checkpoint files; 3. The query is changed " +
-"among restart. For the first case, you can try to restart the application without " +
-"checkpoint or use the legacy Spark version to process the streaming state.", null)
+class InvalidUnsafeRowException(error: String)
+  extends RuntimeException(s"The streaming query failed by state format invalidation. " +
+s"The following reasons may cause this: 1. An old Spark version wrote the checkpoint that is " +

Review Comment:
   You don't need the `s` interpolation on each string, do you? The last one is enough, IMHO.






[GitHub] [spark] khalidmammadov commented on a diff in pull request #40067: [WIP][SPARK-42476][CONNECT][DOCS] Spark Connect API reference

2023-02-19 Thread via GitHub


khalidmammadov commented on code in PR #40067:
URL: https://github.com/apache/spark/pull/40067#discussion_r290404


##
python/docs/source/reference/pyspark.connect/functions.rst:
##
@@ -0,0 +1,356 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+=========
+Functions
+=========
+.. currentmodule:: pyspark.sql.connect
+
+Normal Functions

Review Comment:
   Perhaps “Standard Functions” sounds better?






[GitHub] [spark] hvanhovell commented on a diff in pull request #40013: [WIP][SPARK-42367][CONNECT][PYTHON] `DataFrame.drop` should handle duplicated columns properly

2023-02-19 Thread via GitHub


hvanhovell commented on code in PR #40013:
URL: https://github.com/apache/spark/pull/40013#discussion_r284972


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -318,10 +318,17 @@ message Drop {
   // (Required) The input relation.
   Relation input = 1;
 
-  // (Required) columns to drop.
+  // (Optional) columns to drop.
   //
-  // Should contain at least 1 item.
-  repeated Expression cols = 2;
+  // 'columns' and 'column_names' are mutually exclusive.
+  // If set, should contain at least 1 item.
+  repeated Expression columns = 2;
+
+  // (Optional) column names to drop.
+  //
+  // 'columns' and 'column_names' are mutually exclusive.

Review Comment:
   Why not make it oneof then?






[GitHub] [spark] hvanhovell commented on pull request #40013: [WIP][SPARK-42367][CONNECT][PYTHON] `DataFrame.drop` should handle duplicated columns properly

2023-02-19 Thread via GitHub


hvanhovell commented on PR #40013:
URL: https://github.com/apache/spark/pull/40013#issuecomment-1436042026

   @zhengruifeng what is the current problem with drop?





[GitHub] [spark] hvanhovell commented on pull request #40077: [SPIP][POC] Driver scaling: parallel schedulers

2023-02-19 Thread via GitHub


hvanhovell commented on PR #40077:
URL: https://github.com/apache/spark/pull/40077#issuecomment-1436041768

   Before we try to use multiple task schedulers, why not fix the parallelism issues in the existing one? Moving from a locking-everywhere approach to an event loop should yield quite a throughput improvement.
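
   As a toy illustration of that direction (hypothetical Python, not Spark code): a single loop thread owns all mutable scheduler state and drains an event queue, so no locks are needed around state mutation.

    import queue
    import threading

    events: queue.Queue = queue.Queue()
    state = {"running_tasks": 0}  # mutated only by the loop thread, so no lock

    def event_loop() -> None:
        while True:
            event = events.get()
            if event is None:  # shutdown sentinel
                break
            if event == "task_started":
                state["running_tasks"] += 1
            elif event == "task_finished":
                state["running_tasks"] -= 1

    t = threading.Thread(target=event_loop)
    t.start()
    events.put("task_started")
    events.put("task_finished")
    events.put(None)
    t.join()
    print(state["running_tasks"])  # 0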





[GitHub] [spark] hvanhovell closed pull request #40061: [SPARK-42482][CONNECT] Scala Client Write API V1

2023-02-19 Thread via GitHub


hvanhovell closed pull request #40061: [SPARK-42482][CONNECT] Scala Client 
Write API V1
URL: https://github.com/apache/spark/pull/40061





[GitHub] [spark] srowen commented on pull request #40084: [SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 3.22.0

2023-02-19 Thread via GitHub


srowen commented on PR #40084:
URL: https://github.com/apache/spark/pull/40084#issuecomment-1436005281

   On these, can you give any info about why the upgrade is useful or important?





[GitHub] [spark] Kimahriman opened a new pull request, #40085: [SPARK-42492][SQL] Add new function filter_value

2023-02-19 Thread via GitHub


Kimahriman opened a new pull request, #40085:
URL: https://github.com/apache/spark/pull/40085

   
   
   ### What changes were proposed in this pull request?
   
   Add a new higher-order function `filter_value` which takes a column to validate and a function that takes the result of that column and returns a boolean indicating whether to keep the value or return null. This is semantically the same as a `when` expression that passes the column into a validation function, except it guarantees that the initial column is evaluated only once. The idea was taken from Scala's `Option.filter`; I'm open to other names if anyone has a better idea.
   
   ### Why are the changes needed?
   
   Conditionally evaluated expressions are currently not candidates for 
subexpression elimination. This can lead to a lot of duplicate evaluations of 
expressions when doing common data cleaning tasks, such as only keeping a value 
if it matches some validation checks. It gets worse when multiple different 
checks are chained together, and you can end up with a single expensive 
expression being evaluated numerous times.
   
   https://github.com/apache/spark/pull/32987 attempts to solve this by allowing conditionally evaluated expressions to be candidates for subexpression elimination; however, I have not been able to get it merged over the past 1.5 years. I still think that change is valuable and useful, especially as an opt-in behavior, but this is an alternative option to help improve the performance of these kinds of data validation tasks.
   
   A custom implementation of `NullIf` could help as well; however, it would only support exact-equality checks, whereas this can support any validation logic you need.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Adds a new function.
   
   ### How was this patch tested?
   
   New UTs.
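
   For illustration, a PySpark sketch of the duplication described above (the `filter_value` call at the end is hypothetical, with the semantics proposed in this PR):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5)

    expensive = F.sqrt(F.col("id") * 100)  # stand-in for a costly expression

    # With `when`, the expensive expression appears in both the condition and
    # the value, and conditional branches are currently not candidates for
    # subexpression elimination.
    df.select(F.when(expensive > 5, expensive).alias("v")).show()

    # The proposed function would evaluate it once, roughly:
    #   df.select(filter_value(expensive, lambda v: v > 5).alias("v"))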





[GitHub] [spark] srowen closed pull request #40078: Update scalastyle-config.xml

2023-02-19 Thread via GitHub


srowen closed pull request #40078: Update scalastyle-config.xml
URL: https://github.com/apache/spark/pull/40078





[GitHub] [spark] bjornjorgensen commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-19 Thread via GitHub


bjornjorgensen commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1435979942

   @dongjoon-hyun yes, that's right.
   I think we can have this small update now to master and branch 3.4. When 3.4 
is released, we can try to update to a newer version. 





[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r227309


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2468,4 +2468,15 @@ package object config {
   .version("3.4.0")
   .booleanConf
   .createWithDefault(false)
+
+  private[spark] val RDD_CACHE_VISIBILITY_TRACKING_ENABLED =
+ConfigBuilder("spark.rdd.cache.visibilityTracking.enabled")
+  .internal()
+  .doc("Set to be true to enabled RDD cache block's visibility status. Once it's enabled," +
+" a RDD cache block can be used only when it's marked as visible. And a RDD block will be" +
+" marked as visible only when one of the tasks generating the cache block finished" +
+" successfully.")

Review Comment:
   Thanks, updated.
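
   As a rough illustration of the rule that doc string describes (a toy Python sketch with hypothetical names, not the Scala implementation):

    visible_blocks = set()  # RDD block ids that may be served from cache
    tid_to_blocks = {}      # task id -> block ids produced by that task

    def on_block_cached(task_id, block_id):
        # The block is cached but not yet visible; remember who produced it.
        tid_to_blocks.setdefault(task_id, set()).add(block_id)

    def on_task_success(task_id):
        # Blocks generated by a successfully finished task become visible.
        visible_blocks.update(tid_to_blocks.pop(task_id, set()))

    def can_use_cached_block(block_id):
        return block_id in visible_blocks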






[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r226606


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1325,14 +1325,47 @@ private[spark] class BlockManager(
 blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call the provided
+   * `makeIterator` method to compute the block, persist it, and return its values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
+   * could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+  taskId: Long,
+  blockId: RDDBlockId,
+  level: StorageLevel,
+  classTag: ClassTag[T],
+  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+val isCacheVisible = isRDDBlockVisible(blockId)
+var computed: Boolean = false
+val getIterator = () => {
+  computed = true
+  makeIterator()
+}
+
+val res = getOrElseUpdate(blockId, level, classTag, getIterator)
+if (res.isLeft && !isCacheVisible) {
+  if (!computed) {
+// Loaded from cache, re-compute to update accumulators.
+makeIterator()
+  }

Review Comment:
   Updated.






[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r226580


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1424,6 +1457,16 @@ private[spark] class BlockManager(
 blockStoreUpdater.save()
   }
 
+  // Check whether a rdd block is visible or not.
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+// If the rdd block visibility information not available in the block manager,
+// asking master for the information.
+if (blockInfoManager.isRDDBlockVisible(blockId)) {
+  return true
+}
+master.isRDDBlockVisible(blockId)

Review Comment:
   Updated, please take a look. Thanks.






[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r226512


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -77,6 +77,11 @@ class BlockManagerMasterEndpoint(
   // Mapping from block id to the set of block managers that have the block.
   private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
 
+  // Mapping from task id to the set of rdd blocks which are generated from the task.
+  private val tidToRddBlockIds = new mutable.HashMap[Long, mutable.HashSet[RDDBlockId]]
+  // Record the visible RDD blocks which have been generated at least from one successful task.
+  private val visibleRDDBlocks = new mutable.HashSet[RDDBlockId]

Review Comment:
   Updated. Please take a look. Thanks. cc @Ngone51  @mridulm 






[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r226349


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1325,31 +1328,71 @@ private[spark] class BlockManager(
 blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call the provided
+   * `makeIterator` method to compute the block, persist it, and return its values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
+   * could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+  taskId: Long,
+  blockId: RDDBlockId,
+  level: StorageLevel,
+  classTag: ClassTag[T],
+  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+val isCacheVisible = isRDDBlockVisible(blockId)
+val res = getOrElseUpdate(blockId, level, classTag, makeIterator, isCacheVisible)
+if (res.isLeft && !isCacheVisible) {
+  // Block exists but not visible, report taskId -> blockId info to master.
+  master.updateRDDBlockTaskInfo(blockId, taskId)
+}
+
+res
+  }
+
   /**
* Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an iterator if the block
* could not be cached.
*/
-  def getOrElseUpdate[T](
+  private def getOrElseUpdate[T](
   blockId: BlockId,
   level: StorageLevel,
   classTag: ClassTag[T],
-  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
-// Attempt to read the block from local or remote storage. If it's present, then we don't need
-// to go through the local-get-or-put path.
-get[T](blockId)(classTag) match {
-  case Some(block) =>
-return Left(block)
-  case _ =>
-// Need to compute the block.
+  makeIterator: () => Iterator[T],
+  isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = {
+// Track whether the data is computed or not, force to do the computation later if need to.
+// The reason we push the force computing later is that once the executor is decommissioned we
+// will have a better chance to replicate the cache block because of the `checkShouldStore`
+// validation when putting a new block.
+var computed: Boolean = false
+val iterator = () => {
+  computed = true
+  makeIterator()
+}
+if (isCacheVisible) {
+  // Attempt to read the block from local or remote storage. If it's present, then we don't need
+  // to go through the local-get-or-put path.
+  get[T](blockId)(classTag) match {
+case Some(block) =>
+  return Left(block)
+case _ =>
+  // Need to compute the block.
+  }
 }
+
 // Initially we hold no locks on this block.
-doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
+doPutIterator(blockId, iterator, level, classTag, keepReadLock = true) match {
   case None =>
 // doPut() didn't hand work back to us, so the block already existed or was successfully
 // stored. Therefore, we now hold a read lock on the block.
+if (!isCacheVisible && !computed) {
+  // Force compute to report accumulator updates.
+  makeIterator()

Review Comment:
   Thanks for pointing this out. Updated and added a UT for this.



##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -1787,6 +1792,12 @@ private[spark] class DAGScheduler(
   case _: ExceptionFailure | _: TaskKilled => updateAccumulators(event)
   case _ =>
 }
+if (trackingCacheVisibility) {
+  // Update rdd blocks' visibility status.
+  blockManagerMaster.updateRDDBlockVisibility(

Review Comment:
   Updated






[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-19 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r226310


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +219,51 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+

Review Comment:
   Added some comments here, please take a look. Thanks.






[GitHub] [spark] LuciferYang opened a new pull request, #40084: [SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 3.22.0

2023-02-19 Thread via GitHub


LuciferYang opened a new pull request, #40084:
URL: https://github.com/apache/spark/pull/40084

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] LuciferYang commented on pull request #40083: [SPARK-42489][BUILD] Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40083:
URL: https://github.com/apache/spark/pull/40083#issuecomment-1435967063

   test first





[GitHub] [spark] LuciferYang opened a new pull request, #40083: [SPARK-42489][BUILD] Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-19 Thread via GitHub


LuciferYang opened a new pull request, #40083:
URL: https://github.com/apache/spark/pull/40083

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] LuciferYang commented on pull request #40082: [SPARK-42488][BUILD] Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread via GitHub


LuciferYang commented on PR #40082:
URL: https://github.com/apache/spark/pull/40082#issuecomment-1435965880

   test first





[GitHub] [spark] LuciferYang opened a new pull request, #40082: [SPARK-42488][BUILD] Upgrade commons-crypto from 1.1.0 to 1.2.0

2023-02-19 Thread via GitHub


LuciferYang opened a new pull request, #40082:
URL: https://github.com/apache/spark/pull/40082

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




