[GitHub] [spark] zhengruifeng commented on pull request #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist

2023-02-20 Thread via GitHub


zhengruifeng commented on PR #40100:
URL: https://github.com/apache/spark/pull/40100#issuecomment-1437975144

   I'm not sure whether we can use `child.maxRowsPerPartition` in a global
   sort. For example, the partition sizes may be [5, 5, 5] in the child, but
   after a global sort the distribution may be [3, 3, 3, 3, 3].
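   A hedged sketch of the concern (the numbers below are hypothetical; range partitioning can also skew, which is what actually breaks the child's bound):
   ```scala
   // The child reports maxRowsPerPartition = 5 for partitions of sizes [5, 5, 5],
   // but a global sort range-repartitions the rows, so the output distribution
   // is unrelated to the child's: a skewed range could receive 9 rows.
   val childPartitions = Seq(5, 5, 5)
   val afterGlobalSort = Seq(9, 3, 3) // hypothetical skewed outcome; 9 > 5
   assert(afterGlobalSort.sum == childPartitions.sum) // rows are preserved, bounds are not
   ```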
   
   cc @cloud-fan 





[GitHub] [spark] yaooqinn opened a new pull request, #40102: [MINOR][TESTS] Avoid NPE in an anonym SparkListener in DataFrameReaderWriterSuite

2023-02-20 Thread via GitHub


yaooqinn opened a new pull request, #40102:
URL: https://github.com/apache/spark/pull/40102

   
   
   ### What changes were proposed in this pull request?
   
   
   Avoid the following NPE in an anonymous SparkListener in
   DataFrameReaderWriterSuite, as the job description may be absent.
   
   ```
   java.lang.NullPointerException
     at java.util.concurrent.ConcurrentLinkedQueue.checkNotNull(ConcurrentLinkedQueue.java:920)
     at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:327)
     at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297)
     at org.apache.spark.sql.test.DataFrameReaderWriterSuite$$anon$2.onJobStart(DataFrameReaderWriterSuite.scala:1151)
     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
     at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
     at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
     at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1462)
     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
   
   ```
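   A minimal sketch of the kind of guard that avoids this, assuming the suite collects job descriptions in a `ConcurrentLinkedQueue` (the names below are hypothetical, not the actual patch):
   ```scala
   import java.util.concurrent.ConcurrentLinkedQueue
   
   import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
   
   // Hypothetical stand-in for the queue used by the suite; note that
   // ConcurrentLinkedQueue.add rejects null elements, which is the NPE above.
   val jobDescriptions = new ConcurrentLinkedQueue[String]()
   
   val listener = new SparkListener {
     override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
       // Both the properties object and the description itself may be absent.
       val desc = if (jobStart.properties != null) {
         jobStart.properties.getProperty("spark.job.description")
       } else null
       if (desc != null) {
         jobDescriptions.add(desc)
       }
     }
   }
   ```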
   
   ### Why are the changes needed?
   
   
   Test Improvement
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   no
   
   ### How was this patch tested?
   
   
   existing tests





[GitHub] [spark] LuciferYang closed pull request #40099: [WIP][CONNECT] Scala client add collection functions

2023-02-20 Thread via GitHub


LuciferYang closed pull request #40099: [WIP][CONNECT] Scala client add 
collection functions
URL: https://github.com/apache/spark/pull/40099





[GitHub] [spark] dongjoon-hyun commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40095:
URL: https://github.com/apache/spark/pull/40095#issuecomment-1437963238

   Hi, @xinrong-meng . The branch-3.4 CI is still broken and I made the second 
PR to mitigate it.
   - https://github.com/apache/spark/pull/40101





[GitHub] [spark] dongjoon-hyun commented on pull request #40101: [SPARK-42507][SQL][TESTS] Simplify ORC schema merging conflict error check

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40101:
URL: https://github.com/apache/spark/pull/40101#issuecomment-1437962223

   Hi, @xinrong-meng , @HyukjinKwon , @cloud-fan 
   I decided to simplify the test case so that it is robust to the order of the
   merged schema.
   This will unblock Apache Spark 3.4.0 RC1.





[GitHub] [spark] beliefer commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-20 Thread via GitHub


beliefer commented on PR #38799:
URL: https://github.com/apache/spark/pull/38799#issuecomment-1437962007

   @cloud-fan @zhengruifeng @LuciferYang @ulysses-you Thank you for review.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39954: [SPARK-42289][SQL] DS V2 pushdown could let JDBC dialect decide to push down offset and limit

2023-02-20 Thread via GitHub


cloud-fan commented on code in PR #39954:
URL: https://github.com/apache/spark/pull/39954#discussion_r1112636420


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala:
##
@@ -135,15 +135,16 @@ case class JDBCScanBuilder(
   }
 
   override def pushLimit(limit: Int): Boolean = {
-    if (jdbcOptions.pushDownLimit) {
+    if (jdbcOptions.pushDownLimit && JdbcDialects.get(jdbcOptions.url).supportsLimit) {

Review Comment:
   shall we create a dialect instance as a field of `JDBCScanBuilder`?
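   A hedged sketch of that suggestion (heavily simplified; the real builder has more fields and mixes in the push-down interfaces):
   ```scala
   import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
   import org.apache.spark.sql.jdbc.JdbcDialects
   
   // Look the dialect up once as a field instead of on every push-down call.
   case class JDBCScanBuilder(jdbcOptions: JDBCOptions /* other fields elided */) {
     private lazy val dialect = JdbcDialects.get(jdbcOptions.url)
   
     def pushLimit(limit: Int): Boolean = {
       if (jdbcOptions.pushDownLimit && dialect.supportsLimit) {
         // ... record the limit for the generated JDBC query
         true
       } else {
         false
       }
     }
   }
   ```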






[GitHub] [spark] dongjoon-hyun opened a new pull request, #40101: [SPARK-42507][SQL][TESTS] Simplify ORC schema merging conflict error check

2023-02-20 Thread via GitHub


dongjoon-hyun opened a new pull request, #40101:
URL: https://github.com/apache/spark/pull/40101

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] cloud-fan closed pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-20 Thread via GitHub


cloud-fan closed pull request #38799: [SPARK-37099][SQL] Introduce the group 
limit of Window for rank-based filter to optimize top-k computation
URL: https://github.com/apache/spark/pull/38799





[GitHub] [spark] cloud-fan commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation

2023-02-20 Thread via GitHub


cloud-fan commented on PR #38799:
URL: https://github.com/apache/spark/pull/38799#issuecomment-1437957471

   thanks, merging to master!





[GitHub] [spark] itholic commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


itholic commented on code in PR #40092:
URL: https://github.com/apache/spark/pull/40092#discussion_r1112629589


##
python/docs/source/conf.py:
##
@@ -89,6 +89,8 @@
 .. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
 .. |binder_ps| replace:: Live Notebook: pandas API on Spark
 .. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
+.. |binder_connect| replace:: Live Notebook: DataFrame with Spark Connect
+.. _binder_connect: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb

Review Comment:
   Yes, this link is completed when building the docs, so it doesn't work as-is.
   I manually checked that it shows the notebook properly after building the docs.








[GitHub] [spark] wangyum commented on pull request #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist

2023-02-20 Thread via GitHub


wangyum commented on PR #40100:
URL: https://github.com/apache/spark/pull/40100#issuecomment-1437946552

   cc @zhengruifeng 





[GitHub] [spark] wangyum opened a new pull request, #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist

2023-02-20 Thread via GitHub


wangyum opened a new pull request, #40100:
URL: https://github.com/apache/spark/pull/40100

   ### What changes were proposed in this pull request?
   
   Fix `Sort`'s `maxRowsPerPartition` if `maxRows` does not exist.
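   A hedged sketch of the intended behavior (a simplified stand-in for the logical `Sort` node, not the exact patch):
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // Simplified illustration: a global sort range-repartitions the rows, so the
   // child's per-partition bound no longer holds; only the total row count can
   // bound a single output partition, and if that is unknown there is no bound.
   case class Sort(global: Boolean, child: LogicalPlan) {
     def maxRows: Option[Long] = child.maxRows
   
     def maxRowsPerPartition: Option[Long] = {
       if (global) maxRows else child.maxRowsPerPartition
     }
   }
   ```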
   
   ### Why are the changes needed?
   
   `LimitPushDown` may use this value.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.





[GitHub] [spark] LuciferYang commented on pull request #40099: [WIP][CONNECT] Scala client collection functions

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40099:
URL: https://github.com/apache/spark/pull/40099#issuecomment-1437930816

   Adding test cases
   
   





[GitHub] [spark] LuciferYang opened a new pull request, #40099: [WIP][CONNECT] Scala client collection functions

2023-02-20 Thread via GitHub


LuciferYang opened a new pull request, #40099:
URL: https://github.com/apache/spark/pull/40099

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-02-20 Thread via GitHub


Yikf commented on PR #40064:
URL: https://github.com/apache/spark/pull/40064#issuecomment-1437928464

   Updated; verified with Kyuubi on Spark 3.3.2 and all tests passed:
   ```
   build/mvn clean test -pl :kyuubi-spark-connector-hive_2.12 -am -Pspark-3.3 -Dspark.version=3.3.2
   ```





[GitHub] [spark-docker] Yikun opened a new pull request, #31: [SPARK-42505] Apply entrypoint template change to 3.3.0/3.3.1

2023-02-20 Thread via GitHub


Yikun opened a new pull request, #31:
URL: https://github.com/apache/spark-docker/pull/31

   ### What changes were proposed in this pull request?
   Apply entrypoint template change to 3.3.0/3.3.1
   
   
   ### Why are the changes needed?
   We removed the redundant PySpark-related vars in
   https://github.com/apache/spark-docker/commit/e8f5b0a1151c349d9c7fdb09cf76300b42a6946b.
   This change should also be applied to 3.3.0/3.3.1.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, because the image hasn't been published yet.
   
   
   ### How was this patch tested?
   CI for 3.3.0/3.3.1 passed





[GitHub] [spark-docker] Yikun commented on pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread via GitHub


Yikun commented on PR #30:
URL: https://github.com/apache/spark-docker/pull/30#issuecomment-1437921002

   @viirya @HyukjinKwon Merged, thanks!





[GitHub] [spark-docker] Yikun closed pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread via GitHub


Yikun closed pull request #30: [SPARK-42494] Add official image Dockerfile for 
Spark v3.3.2
URL: https://github.com/apache/spark-docker/pull/30





[GitHub] [spark] Yikf commented on a diff in pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-02-20 Thread via GitHub


Yikf commented on code in PR #40064:
URL: https://github.com/apache/spark/pull/40064#discussion_r1112588510


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala:
##
@@ -30,7 +30,7 @@ case class FileWriterFactory (
     description: WriteJobDescription,
     committer: FileCommitProtocol) extends DataWriterFactory {
 
-  private val jobId = SparkHadoopWriterUtils.createJobID(new Date, 0)

Review Comment:
   sound good to me






[GitHub] [spark] cloud-fan commented on a diff in pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-02-20 Thread via GitHub


cloud-fan commented on code in PR #40064:
URL: https://github.com/apache/spark/pull/40064#discussion_r1112579465


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala:
##
@@ -30,7 +30,7 @@ case class FileWriterFactory (
     description: WriteJobDescription,
     committer: FileCommitProtocol) extends DataWriterFactory {
 
-  private val jobId = SparkHadoopWriterUtils.createJobID(new Date, 0)

Review Comment:
   shall we simply mark it as `@transient lazy val`? 
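   A hedged sketch of that suggestion (heavily simplified; the real factory carries more fields):
   ```scala
   import java.util.Date
   
   import org.apache.spark.internal.io.SparkHadoopWriterUtils
   
   // @transient drops the non-serializable JobID from the serialized factory;
   // lazy re-creates it on first use after deserialization on the executor.
   case class FileWriterFactory() {
     @transient private lazy val jobId = SparkHadoopWriterUtils.createJobID(new Date, 0)
   }
   ```
   One trade-off: each deserialized copy would recompute the ID from a fresh `new Date`, which is presumably why the PR title proposes serializing a `jobTrackerId` instead.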






[GitHub] [spark] LuciferYang commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2

2023-02-20 Thread via GitHub


LuciferYang commented on code in PR #40075:
URL: https://github.com/apache/spark/pull/40075#discussion_r1112571085


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession {
     }
   }
 
+  test("write v2") {
+    try {
+      spark.range(3).writeTo("myTableV2").using("parquet").create()

Review Comment:
   One problem is that I'm not sure why the test dependency `parquet-hadoop`
   needs to be added to the `connect-client-jvm` module.
   
   






[GitHub] [spark] xinrong-meng commented on pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals

2023-02-20 Thread via GitHub


xinrong-meng commented on PR #40076:
URL: https://github.com/apache/spark/pull/40076#issuecomment-1437884897

   Sorry for the late LGTM.





[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-02-20 Thread via GitHub


Yikf commented on PR #40064:
URL: https://github.com/apache/spark/pull/40064#issuecomment-1437856558

   @cloud-fan This is the error that Apache Kyuubi encountered when upgrading
   from Spark 3.3.1 to 3.3.2; see this
   [link](https://github.com/apache/kyuubi/actions/runs/4192366930/jobs/7268919556#step:6:2611)
   for the error stacktrace.





[GitHub] [spark] xinrong-meng commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


xinrong-meng commented on PR #40095:
URL: https://github.com/apache/spark/pull/40095#issuecomment-1437846634

   Sorry I may be out of context. Do we happen to have an overview of the 
situation of 3.4 CI? I removed the RC tag just now. Let me know if there is 
something I could help with.





[GitHub] [spark] xinrong-meng commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-20 Thread via GitHub


xinrong-meng commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1437841242

   Thanks @dongjoon-hyun!





[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1437840964

   IMO, we don't need to wait for this PR, @xinrong-meng .





[GitHub] [spark] cloud-fan commented on a diff in pull request #39062: [SPARK-41516] [SQL] Allow jdbc dialects to override the query used to create a table

2023-02-20 Thread via GitHub


cloud-fan commented on code in PR #39062:
URL: https://github.com/apache/spark/pull/39062#discussion_r1112517669


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -135,6 +135,38 @@ abstract class JdbcDialect extends Serializable with Logging {
     s""""$colName""""
   }
 
+  /**
+   * Get the SQL that should be used to create a table. Dialects can
+   * override this method to return a SQL that works best in a particular database.
+   * To allow certain options to append when create a new table, which can be
+   * table_options or partition_options.
+   * E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8"
+   * @param tableName The name of the table.
+   * @param strSchema The schema string corresponding to the dialect; the schemaString method of
+   *                  JdbcUtils provides the function of calculating the schema string
+   *                  for different databases.
+   * @param createTableOptions The string of the options that allows certain options to be
+   *                           appended when creating a new table, which can be table_options
+   *                           or partition_options.
+   */
+  def createTable(
+      statement: Statement,
+      tableName: String,
+      strSchema: String,
+      options: JdbcOptionsInWrite): Unit = {
+    val createTableQuery = getCreateTableQuery(tableName, strSchema, options.createTableOptions)

Review Comment:
   `getAddColumnQuery` is used by `alterTable`. The idea is, `alterTable` has 
many parts and it's too much to override `alterTable` if only the ADD COLUMN 
syntax needs to be adjusted.
   
   Can we follow `createSchema` here?






[GitHub] [spark] xinrong-meng commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-20 Thread via GitHub


xinrong-meng commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1437840387

   Since branch-3.4 CI is down I may have to re-create the tag later. Please 
let me know if we shall wait for this PR or not.







[GitHub] [spark] cloud-fan commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-02-20 Thread via GitHub


cloud-fan commented on PR #40064:
URL: https://github.com/apache/spark/pull/40064#issuecomment-1437838332

   @Yikf can you provide a test case, or at least the error stacktrace you hit 
in your environment?





[GitHub] [spark] ulysses-you opened a new pull request, #40098: [SPARK-42504][SQL] NestedColumnAliasing support pruning adjacent projects

2023-02-20 Thread via GitHub


ulysses-you opened a new pull request, #40098:
URL: https://github.com/apache/spark/pull/40098

   
   
   ### What changes were proposed in this pull request?
   
   This PR improves `NestedColumnAliasing` to support pruning adjacent Project
   nodes.
   
   ### Why are the changes needed?
   
   `CollapseProject` won't combine adjacent projects into one when, for example,
   a non-cheap expression is accessed more than once by the project above. In
   that case, adjacent Project nodes can appear that `NestedColumnAliasing` does
   not support pruning.
   
   It's a common case with lateral column aliases, which push down an extra
   Project, e.g.:
   ```sql
   CREATE TABLE t (c struct<c1 int>) USING PARQUET;
   SELECT c.c1, c.c1 + 1 as x, x + 1 FROM t;
   
   -- before
   *(1) Project [c#15.c1 AS c1#53, x#47, (x#47 + 1) AS (lateralAliasReference(x) + 1)#55]
   +- *(1) Project [c#15, (c#15.c1 + 1) AS x#47]
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.t[c#15] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:struct<c1:int>>
   
   -- after
   *(1) Project [_extract_c1#38 AS c1#34, x#27, (x#27 + 1) AS (lateralAliasReference(x) + 1)#37]
   +- *(1) Project [c#33.c1 AS _extract_c1#38, (c#33.c1 + 1) AS x#27]
      +- *(1) ColumnarToRow
         +- FileScan parquet spark_catalog.default.t[c#33] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:struct<c1:int>>
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   add new test





[GitHub] [spark] panbingkun commented on a diff in pull request #37588: [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2

2023-02-20 Thread via GitHub


panbingkun commented on code in PR #37588:
URL: https://github.com/apache/spark/pull/37588#discussion_r993130


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##
@@ -2791,4 +2791,17 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
         "location" -> toSQLValue(location.toString, StringType),
         "identifier" -> toSQLId(tableId.nameParts)))
   }
+
+  def showTableExtendedMultiPartitionUnsupportedError(tableName: String): Throwable = {

Review Comment:
   I will refactor it later






[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [WIP][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-02-20 Thread via GitHub


zhengruifeng commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1112495582


##
connector/connect/client/jvm/pom.xml:
##
@@ -62,6 +62,30 @@
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-mllib-common_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
+      <scope>provided</scope>
+    </dependency>

Review Comment:
   Yes, the Scala client will depend on both `spark-mllib-common` and
   `spark-mllib-local`; the server will depend on `spark-mllib`.






[GitHub] [spark] yabola commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

2023-02-20 Thread via GitHub


yabola commented on PR #39950:
URL: https://github.com/apache/spark/pull/39950#issuecomment-1437794387

   @sunchao Thank you for your reply! Yes, I also noticed this; before, I was
   just aiming for minimal changes.
   In the original implementation:
   The first `footerFileMetaData` uses the `SKIP_ROW_GROUPS` option
   (`SkipMetadataFilter`, returning metadata without row groups);
   The second `footerFileMetaData` uses `RangeMetadataFilter` (returning
   metadata with row-group info).
   Actually, the second `footerFileMetaData` contains all the information used
   in the first one (the implementation difference can be seen in
   `ParquetMetadataConverter#readParquetMetadata`).
   
   So in the case where we need filter pushdown and `enableVectorizedReader` is
   set, we can create only one `ParquetFileReader` and read the Parquet footer
   only once. Other cases can also be optimized to read the footer once.
   
   This needs some more code changes; do you think it is suitable?
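   A hedged sketch of the observation (the file path is hypothetical; a footer read with a range filter also contains everything a `SKIP_ROW_GROUPS` read returns):
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.format.converter.ParquetMetadataConverter
   import org.apache.parquet.hadoop.ParquetFileReader
   
   val conf = new Configuration()
   val path = new Path("/tmp/example.parquet") // hypothetical file
   
   // One footer read with a range filter keeps the row-group info ...
   val footer = ParquetFileReader.readFooter(
     conf, path, ParquetMetadataConverter.range(0, Long.MaxValue))
   val rowGroups = footer.getBlocks // row groups for the vectorized reader
   // ... and still carries the file-level metadata the first read was for.
   val fileMetaData = footer.getFileMetaData
   ```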
   
   





[GitHub] [spark] amaliujia commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


amaliujia commented on code in PR #40092:
URL: https://github.com/apache/spark/pull/40092#discussion_r1112479064


##
python/docs/source/conf.py:
##
@@ -89,6 +89,8 @@
 .. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
 .. |binder_ps| replace:: Live Notebook: pandas API on Spark
 .. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
+.. |binder_connect| replace:: Live Notebook: DataFrame with Spark Connect
+.. _binder_connect: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb

Review Comment:
   This link seems to be broken to me.






[GitHub] [spark] wangyum commented on pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases

2023-02-20 Thread via GitHub


wangyum commented on PR #40093:
URL: https://github.com/apache/spark/pull/40093#issuecomment-1437782372

   cc @cloud-fan 





[GitHub] [spark] LuciferYang commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40075:
URL: https://github.com/apache/spark/pull/40075#issuecomment-1437772288

   > > Moving `SaveMode` to the catalyst module is a breaking change; we need to
   > > add `ProblemFilters` to `MimaExcludes`
   > 
   > I think we shouldn't make such a breaking change?
   
   Best to avoid.





[GitHub] [spark] amaliujia commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2

2023-02-20 Thread via GitHub


amaliujia commented on PR #40075:
URL: https://github.com/apache/spark/pull/40075#issuecomment-1437771626

   > Moving `SaveMode` to the catalyst module is a breaking change; we need to
   > add `ProblemFilters` to `MimaExcludes`
   
   I think we shouldn't make such a breaking change?





[GitHub] [spark] Yaohua628 commented on pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation

2023-02-20 Thread via GitHub


Yaohua628 commented on PR #40035:
URL: https://github.com/apache/spark/pull/40035#issuecomment-1437769149

   @cloud-fan @olaky Added more clarifications, let me know WDYT! Thanks





[GitHub] [spark] amaliujia commented on a diff in pull request #40097: [WIP][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-02-20 Thread via GitHub


amaliujia commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1112464074


##
connector/connect/client/jvm/pom.xml:
##
@@ -62,6 +62,30 @@
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-mllib-common_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
+      <scope>provided</scope>
+    </dependency>

Review Comment:
   I am pretty new to pom files, so just trying to understand:
   
   does this mean the Scala client needs runtime dependencies (jars) on
   `spark-mllib-common` and `spark-mllib-local` in its classpath?






[GitHub] [spark] wangyum commented on a diff in pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases

2023-02-20 Thread via GitHub


wangyum commented on code in PR #40093:
URL: https://github.com/apache/spark/pull/40093#discussion_r1112460665


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -200,14 +200,20 @@ object ConstantPropagation extends Rule[LogicalPlan] {
 
   private def replaceConstants(condition: Expression, equalityPredicates: EqualityPredicates)
     : Expression = {
-    val constantsMap = AttributeMap(equalityPredicates.map(_._1))
-    val predicates = equalityPredicates.map(_._2).toSet
-    def replaceConstants0(expression: Expression) = expression transform {
+    val allConstantsMap = AttributeMap(equalityPredicates.map(_._1))
+    val allPredicates = equalityPredicates.map(_._2).toSet
+    def replaceConstants0(
+        expression: Expression, constantsMap: AttributeMap[Literal]) = expression transform {
       case a: AttributeReference => constantsMap.getOrElse(a, a)
     }
     condition transform {
-      case e @ EqualTo(_, _) if !predicates.contains(e) => replaceConstants0(e)
-      case e @ EqualNullSafe(_, _) if !predicates.contains(e) => replaceConstants0(e)
+      case b: BinaryComparison =>
+        if (!allPredicates.contains(b)) {
+          replaceConstants0(b, allConstantsMap)
+        } else {
+          val excludedEqualityPredicates = equalityPredicates.filterNot(_._2.semanticEquals(b))

Review Comment:
   Exclude current binary comparison to support the following case:
   ```
   a = 1 and a = 2 ==> 2 = 1 and 1 = 2
   ```






[GitHub] [spark] wangyum commented on a diff in pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases

2023-02-20 Thread via GitHub


wangyum commented on code in PR #40093:
URL: https://github.com/apache/spark/pull/40093#discussion_r1112458407


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -200,14 +200,20 @@ object ConstantPropagation extends Rule[LogicalPlan] {
 
   private def replaceConstants(condition: Expression, equalityPredicates: EqualityPredicates)
     : Expression = {
-    val constantsMap = AttributeMap(equalityPredicates.map(_._1))
-    val predicates = equalityPredicates.map(_._2).toSet
-    def replaceConstants0(expression: Expression) = expression transform {
+    val allConstantsMap = AttributeMap(equalityPredicates.map(_._1))
+    val allPredicates = equalityPredicates.map(_._2).toSet
+    def replaceConstants0(
+        expression: Expression, constantsMap: AttributeMap[Literal]) = expression transform {
       case a: AttributeReference => constantsMap.getOrElse(a, a)
     }
     condition transform {
-      case e @ EqualTo(_, _) if !predicates.contains(e) => replaceConstants0(e)
-      case e @ EqualNullSafe(_, _) if !predicates.contains(e) => replaceConstants0(e)
+      case b: BinaryComparison =>

Review Comment:
   To support other binary comparisons. For example: `>`, `<`.






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112457906


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -502,7 +536,10 @@ private[storage] class BlockInfoManager extends Logging {
         throw new IllegalStateException(
           s"Task $taskAttemptId called remove() on block $blockId without a write lock")
       } else {
-        blockInfoWrappers.remove(blockId)
+        invisibleRDDBlocks.synchronized {
+          blockInfoWrappers.remove(blockId)

Review Comment:
   Does `blockInfoWrappers` need to be protected by
   `invisibleRDDBlocks.synchronized` here?






[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


Ngone51 commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112456693


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -399,7 +426,14 @@ private[storage] class BlockInfoManager extends Logging {
     try {
       val wrapper = new BlockInfoWrapper(newBlockInfo, lock)
       while (true) {
-        val previous = blockInfoWrappers.putIfAbsent(blockId, wrapper)
+        val previous = invisibleRDDBlocks.synchronized {
+          val res = blockInfoWrappers.putIfAbsent(blockId, wrapper)
+          if (res == null && trackingCacheVisibility) {
+            // Added to invisible blocks if it doesn't exist before.
+            blockId.asRDDId.foreach(invisibleRDDBlocks.add)

Review Comment:
   I'd suggest updating `invisibleRDDBlocks` within `lockForWriting()` when the
   write lock is acquired.






[GitHub] [spark] Yaohua628 commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation

2023-02-20 Thread via GitHub


Yaohua628 commented on code in PR #40035:
URL: https://github.com/apache/spark/pull/40035#discussion_r1112452613


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##
@@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute {
 
   val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = "__file_source_generated_metadata_col"
 
+  /**
+   * We keep generated metadata attributes nullability configurable here:
+   * 1. Before passing to readers, we create generated metadata attributes as nullable;
+   *    Because, for row_index, the readers do not consider the column required.

Review Comment:
   Thanks for the clarification @olaky! 
   @cloud-fan I can further improve the doc here and add more clarifications in 
some other places






[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread via GitHub


Yikun commented on code in PR #30:
URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920


##
3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh:
##
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+  JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+    export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+    export PYSPARK_DRIVER_PYTHON
+fi

Review Comment:
   Good point, I also checked the original changes [1]; they are for PySpark only.
   
   I will get this merged first (the 3.3.2 dockerfiles), and submit a new PR adding an entrypoint.template to clean up these vars in all other versions.
   
   [1] 
https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40094: [SPARK-41812][SPARK-41823][CONNECT][SQL][SCALA] Add PlanId to Scala Client

2023-02-20 Thread via GitHub


zhengruifeng commented on PR #40094:
URL: https://github.com/apache/spark/pull/40094#issuecomment-1437738671

   I think we should ignore the `plan_id` when comparing the plans, since it 
depends on the order of invocation and is not stable.
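   
   A minimal sketch of that idea (hypothetical helper; assumes the golden files compare the plans via their text rendering):
   
   ```scala
   // Drop the unstable plan_id lines from the rendered plan before diffing,
   // so the comparison no longer depends on the order of invocation.
   def normalizePlan(planText: String): String =
     planText.linesIterator
       .filterNot(_.trim.startsWith("plan_id"))
       .mkString("\n")
   
   // assert(normalizePlan(actual.toString) == normalizePlan(expected.toString))
   ```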


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-20 Thread via GitHub


hvanhovell commented on PR #40089:
URL: https://github.com/apache/spark/pull/40089#issuecomment-1437726276

   I have updated the since tags.
   
   A couple of things I would like to highlight. The Scala client tries to be as compatible as possible (both source and binary) with the existing API. That is why this is different from an API built on top of the Scala API: that one was new, whereas this one is not; it should be, or at least aspires to be, the same API. Hence my argument that you could also use the original since tags. Anyway, I am fine with updating them, as I did in other PRs; I just missed it in this case.
   
   As for the removal of deprecated methods: again, we want to be as compatible as possible, so I want to give users access to them. We aspire that in the best case a user can migrate their code to Connect without any issues (preferably without recompilation). I don't think we should ever remove these methods; they are poorly named, that is all. They do not inhibit us in any way from continuing to evolve Spark, since they are just aliases. I have kept a few deprecated methods out of the current Connect Dataset, but I do intend to bring them back once we can support them.
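   
   For illustration only, the alias pattern in question looks roughly like this (modeled on `functions.scala`, not a verbatim copy):
   
   ```scala
   import org.apache.spark.sql.Column
   
   // Inside functions.scala: the deprecated name simply forwards to the canonical
   // function, so keeping it preserves compatibility at no maintenance cost.
   @deprecated("Use approx_count_distinct", "2.1.0")
   def approxCountDistinct(e: Column): Column = approx_count_distinct(e)
   ```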
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request, #40097: [WIP][CONNECT][ML] Extract the common classes to mllib-common

2023-02-20 Thread via GitHub


zhengruifeng opened a new pull request, #40097:
URL: https://github.com/apache/spark/pull/40097

   ### What changes were proposed in this pull request?
   Extract the common classes to mllib-common
   
   ### Why are the changes needed?
   to support MLlib on top of Spark Connect
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   updated tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40049:
URL: https://github.com/apache/spark/pull/40049#issuecomment-1437705253

   I closed my PR because the failure seems to happen earlier than this commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #40096: [SPARK-XXX][SQL][TESTS] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


dongjoon-hyun closed pull request #40096: [SPARK-XXX][SQL][TESTS] Reduce the 
degree of concurrency during ORC schema merge conflict tests
URL: https://github.com/apache/spark/pull/40096


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


dongjoon-hyun closed pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce 
the degree of concurrency during ORC schema merge conflict tests
URL: https://github.com/apache/spark/pull/40095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40095:
URL: https://github.com/apache/spark/pull/40095#issuecomment-1437703777

   Unfortunately, this does not seem to mitigate it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-20 Thread via GitHub


HyukjinKwon commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1437697727

   Let's probably not add it to branch-3.4 ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40079:
URL: https://github.com/apache/spark/pull/40079#issuecomment-1437668926

   I'm not sure the new ZooKeeper 3.6.4 is urgent or safe for branch-3.4. According to Maven Central, 3.6.4 is the least adopted (unverified) release among the recent versions. Do we have a test case to support that we need this?
   
   ![Screenshot 2023-02-20 at 3 30 59 
PM](https://user-images.githubusercontent.com/9700541/220212850-0ae350c0-bf7d-4b9d-a844-7780bd6362e6.png)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun opened a new pull request, #40096: [SPARK-XXX][SQL][TESTS] Reduce the degree of concurrency during ORC schema merge conflict tests

2023-02-20 Thread via GitHub


dongjoon-hyun opened a new pull request, #40096:
URL: https://github.com/apache/spark/pull/40096

   ### What changes were proposed in this pull request?
   
   This PR aims to reduce the degree of concurrency during ORC schema merging 
testing.
   
   ### Why are the changes needed?
   
   #40049 seems to expose a few flaky tests in the `branch-3.4` CI. Note that the `master` branch looks fine.
   - https://github.com/apache/spark/runs/11463120795
   - https://github.com/apache/spark/runs/11463886897
   - https://github.com/apache/spark/runs/11467827738
   - https://github.com/apache/spark/runs/11471484144
   - https://github.com/apache/spark/runs/11471507531
   - https://github.com/apache/spark/runs/11474764316
   
   ![Screenshot 2023-02-20 at 12 30 19 
PM](https://user-images.githubusercontent.com/9700541/220193503-6d6ce2ce-3fd6-4b01-b91c-bc1ec1f41c03.png)
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Pass the CIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40049:
URL: https://github.com/apache/spark/pull/40049#issuecomment-1437632660

   Since this currently happens only in GitHub Actions, I made a WIP PR for further investigation. If it's valid, I'll convert it to an official PR separate from this one.
   - https://github.com/apache/spark/pull/40095


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun opened a new pull request, #40095: WIP

2023-02-20 Thread via GitHub


dongjoon-hyun opened a new pull request, #40095:
URL: https://github.com/apache/spark/pull/40095

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40049:
URL: https://github.com/apache/spark/pull/40049#issuecomment-1437534098

   Interestingly, it passes locally while the GitHub Actions jobs keep failing.
   ```
   $ build/sbt "sql/testOnly *.OrcSourceV1Suite -- -z SPARK-11412"
   ...
   [info] All tests passed.
   [success] Total time: 23 s, completed Feb 20, 2023, 12:54:28 PM
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface

2023-02-20 Thread via GitHub


dongjoon-hyun commented on code in PR #40049:
URL: https://github.com/apache/spark/pull/40049#discussion_r1112312058


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java:
##
@@ -159,23 +159,38 @@ default boolean tableExists(Identifier ident) {
 }
   }
 
+  /**
+   * Create a table in the catalog.
+   * 
+   * This is deprecated. Please override

Review Comment:
   Could you add a deprecation version explicitly?
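   
   For example, the annotation could carry the version explicitly. A sketch (the 3.4.0 release is an assumption, and this Scala rendering is only an analogue of the Java interface):
   
   ```scala
   import org.apache.spark.sql.connector.catalog.{Identifier, Table}
   import org.apache.spark.sql.connector.expressions.Transform
   import org.apache.spark.sql.types.StructType
   
   trait TableCatalogSketch {
     // Hypothetical sketch: the message names the replacement and the
     // annotation records the release in which the method was deprecated.
     @deprecated("Override the Column-based createTable instead.", "3.4.0")
     def createTable(
         ident: Identifier,
         schema: StructType,
         partitions: Array[Transform],
         properties: java.util.Map[String, String]): Table
   }
   ```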



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip opened a new pull request, #40094: [SPARK-41812][SPARK-41823][CONNECT][SQL][SCALA] Add PlanId to Scala Client

2023-02-20 Thread via GitHub


grundprinzip opened a new pull request, #40094:
URL: https://github.com/apache/spark/pull/40094

   ### What changes were proposed in this pull request?
   To support resolving ambiguous join columns in the Scala client, this patch adds the unique plan ID to the Scala client and updates the generated test files accordingly.
   
   
   ### Why are the changes needed?
   Compatibility
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LucaCanali commented on pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation

2023-02-20 Thread via GitHub


LucaCanali commented on PR #39127:
URL: https://github.com/apache/spark/pull/39127#issuecomment-1437434751

   Thank you @tgravescs and @attilapiros for reviewing this.
   As for adding a test for this change, my first comment is that the existing tests already cover the YARN_EXCLUDE_NODES functionality. We could think of a specific test that covers the behavior with dynamic allocation turned on and off; however, with the current test infrastructure that I see for YarnAllocatorHealthTrackerSuite or YarnAllocatorSuite, it does not look (to me) like an easy task. Would you have any further clues?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] viirya commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread via GitHub


viirya commented on code in PR #30:
URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112235433


##
3.3.2/scala2.12-java11-ubuntu/entrypoint.sh:
##
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+  JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+    export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+    export PYSPARK_DRIVER_PYTHON
+fi

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] viirya commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread via GitHub


viirya commented on code in PR #30:
URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112235135


##
3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh:
##
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+  JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+    export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+    export PYSPARK_DRIVER_PYTHON
+fi

Review Comment:
   Do we need the PySpark vars in the R docker file?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40089:
URL: https://github.com/apache/spark/pull/40089#issuecomment-1437382771

   +1, I agree with `avoiding this propagation of deprecated methods`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40089:
URL: https://github.com/apache/spark/pull/40089#issuecomment-1437378292

   That's a good question. Maybe, what about avoiding this propagation of deprecated methods, @hvanhovell, @HyukjinKwon, @LuciferYang? Since we already deprecated these, it sounds more consistent with what Apache Spark has been claiming (not to use these APIs in new use cases).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40089:
URL: https://github.com/apache/spark/pull/40089#issuecomment-1437370469

   Personally, I think it is a little strange to change `@since`. For example, the following function is changed to `@since 3.4.0`, but it has been `deprecated` since 2.1.0:
   
   
https://github.com/apache/spark/blob/1688a8768fb34060548f8790e77f645027f65db2/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala#L221-L226


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread via GitHub


sunchao commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1437366108

   Yes, it's in branch-3.4 as well 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1437364337

   @sunchao this also needs to be merged to branch-3.4 ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread via GitHub


sunchao commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1437363350

   Thanks! Merged to master/branch-3.3/branch-3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao closed pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread via GitHub


sunchao closed pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd 
off-heap memory leak as a workaround for PARQUET-2160
URL: https://github.com/apache/spark/pull/40091


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40089:
URL: https://github.com/apache/spark/pull/40089#issuecomment-1437356607

   When we added new languages like Python/R, did we use the same versions as the Scala API, @hvanhovell? They were also wrappers on top of the Scala API.
   
   > I forgot to update it. TBH since these are verbatim copies of the existing 
API, I think there is a case to be made to use the original version instead the 
version of when the client was built.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1437353508

   Thank you all. And feel free to land this in the release branches, @sunchao.
   
   cc @cloud-fan, @HyukjinKwon, @mridulm, @tgravescs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40058: [SPARK-39859][SQL] Support v2 DESCRIBE TABLE EXTENDED for columns

2023-02-20 Thread via GitHub


dongjoon-hyun commented on code in PR #40058:
URL: https://github.com/apache/spark/pull/40058#discussion_r1112191165


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala:
##
@@ -149,13 +149,14 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase
 }
   }
 
-  // TODO(SPARK-39859): Support v2 `DESCRIBE TABLE EXTENDED` for columns
   test("describe extended (formatted) a column") {

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a diff in pull request #40058: [SPARK-39859][SQL] Support v2 DESCRIBE TABLE EXTENDED for columns

2023-02-20 Thread via GitHub


huaxingao commented on code in PR #40058:
URL: https://github.com/apache/spark/pull/40058#discussion_r1112146233


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala:
##
@@ -329,10 +329,17 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with PredicateHelper {
   }
   DescribeTableExec(output, r.table, isExtended) :: Nil
 
-case DescribeColumn(_: ResolvedTable, column, isExtended, output) =>
+case DescribeColumn(r: ResolvedTable, column, isExtended, output) =>
   column match {
 case c: Attribute =>
-  DescribeColumnExec(output, c, isExtended) :: Nil
+  val colStats =
+    r.table.asReadable.newScanBuilder(CaseInsensitiveStringMap.empty()).build() match {

Review Comment:
   @cloud-fan Thanks for your comments. I will have a follow-up to fix this and also move the test to the parent suite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation

2023-02-20 Thread via GitHub


cloud-fan commented on code in PR #40035:
URL: https://github.com/apache/spark/pull/40035#discussion_r1112128392


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##
@@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute {
 
   val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = "__file_source_generated_metadata_col"
 
+  /**
+   * We keep generated metadata attributes nullability configurable here:
+   * 1. Before passing to readers, we create generated metadata attributes as nullable;
+   *    Because, for row_index, the readers do not consider the column required.

Review Comment:
   Ah I see, the row_index can be either persisted in the file, or generated on 
the fly when reading. Do we mention it somewhere in the code comments?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112123963


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging {
*/
   private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, BlockInfoWrapper]
 
+  /**
+   * Record visible rdd blocks stored in the block manager, entries will be removed
+   * by [[removeBlock()]]
+   */
+  private[spark] val visibleRDDBlocks = ConcurrentHashMap.newKeySet[RDDBlockId]

Review Comment:
   Sounds good, will make the change. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112123512


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -77,6 +78,11 @@ class BlockManagerMasterEndpoint(
   // Mapping from block id to the set of block managers that have the block.
   private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
 
+  // Mapping from task id to the set of rdd blocks which are generated from the task.
+  private val tidToRddBlockIds = new mutable.HashMap[Long, mutable.HashSet[RDDBlockId]]
+  // Record the visible RDD blocks which have been generated at least from one successful task.

Review Comment:
   Thanks, fixed the comments.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112123108


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+
+case UpdateRDDBlockTaskInfo(blockId, taskId) =>
+  // This is to report the information that a rdd block(with `blockId`) is computed
+  // and cached by task(with `taskId`). The happens right after the task finished
+  // computing/caching the block only when the block is not visible yet. And the rdd
+  // block will be marked as visible only when the corresponding task finished successfully.
+  context.reply(updateRDDBlockTaskInfo(blockId, taskId))
+
+case GetRDDBlockVisibility(blockId) =>
+  // Get the visibility status of a specific rdd block.
+  if (!trackingCacheVisibility) {

Review Comment:
   Sure, updated.



##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+
+case UpdateRDDBlockTaskInfo(blockId, taskId) =>
+  // This is to report the information that a rdd block(with `blockId`) is computed
+  // and cached by task(with `taskId`). The happens right after the task finished

Review Comment:
   Thanks, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112122792


##
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala:
##
@@ -2266,6 +2270,150 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with PrivateMethodTester {
 }
   }
 
+  test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache visibility statue") {
+    val store = makeBlockManager(8000, "executor1")
+    val blockId = RDDBlockId(rddId = 1, splitIndex = 1)
+    var computed: Boolean = false
+    val data = Seq(1, 2, 3)
+    val makeIterator = () => {
+      computed = true
+      data.iterator
+    }
+
+    // Cache doesn't exist and is not visible.
+    assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId))
+    val res1 = store.getOrElseUpdateRDDBlock(
+      1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+    // Put cache successfully and reported block task info.
+    assert(res1.isLeft && computed)
+    verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1)
+
+    // Cache exists but not visible.
+    computed = false
+    assert(store.getStatus(blockId).nonEmpty && !store.isRDDBlockVisible(blockId))
+    val res2 = store.getOrElseUpdateRDDBlock(
+      1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+    // Load cache successfully and reported block task info.
+    assert(res2.isLeft && computed)
+    verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1)
+
+    // Cache exists and visible.
+    store.blockInfoManager.tryAddVisibleBlock(blockId)
+    computed = false
+    assert(store.getStatus(blockId).nonEmpty && store.isRDDBlockVisible(blockId))
+    val res3 = store.getOrElseUpdateRDDBlock(
+      1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+    // Load cache successfully but not report block task info.
+    assert(res3.isLeft && !computed)
+    verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1)
+  }
+
+  test("add block rdd visibility status") {

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2

2023-02-20 Thread via GitHub


LuciferYang commented on PR #40075:
URL: https://github.com/apache/spark/pull/40075#issuecomment-1437225017

   Moving `SaveMode` to the catalyst module is a breaking change; we need to add corresponding `ProblemFilters` entries to `MimaExcludes`, for example:
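   
   A sketch of the kind of entry this would need in `project/MimaExcludes.scala` (the exact problem type depends on how the class is relocated):
   
   ```scala
   import com.typesafe.tools.mima.core._
   
   // If SaveMode's bytecode leaves its old module, MiMa flags it as missing there;
   // an explicit filter records that the relocation is intentional.
   lazy val saveModeExcludes = Seq(
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.SaveMode")
   )
   ```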


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-20 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1112113795


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1424,6 +1470,28 @@ private[spark] class BlockManager(
 blockStoreUpdater.save()
   }
 
+  // Check whether a rdd block is visible or not.
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+    // Cached blocks are always visible if the feature flag is disabled.
+    if (!trackingCacheVisibility) {
+      return true
+    }
+
+    // If the rdd block visibility information not available in the block manager,
+    // asking master for the information.
+    if (blockInfoManager.isRDDBlockVisible(blockId)) {
+      return true
+    }
+
+    if (master.isRDDBlockVisible(blockId)) {
+      // Cache the visibility status if block exists.
+      blockInfoManager.tryAddVisibleBlock(blockId)
+      true

Review Comment:
   Yes, I think so. Even though the current executor doesn't have the cached 
block, we can still read the cache from a remote executor; a simplified sketch 
is below.
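   As an illustration of that remote fallback (hedged: `blockManager` here 
stands for any executor's `BlockManager`; `get` tries local storage first and 
then fetches the block from remote executors):
   
   ```
   import org.apache.spark.storage.RDDBlockId
   
   // A visible cached block can be served even when this executor holds no
   // local copy: get() falls back to a remote fetch.
   val maybeBlock = blockManager.get[Int](RDDBlockId(rddId = 0, splitIndex = 0))
   maybeBlock.foreach(result => result.data.foreach(println))
   ```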






[GitHub] [spark] LuciferYang commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2

2023-02-20 Thread via GitHub


LuciferYang commented on code in PR #40075:
URL: https://github.com/apache/spark/pull/40075#discussion_r1112110671


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession {
 }
   }
 
+  test("write v2") {
+try {
+  spark.range(3).writeTo("myTableV2").using("parquet").create()

Review Comment:
   Hmm... when I add `parquet-hadoop` as a test dependency of the 
`connect-client-jvm` module, the test passes; see the sketch below.
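   For reference, a minimal sketch of that test dependency in the module's 
`pom.xml` (hedged: coordinates assumed to be the usual `org.apache.parquet` 
artifact, with the version inherited from the parent POM):
   
   ```
   <!-- Hypothetical addition to connector/connect/client/jvm/pom.xml -->
   <dependency>
     <groupId>org.apache.parquet</groupId>
     <artifactId>parquet-hadoop</artifactId>
     <scope>test</scope>
   </dependency>
   ```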






[GitHub] [spark] nija-at closed pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-20 Thread via GitHub


nija-at closed pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce 
spark connect service retries
URL: https://github.com/apache/spark/pull/40066





[GitHub] [spark] nija-at commented on pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries

2023-02-20 Thread via GitHub


nija-at commented on PR #40066:
URL: https://github.com/apache/spark/pull/40066#issuecomment-1437189282

   Discussed this with @grundprinzip. Apparently the retries are intentionally 
kept that long. The fix here needs a different approach. Closing this PR for now.





[GitHub] [spark] olaky commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation

2023-02-20 Thread via GitHub


olaky commented on code in PR #40035:
URL: https://github.com/apache/spark/pull/40035#discussion_r1112048368


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##
@@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute {
 
   val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = 
"__file_source_generated_metadata_col"
 
+  /**
+   * We keep the nullability of generated metadata attributes configurable here:
+   * 1. Before passing them to readers, we create generated metadata attributes
+   *    as nullable, because for row_index the readers do not consider the
+   *    column required.

Review Comment:
   It can be null in the process, so to speak (the column is generated with 
nulls which are then replaced), but it will not be null in the returned output.
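   Purely as an illustration of that two-phase nullability (a conceptual 
sketch, not the actual analyzer logic; the attribute name and type are 
assumptions):
   
   ```
   import org.apache.spark.sql.catalyst.expressions.AttributeReference
   import org.apache.spark.sql.types.LongType
   
   // The attribute starts out nullable for the reader, then is exposed as
   // non-nullable once the generated values are filled in.
   val forReader = AttributeReference("row_index", LongType, nullable = true)()
   val forOutput = forReader.withNullability(false) // what callers observe
   ```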






[GitHub] [spark] cloud-fan commented on pull request #39996: [SPARK-42423][SQL] Add metadata column file block start and length

2023-02-20 Thread via GitHub


cloud-fan commented on PR #39996:
URL: https://github.com/apache/spark/pull/39996#issuecomment-1437118906

   Thanks, merging to master/3.4 (the last step in replacing the builtin 
functions for file metadata with metadata columns).
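   For context, a sketch of how the new fields could be queried once merged 
(hedged: the field names `file_block_start` and `file_block_length` are 
assumed from the PR title, following the naming of the existing `_metadata` 
fields):
   
   ```
   // Select the hidden _metadata column's block fields on a file source.
   // Assumes an active SparkSession named `spark` and Parquet files at the path.
   val df = spark.read.parquet("/tmp/data")
     .select("*", "_metadata.file_block_start", "_metadata.file_block_length")
   df.show()
   ```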





[GitHub] [spark] cloud-fan closed pull request #39996: [SPARK-42423][SQL] Add metadata column file block start and length

2023-02-20 Thread via GitHub


cloud-fan closed pull request #39996: [SPARK-42423][SQL] Add metadata column 
file block start and length
URL: https://github.com/apache/spark/pull/39996





[GitHub] [spark] HyukjinKwon closed pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-20 Thread via GitHub


HyukjinKwon closed pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete 
Spark Connect API reference
URL: https://github.com/apache/spark/pull/40067





[GitHub] [spark] HyukjinKwon commented on pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-20 Thread via GitHub


HyukjinKwon commented on PR #40067:
URL: https://github.com/apache/spark/pull/40067#issuecomment-1437102805

   Merged to master and branch-3.4.





[GitHub] [spark] HyukjinKwon commented on pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


HyukjinKwon commented on PR #40092:
URL: https://github.com/apache/spark/pull/40092#issuecomment-1437101747

   Note for myself: I checked this PR via 
https://mybinder.org/v2/gh/itholic/spark.git/SPARK-42475?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


HyukjinKwon commented on code in PR #40092:
URL: https://github.com/apache/spark/pull/40092#discussion_r1112016388


##
python/docs/source/getting_started/quickstart_connect.ipynb:
##
@@ -0,0 +1,1118 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+"# Quickstart: DataFrame with Spark Connect\n",
+"\n",
+"This is a short introduction and quickstart for the DataFrame with Spark 
Connect. A DataFrame with Spark Connect is virtually, conceptually identical to 
an existing [PySpark 
DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame),
 so most of the examples from 'Live Notebook: DataFrame' at [the quickstart 
page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html)
 can be reused directly.\n",
+"\n",
+"However, it does not yet support some key features such as 
[RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html?highlight=rdd#pyspark.RDD)
 and 
[SparkSession.conf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.conf.html#pyspark.sql.SparkSession.conf),
 so you need to consider it when using DataFrame with Spark Connect.\n",
+"\n",
+"This notebook shows the basic usages of the DataFrame with Spark Connect 
geared mainly for those new to Spark Connect, along with comments of which 
features is not supported compare to the existing DataFrame.\n",
+"\n",
+"There is also other useful information in Apache Spark documentation 
site, see the latest version of [Spark SQL and 
DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html).\n",
+"\n",
+"PySpark applications start with initializing `SparkSession` which is the 
entry point of PySpark as below. In case of running it in PySpark shell via 
pyspark executable, the shell automatically creates the session in 
the variable spark for users."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+"# Spark Connect uses SparkSession from `pyspark.sql.connect.session` 
instead of `pyspark.sql.SparkSession`.\n",
+"from pyspark.sql.connect.session import SparkSession\n",
+"\n",
+"spark = SparkSession.builder.getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+"## DataFrame Creation\n",
+"\n",
+"A PySpark DataFrame with Spark Connect can be created via 
`pyspark.sql.connect.session.SparkSession.createDataFrame` typically by passing 
a list of lists, tuples, dictionaries and `pyspark.sql.Row`s, a [pandas 
DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) 
and an RDD consisting of such a list.\n",
+"`pyspark.sql.connect.session.SparkSession.createDataFrame` takes the 
`schema` argument to specify the schema of the DataFrame. When it is omitted, 
PySpark infers the corresponding schema by taking a sample from the data.\n",
+"\n",
+"Firstly, you can create a PySpark DataFrame from a list of rows"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+{
+ "data": {
+  "text/plain": [
+   "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+  ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+}
+   ],
+   "source": [
+"from datetime import datetime, date\n",
+"import pandas as pd\n",
+"from pyspark.sql import Row\n",
+"\n",
+"df = spark.createDataFrame([\n",
+"Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 
1, 12, 0)),\n",
+"Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 
2, 12, 0)),\n",
+"Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 
3, 12, 0))\n",
+"])\n",
+"df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+"Create a PySpark DataFrame with an explicit schema."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+{
+ "data": {
+  "text/plain": [
+   "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]"
+  ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+}
+   ],
+   "source": [
+"df = spark.createDataFrame([\n",
+"(1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),\n",
+"(2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),\n",
+"(3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))\n",
+"], schema='a long, b double, c string, d date, e timestamp')\n",
+"df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+"Create a PySpark Data

[GitHub] [spark] HyukjinKwon commented on pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


HyukjinKwon commented on PR #40092:
URL: https://github.com/apache/spark/pull/40092#issuecomment-1437098718

   This seems to duplicate a lot of the existing quickstart.
   Should we maybe just mention Spark Connect in the original quickstart, and 
have a separate quickstart for Connect?
   Ideally, we should demonstrate the idea of the client and server being 
separate.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect

2023-02-20 Thread via GitHub


HyukjinKwon commented on code in PR #40092:
URL: https://github.com/apache/spark/pull/40092#discussion_r1112013877


##
python/docs/source/getting_started/quickstart_connect.ipynb:
##
