[GitHub] [spark] zhengruifeng commented on pull request #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist
zhengruifeng commented on PR #40100: URL: https://github.com/apache/spark/pull/40100#issuecomment-1437975144 I'm not sure whether we can use `child.maxRowsPerPartition` in a global sort. For example, the partition sizes may be [5, 5, 5] in the child, but after a global sort the distribution may be [3, 3, 3, 3, 3]. cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
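The concern above, that the child's per-partition row bound says nothing about partition sizes after a global sort, can be illustrated outside Spark. The sketch below uses hypothetical helper names (not Spark code): it chunks skewed values into equal-sized input partitions, then range-partitions them by key the way a global sort redistributes rows, so the output bucket sizes follow the key distribution rather than the input chunking.

```java
import java.util.*;
import java.util.stream.*;

public class RangePartitionDemo {
    // Split values into fixed-size input partitions (like the child's [5, 5, 5]).
    static List<List<Integer>> chunk(List<Integer> vals, int size) {
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < vals.size(); i += size) {
            out.add(vals.subList(i, Math.min(i + size, vals.size())));
        }
        return out;
    }

    // Range-partition values into n buckets by key range, the way a global
    // sort redistributes rows: bucket sizes follow the key distribution.
    static int[] rangeBucketSizes(List<Integer> vals, int n, int min, int max) {
        int[] sizes = new int[n];
        int span = (max - min + 1 + n - 1) / n; // width of each key range
        for (int v : vals) {
            sizes[Math.min((v - min) / span, n - 1)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Skewed keys: many small values, a few large ones.
        List<Integer> vals = Arrays.asList(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 90, 100);
        System.out.println("child partition sizes: "
                + chunk(vals, 5).stream().map(List::size).collect(Collectors.toList()));
        System.out.println("after range partition: "
                + Arrays.toString(rangeBucketSizes(vals, 3, 1, 100)));
    }
}
```

With this skew the input is [5, 5, 5] but the range buckets come out [13, 0, 2], so a per-partition bound taken from the child would be wrong after the sort.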
[GitHub] [spark] yaooqinn opened a new pull request, #40102: [MINOR][TESTS] Avoid NPE in an anonym SparkListener in DataFrameReaderWriterSuite
yaooqinn opened a new pull request, #40102: URL: https://github.com/apache/spark/pull/40102
### What changes were proposed in this pull request?
Avoid the following NPE in an anonymous SparkListener in DataFrameReaderWriterSuite, as the job description may be absent:
```
java.lang.NullPointerException
  at java.util.concurrent.ConcurrentLinkedQueue.checkNotNull(ConcurrentLinkedQueue.java:920)
  at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:327)
  at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297)
  at org.apache.spark.sql.test.DataFrameReaderWriterSuite$$anon$2.onJobStart(DataFrameReaderWriterSuite.scala:1151)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1462)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
```
### Why are the changes needed?
Test improvement.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
existing tests
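For context, `ConcurrentLinkedQueue` is one of the `java.util.concurrent` collections that rejects null elements, which is exactly how an absent job description turns into the NPE in the stack trace above. A minimal sketch of the failure and a null-guard (hypothetical variable names; this is not the suite's actual listener code):

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

public class NullDescDemo {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> jobDescs = new ConcurrentLinkedQueue<>();
        String desc = null; // the job description property may be absent

        try {
            jobDescs.add(desc); // ConcurrentLinkedQueue does not permit null elements
        } catch (NullPointerException e) {
            System.out.println("NPE: null elements are not permitted");
        }

        // Guard before adding, which is the shape of the fix:
        Optional.ofNullable(desc).ifPresent(jobDescs::add);
        System.out.println("queue size: " + jobDescs.size()); // prints: queue size: 0
    }
}
```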
[GitHub] [spark] LuciferYang closed pull request #40099: [WIP][CONNECT] Scala client add collection functions
LuciferYang closed pull request #40099: [WIP][CONNECT] Scala client add collection functions URL: https://github.com/apache/spark/pull/40099
[GitHub] [spark] dongjoon-hyun commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests
dongjoon-hyun commented on PR #40095: URL: https://github.com/apache/spark/pull/40095#issuecomment-1437963238 Hi, @xinrong-meng. The branch-3.4 CI is still broken and I made the second PR to mitigate it. - https://github.com/apache/spark/pull/40101
[GitHub] [spark] dongjoon-hyun commented on pull request #40101: [SPARK-42507][SQL][TESTS] Simplify ORC schema merging conflict error check
dongjoon-hyun commented on PR #40101: URL: https://github.com/apache/spark/pull/40101#issuecomment-1437962223 Hi, @xinrong-meng, @HyukjinKwon, @cloud-fan. I decided to simplify the test case to be robust with respect to the order of the merged schema. This will unblock Apache Spark 3.4.0 RC1.
[GitHub] [spark] beliefer commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation
beliefer commented on PR #38799: URL: https://github.com/apache/spark/pull/38799#issuecomment-1437962007 @cloud-fan @zhengruifeng @LuciferYang @ulysses-you Thank you for the review.
[GitHub] [spark] cloud-fan commented on a diff in pull request #39954: [SPARK-42289][SQL] DS V2 pushdown could let JDBC dialect decide to push down offset and limit
cloud-fan commented on code in PR #39954: URL: https://github.com/apache/spark/pull/39954#discussion_r1112636420
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala:
```
@@ -135,15 +135,16 @@ case class JDBCScanBuilder(
   }

   override def pushLimit(limit: Int): Boolean = {
-    if (jdbcOptions.pushDownLimit) {
+    if (jdbcOptions.pushDownLimit && JdbcDialects.get(jdbcOptions.url).supportsLimit) {
```
Review Comment: shall we create a dialect instance as a field of `JDBCScanBuilder`?
[GitHub] [spark] dongjoon-hyun opened a new pull request, #40101: [SPARK-42507][SQL][TESTS] Simplify ORC schema merging conflict error check
dongjoon-hyun opened a new pull request, #40101: URL: https://github.com/apache/spark/pull/40101 …
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
[GitHub] [spark] cloud-fan closed pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation
cloud-fan closed pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation URL: https://github.com/apache/spark/pull/38799
[GitHub] [spark] cloud-fan commented on pull request #38799: [SPARK-37099][SQL] Introduce the group limit of Window for rank-based filter to optimize top-k computation
cloud-fan commented on PR #38799: URL: https://github.com/apache/spark/pull/38799#issuecomment-1437957471 thanks, merging to master!
[GitHub] [spark] itholic commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
itholic commented on code in PR #40092: URL: https://github.com/apache/spark/pull/40092#discussion_r1112629589
## python/docs/source/conf.py:
```
@@ -89,6 +89,8 @@
 .. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
 .. |binder_ps| replace:: Live Notebook: pandas API on Spark
 .. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
+.. |binder_connect| replace:: Live Notebook: DataFrame with Spark Connect
+.. _binder_connect: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb
```
Review Comment: Yes, this link is filled in when the docs are built, so it does not work as-is. I manually checked that it shows the notebook properly after building the docs.
[GitHub] [spark] wangyum commented on pull request #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist
wangyum commented on PR #40100: URL: https://github.com/apache/spark/pull/40100#issuecomment-1437946552 cc @zhengruifeng
[GitHub] [spark] wangyum opened a new pull request, #40100: [SPARK-42506][SQL] Fix Sort's maxRowsPerPartition if maxRows does not exist
wangyum opened a new pull request, #40100: URL: https://github.com/apache/spark/pull/40100
### What changes were proposed in this pull request?
Fix `Sort`'s maxRowsPerPartition if maxRows does not exist.
### Why are the changes needed?
`LimitPushDown` may use this value.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
[GitHub] [spark] LuciferYang commented on pull request #40099: [WIP][CONNECT] Scala client collection functions
LuciferYang commented on PR #40099: URL: https://github.com/apache/spark/pull/40099#issuecomment-1437930816 Adding test cases
[GitHub] [spark] LuciferYang opened a new pull request, #40099: [WIP][CONNECT] Scala client collection functions
LuciferYang opened a new pull request, #40099: URL: https://github.com/apache/spark/pull/40099
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on PR #40064: URL: https://github.com/apache/spark/pull/40064#issuecomment-1437928464 Updated; verified with Kyuubi on Spark 3.3.2, and all tests passed:
```
build/mvn clean test -pl :kyuubi-spark-connector-hive_2.12 -am -Pspark-3.3 -Dspark.version=3.3.2
```
[GitHub] [spark-docker] Yikun opened a new pull request, #31: [SPARK-42505] Apply entrypoint template change to 3.3.0/3.3.1
Yikun opened a new pull request, #31: URL: https://github.com/apache/spark-docker/pull/31
### What changes were proposed in this pull request?
Apply the entrypoint template change to 3.3.0/3.3.1.
### Why are the changes needed?
We removed the redundant PySpark-related vars in https://github.com/apache/spark-docker/commit/e8f5b0a1151c349d9c7fdb09cf76300b42a6946b. This change should also be applied to 3.3.0/3.3.1.
### Does this PR introduce _any_ user-facing change?
No, because the image hasn't been published yet.
### How was this patch tested?
CI for 3.3.0/3.3.1 passed.
[GitHub] [spark-docker] Yikun commented on pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on PR #30: URL: https://github.com/apache/spark-docker/pull/30#issuecomment-1437921002 @viirya @HyukjinKwon Merged, thanks!
[GitHub] [spark-docker] Yikun closed pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun closed pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2 URL: https://github.com/apache/spark-docker/pull/30
[GitHub] [spark] Yikf commented on a diff in pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on code in PR #40064: URL: https://github.com/apache/spark/pull/40064#discussion_r1112588510
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala:
```
@@ -30,7 +30,7 @@ case class FileWriterFactory (
     description: WriteJobDescription,
     committer: FileCommitProtocol) extends DataWriterFactory {
-  private val jobId = SparkHadoopWriterUtils.createJobID(new Date, 0)
```
Review Comment: sounds good to me
[GitHub] [spark] cloud-fan commented on a diff in pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
cloud-fan commented on code in PR #40064: URL: https://github.com/apache/spark/pull/40064#discussion_r1112579465
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala:
```
@@ -30,7 +30,7 @@ case class FileWriterFactory (
     description: WriteJobDescription,
     committer: FileCommitProtocol) extends DataWriterFactory {
-  private val jobId = SparkHadoopWriterUtils.createJobID(new Date, 0)
```
Review Comment: shall we simply mark it as `@transient lazy val`?
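The `@transient lazy val` suggestion means the field is excluded from serialization and recomputed on first access after deserialization, so a non-serializable member no longer breaks shipping the factory to executors. A rough Java analogue of what the Scala compiler generates for that pattern (illustrative class and field names; this is not the actual `FileWriterFactory`):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientLazyDemo implements Serializable {
    // Non-serializable payloads must not travel with the object: mark the
    // field transient and rebuild it lazily on the receiving side.
    private transient StringBuilder cache; // StringBuilder is not Serializable

    StringBuilder cache() {
        if (cache == null) {
            cache = new StringBuilder("job_"); // lazy re-init after deserialization
        }
        return cache;
    }

    public static void main(String[] args) throws Exception {
        TransientLazyDemo f = new TransientLazyDemo();
        f.cache().append("123");

        // Round-trip through Java serialization, as happens when tasks are shipped.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(f);
        oos.flush();
        TransientLazyDemo copy = (TransientLazyDemo) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();

        System.out.println(copy.cache()); // prints "job_": rebuilt fresh, not carried over
    }
}
```

The trade-off, visible in the output, is that the value is recomputed per deserialized copy rather than shared, which is exactly what the review thread wants for a fresh job ID on each side.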
[GitHub] [spark] LuciferYang commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
LuciferYang commented on code in PR #40075: URL: https://github.com/apache/spark/pull/40075#discussion_r1112571085
## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
```
@@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession {
   }
 }

+  test("write v2") {
+    try {
+      spark.range(3).writeTo("myTableV2").using("parquet").create()
```
Review Comment: One problem is that I'm not sure why the test dependency `parquet-hadoop` needs to be added to `connect-client-jvm`
[GitHub] [spark] xinrong-meng commented on pull request #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals
xinrong-meng commented on PR #40076: URL: https://github.com/apache/spark/pull/40076#issuecomment-1437884897 Sorry for the late LGTM
[GitHub] [spark] Yikf commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
Yikf commented on PR #40064: URL: https://github.com/apache/spark/pull/40064#issuecomment-1437856558 @cloud-fan This is the error that Apache Kyuubi encountered when upgrading from Spark 3.3.1 to 3.3.2; see this [link](https://github.com/apache/kyuubi/actions/runs/4192366930/jobs/7268919556#step:6:2611) for the error stacktrace.
[GitHub] [spark] xinrong-meng commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests
xinrong-meng commented on PR #40095: URL: https://github.com/apache/spark/pull/40095#issuecomment-1437846634 Sorry, I may be missing context. Do we have an overview of the state of the branch-3.4 CI? I removed the RC tag just now. Let me know if there is something I can help with.
[GitHub] [spark] xinrong-meng commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4
xinrong-meng commented on PR #40079: URL: https://github.com/apache/spark/pull/40079#issuecomment-1437841242 Thanks @dongjoon-hyun!
[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4
dongjoon-hyun commented on PR #40079: URL: https://github.com/apache/spark/pull/40079#issuecomment-1437840964 IMO, we don't need to wait for this PR, @xinrong-meng.
[GitHub] [spark] cloud-fan commented on a diff in pull request #39062: [SPARK-41516] [SQL] Allow jdbc dialects to override the query used to create a table
cloud-fan commented on code in PR #39062: URL: https://github.com/apache/spark/pull/39062#discussion_r1112517669
## sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
```
@@ -135,6 +135,38 @@ abstract class JdbcDialect extends Serializable with Logging {
     s$colName
   }

+  /**
+   * Get the SQL that should be used to create a table. Dialects can
+   * override this method to return a SQL that works best in a particular database.
+   * To allow certain options to append when create a new table, which can be
+   * table_options or partition_options.
+   * E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8"
+   * @param tableName The name of the table.
+   * @param strSchema The scheme string corresponding to the dialect, the schemaString method of
+   *                  JdbcUtils has provided the function of calculating the scheme string
+   *                  for different databases
+   * @param createTableOptions The string of the options that to allow certain options to append
+   *                           when create a new table, which can be table_options
+   *                           or partition_options.
+   */
+  def createTable(
+      statement: Statement,
+      tableName: String,
+      strSchema: String,
+      options: JdbcOptionsInWrite): Unit = {
+    val createTableQuery = getCreateTableQuery(tableName, strSchema, options.createTableOptions)
```
Review Comment: `getAddColumnQuery` is used by `alterTable`. The idea is, `alterTable` has many parts and it's too much to override `alterTable` if only the ADD COLUMN syntax needs to be adjusted. Can we follow `createSchema` here?
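The pattern under discussion, a coarse operation delegating to a small overridable query builder, can be sketched generically. The class names below are hypothetical (this is not the Spark `JdbcDialect` API); the point is that a dialect overrides only the CREATE TABLE text while the surrounding execution logic stays in the base class.

```java
// Illustration only: hypothetical names, not the actual Spark JdbcDialect API.
class BaseDialect {
    // Small, overridable hook: dialects adjust just the CREATE TABLE text.
    String getCreateTableQuery(String table, String schema, String options) {
        return ("CREATE TABLE " + table + " (" + schema + ") " + options).trim();
    }

    // The coarse operation stays in the base class and reuses the hook.
    String createTable(String table, String schema, String options) {
        return getCreateTableQuery(table, schema, options); // would be run via a Statement
    }
}

class MySqlLikeDialect extends BaseDialect {
    @Override
    String getCreateTableQuery(String table, String schema, String options) {
        // MySQL-style table_options appended after the column list.
        String opts = options.isEmpty() ? "ENGINE=InnoDB DEFAULT CHARSET=utf8" : options;
        return "CREATE TABLE " + table + " (" + schema + ") " + opts;
    }
}

public class DialectDemo {
    public static void main(String[] args) {
        System.out.println(new MySqlLikeDialect().createTable("t", "name string", ""));
        // prints: CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8
    }
}
```

This mirrors the review suggestion: overriding the whole `createTable` (or `alterTable`) is heavy when only one piece of SQL syntax differs per database.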
[GitHub] [spark] xinrong-meng commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4
xinrong-meng commented on PR #40079: URL: https://github.com/apache/spark/pull/40079#issuecomment-1437840387 Since the branch-3.4 CI is down, I may have to re-create the tag later. Please let me know whether we should wait for this PR.
[GitHub] [spark] cloud-fan commented on pull request #40064: [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory
cloud-fan commented on PR #40064: URL: https://github.com/apache/spark/pull/40064#issuecomment-1437838332 @Yikf can you provide a test case, or at least the error stacktrace you hit in your environment?
[GitHub] [spark] ulysses-you opened a new pull request, #40098: [SPARK-42504][SQL] NestedColumnAliasing support pruning adjacent projects
ulysses-you opened a new pull request, #40098: URL: https://github.com/apache/spark/pull/40098

### What changes were proposed in this pull request?

This PR improves `NestedColumnAliasing` to support pruning adjacent project nodes.

### Why are the changes needed?

`CollapseProject` won't combine adjacent projects into one when, for example, a non-cheap expression is accessed more than once, as in the project below. Adjacent project nodes can then appear that `NestedColumnAliasing` does not support pruning. It's a common case with lateral column alias, which pushes down an extra project, e.g.

```sql
CREATE TABLE t (c struct) USING PARQUET;

SELECT c.c1, c.c1 + 1 as x, x + 1 FROM t;

-- before
*(1) Project [c#15.c1 AS c1#53, x#47, (x#47 + 1) AS (lateralAliasReference(x) + 1)#55]
+- *(1) Project [c#15, (c#15.c1 + 1) AS x#47]
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.t[c#15] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct>

-- after
*(1) Project [_extract_c1#38 AS c1#34, x#27, (x#27 + 1) AS (lateralAliasReference(x) + 1)#37]
+- *(1) Project [c#33.c1 AS _extract_c1#38, (c#33.c1 + 1) AS x#27]
   +- *(1) ColumnarToRow
      +- FileScan parquet spark_catalog.default.t[c#33] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[], PartitionFilters: [], PushedFilters: [], ReadSchema: struct>
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add new test
[GitHub] [spark] panbingkun commented on a diff in pull request #37588: [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2
panbingkun commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r993130 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ## @@ -2791,4 +2791,17 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase { "location" -> toSQLValue(location.toString, StringType), "identifier" -> toSQLId(tableId.nameParts))) } + + def showTableExtendedMultiPartitionUnsupportedError(tableName: String): Throwable = { Review Comment: I will refactor it later
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [WIP][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1112495582 ## connector/connect/client/jvm/pom.xml: ## @@ -62,6 +62,30 @@ + + org.apache.spark + spark-mllib-common_${scala.binary.version} + ${project.version} + provided Review Comment: yes, scala client will depend on both `spark-mllib-common` and `spark-mllib-local`; and the server will depend on `spark-mllib`
[GitHub] [spark] yabola commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader
yabola commented on PR #39950: URL: https://github.com/apache/spark/pull/39950#issuecomment-1437794387 @sunchao Thank you for your reply! Yes, I also noticed this; earlier I just kept the change minimal. In the original implementation: the first `footerFileMetaData` uses the `SKIP_ROW_GROUPS` option (`SkipMetadataFilter`, returning metadata without row groups); the second `footerFileMetaData` uses the `RangeMetadataFilter` (returning metadata with row-group info). The second `footerFileMetaData` actually contains all the information used in the first (the implementation difference can be seen in `ParquetMetadataConverter#readParquetMetadata`). So in the case where we need filter pushdown and `enableVectorizedReader` is on, we can create just one `ParquetFileReader` and read the parquet footer only once. Other cases can also be optimized to read the footer once. This needs some more code changes; do you think it is suitable?
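The read-once idea is independent of Parquet internals and can be sketched generically: perform the richer metadata read a single time and hand the cached result to both consumers. `FooterOnceReader` and `fake_read_footer` below are hypothetical stand-ins for illustration only, not Parquet APIs:

```python
class FooterOnceReader:
    """Cache an expensive footer read so both consumers share one parse."""

    def __init__(self, read_footer):
        self._read_footer = read_footer  # expensive I/O, e.g. parsing file metadata
        self._footer = None

    def footer(self):
        # Read once with the richer filter; it contains everything the
        # cheaper skip-row-groups read would have returned anyway.
        if self._footer is None:
            self._footer = self._read_footer()
        return self._footer


reads = []

def fake_read_footer():
    reads.append(1)  # count how many times the "file" is actually read
    return {"schema": "s", "row_groups": [0, 1]}

r = FooterOnceReader(fake_read_footer)
r.footer()["schema"]       # consumer 1: filter pushdown needs the schema
r.footer()["row_groups"]   # consumer 2: the vectorized reader needs row groups
print(len(reads))  # 1 -- the footer was read only once
```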
[GitHub] [spark] amaliujia commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
amaliujia commented on code in PR #40092: URL: https://github.com/apache/spark/pull/40092#discussion_r1112479064 ## python/docs/source/conf.py: ## @@ -89,6 +89,8 @@ .. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb .. |binder_ps| replace:: Live Notebook: pandas API on Spark .. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb +.. |binder_connect| replace:: Live Notebook: DataFrame with Spark Connect +.. _binder_connect: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb Review Comment: This link seems to be broken to me.
[GitHub] [spark] wangyum commented on pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases
wangyum commented on PR #40093: URL: https://github.com/apache/spark/pull/40093#issuecomment-1437782372 cc @cloud-fan
[GitHub] [spark] LuciferYang commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
LuciferYang commented on PR #40075: URL: https://github.com/apache/spark/pull/40075#issuecomment-1437772288 > > Move `SaveMode` to catalyst module is a break change, need add `ProblemFilters` to `MimaExcludes` > > I think we shouldn't make such breaking change? best to avoid
[GitHub] [spark] amaliujia commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
amaliujia commented on PR #40075: URL: https://github.com/apache/spark/pull/40075#issuecomment-1437771626 > Move `SaveMode` to catalyst module is a break change, need add `ProblemFilters` to `MimaExcludes` I think we shouldn't make such breaking change?
[GitHub] [spark] Yaohua628 commented on pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation
Yaohua628 commented on PR #40035: URL: https://github.com/apache/spark/pull/40035#issuecomment-1437769149 @cloud-fan @olaky Added more clarifications, let me know WDYT! Thanks
[GitHub] [spark] amaliujia commented on a diff in pull request #40097: [WIP][CONNECT][ML] Extract the common .ml classes to `mllib-common`
amaliujia commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1112464074 ## connector/connect/client/jvm/pom.xml: ## @@ -62,6 +62,30 @@ + + org.apache.spark + spark-mllib-common_${scala.binary.version} + ${project.version} + provided Review Comment: I am pretty new to pom here, so just trying to understand: does this mean that the scala client needs runtime dependencies (jars) on `spark-mllib-common` and `spark-mllib-local` in its class path?
[GitHub] [spark] wangyum commented on a diff in pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases
wangyum commented on code in PR #40093: URL: https://github.com/apache/spark/pull/40093#discussion_r1112460665 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -200,14 +200,20 @@ private def replaceConstants(condition: Expression, equalityPredicates: EqualityPredicates) : Expression = { -val constantsMap = AttributeMap(equalityPredicates.map(_._1)) -val predicates = equalityPredicates.map(_._2).toSet -def replaceConstants0(expression: Expression) = expression transform { +val allConstantsMap = AttributeMap(equalityPredicates.map(_._1)) +val allPredicates = equalityPredicates.map(_._2).toSet +def replaceConstants0( +expression: Expression, constantsMap: AttributeMap[Literal]) = expression transform { case a: AttributeReference => constantsMap.getOrElse(a, a) } condition transform { - case e @ EqualTo(_, _) if !predicates.contains(e) => replaceConstants0(e) - case e @ EqualNullSafe(_, _) if !predicates.contains(e) => replaceConstants0(e) + case b: BinaryComparison => +if (!allPredicates.contains(b)) { + replaceConstants0(b, allConstantsMap) +} else { + val excludedEqualityPredicates = equalityPredicates.filterNot(_._2.semanticEquals(b)) Review Comment: Exclude the current binary comparison to support the following case: ``` a = 1 and a = 2 ==> 2 = 1 and 1 = 2 ```
[GitHub] [spark] wangyum commented on a diff in pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases
wangyum commented on code in PR #40093: URL: https://github.com/apache/spark/pull/40093#discussion_r1112458407 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -200,14 +200,20 @@ private def replaceConstants(condition: Expression, equalityPredicates: EqualityPredicates) : Expression = { -val constantsMap = AttributeMap(equalityPredicates.map(_._1)) -val predicates = equalityPredicates.map(_._2).toSet -def replaceConstants0(expression: Expression) = expression transform { +val allConstantsMap = AttributeMap(equalityPredicates.map(_._1)) +val allPredicates = equalityPredicates.map(_._2).toSet +def replaceConstants0( +expression: Expression, constantsMap: AttributeMap[Literal]) = expression transform { case a: AttributeReference => constantsMap.getOrElse(a, a) } condition transform { - case e @ EqualTo(_, _) if !predicates.contains(e) => replaceConstants0(e) - case e @ EqualNullSafe(_, _) if !predicates.contains(e) => replaceConstants0(e) + case b: BinaryComparison => Review Comment: To support other binary comparisons. For example: `>`, `<`.
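A toy model of the proposed rule (illustrative Python, not the Catalyst implementation): every binary comparison is rewritten with constants taken from the *other* equality predicates in the conjunction, which reproduces both examples discussed above:

```python
def propagate(conjuncts):
    """Constant propagation over a conjunction of (attr, op, value) triples.

    Each comparison is rewritten using constants derived from the other
    equality predicates (the filterNot in the PR), so `a = 1 AND a = 2`
    still collapses into `2 = 1 AND 1 = 2`.
    """
    rewritten = []
    for i, (attr, op, val) in enumerate(conjuncts):
        # constants contributed by every *other* predicate of form x = <literal>
        usable = {a: c for j, (a, o, c) in enumerate(conjuncts)
                  if j != i and o == "=" and isinstance(c, int)}
        rewritten.append((usable.get(attr, attr), op, val))
    return rewritten


print(propagate([("a", "=", 1), ("a", ">", 0)]))
# [('a', '=', 1), (1, '>', 0)]   -- non-equality comparisons now benefit too
print(propagate([("a", "=", 1), ("a", "=", 2)]))
# [(2, '=', 1), (1, '=', 2)]     -- contradictory equalities become foldable
```

Once the constants are substituted, a later constant-folding pass can reduce `1 > 0` to true and `1 = 2` to false.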
[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
Ngone51 commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112457906 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -502,7 +536,10 @@ throw new IllegalStateException( s"Task $taskAttemptId called remove() on block $blockId without a write lock") } else { -blockInfoWrappers.remove(blockId) +invisibleRDDBlocks.synchronized { + blockInfoWrappers.remove(blockId) Review Comment: `blockInfoWrappers` doesn't need to be protected by `invisibleRDDBlocks.synchronized`, does it?
[GitHub] [spark] Ngone51 commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
Ngone51 commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112456693 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -399,7 +426,14 @@ try { val wrapper = new BlockInfoWrapper(newBlockInfo, lock) while (true) { -val previous = blockInfoWrappers.putIfAbsent(blockId, wrapper) +val previous = invisibleRDDBlocks.synchronized { + val res = blockInfoWrappers.putIfAbsent(blockId, wrapper) + if (res == null && trackingCacheVisibility) { +// Added to invisible blocks if it doesn't exist before. +blockId.asRDDId.foreach(invisibleRDDBlocks.add) Review Comment: I'd suggest updating `invisibleRDDBlocks` within `lockForWriting()` when the write lock is acquired.
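The underlying concern in both review threads is that when two structures (here `blockInfoWrappers` and `invisibleRDDBlocks`) must stay consistent, each compound update has to run under one lock. A minimal Python sketch of that shape, simplified to a single plain lock (the real code combines a `ConcurrentHashMap` with per-block locks):

```python
import threading

class BlockRegistry:
    """Two structures kept consistent under one lock (simplified sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.blocks = {}        # stands in for blockInfoWrappers
        self.invisible = set()  # stands in for invisibleRDDBlocks

    def put_if_absent(self, block_id, info):
        # Both updates happen atomically: no reader can ever observe a
        # registered block that is missing from the invisible set.
        with self._lock:
            if block_id in self.blocks:
                return self.blocks[block_id]
            self.blocks[block_id] = info
            self.invisible.add(block_id)
            return None

    def remove(self, block_id):
        with self._lock:
            self.blocks.pop(block_id, None)
            self.invisible.discard(block_id)


reg = BlockRegistry()
print(reg.put_if_absent("rdd_0_0", "info"))   # None (newly registered)
print(reg.put_if_absent("rdd_0_0", "other"))  # info (already present)
reg.remove("rdd_0_0")
print(reg.blocks, reg.invisible)              # {} set()
```

Whether the second structure is updated inside `putIfAbsent` or later under the write lock (as suggested) changes where the critical section lives, not the need for one.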
[GitHub] [spark] Yaohua628 commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation
Yaohua628 commented on code in PR #40035: URL: https://github.com/apache/spark/pull/40035#discussion_r1112452613 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute { val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = "__file_source_generated_metadata_col" + /** + * We keep generated metadata attributes nullability configurable here: + * 1. Before passing to readers, we create generated metadata attributes as nullable; + *Because, for row_index, the readers do not consider the column required. Review Comment: Thanks for the clarification @olaky! @cloud-fan I can further improve the doc here and add more clarifications in some other places
[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Good point, I also checked the original changes [1], it's for PySpark only. I will get this merged first (3.3.2 dockerfiles), and submit a new PR to support entrypoint.template to clean up these vars in all other versions. [1] https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44
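The `${PYSPARK_PYTHON+x}` test in the entrypoint distinguishes "unset" from "set but empty": the expansion yields `x` whenever the variable is defined, even with an empty value. A standalone demonstration of the idiom (not part of the entrypoint itself):

```shell
#!/bin/bash
check() {
  if ! [ -z "${PYSPARK_PYTHON+x}" ]; then
    echo "set"      # variable is defined, even if its value is empty
  else
    echo "unset"    # variable is not defined at all
  fi
}

unset PYSPARK_PYTHON
check                            # prints: unset
PYSPARK_PYTHON=""
check                            # prints: set (empty but defined)
PYSPARK_PYTHON=/usr/bin/python3
check                            # prints: set
```

A plain `[ -z "$PYSPARK_PYTHON" ]` would conflate the empty and unset cases, which is why the entrypoint exports the variable only when it was actually defined.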
[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Good point, I also checked the original changes [1], it's for PySpark only. I will get this merged first (3.2.2 dockerfiles). And submit a new PR to support entrypoint.template to cleanup these vars in all versions. [1] https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Good point, I also checked the original changes [1], it's for PySpark only. I will get this merged first (3.2.2 dockerfiles). And submit a new PR to support entrypoint.template to cleanup these vars [1] https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Good point, I also checked the original changes [1], it's for PySpark only. I will get this merged first (3.2.2 dockerfiles). And submit a new PR to support entrypoint.template [1] https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark-docker] Yikun commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
Yikun commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112448920 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Good point, I also checked the original changes [1]; it's for PySpark only. I will get this merged first (3.2.2 dockerfiles) and submit a new PR to support entrypoint.template [1] https://github.com/apache/spark/commit/a99a47ca1df689377dbfbf4dd7258f59aee2be44 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
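The `SPARK_JAVA_OPT_` collection pipeline quoted in the diff above can be exercised on its own. The option values below are made up for illustration; the point is that `sort -t_ -k4 -n` orders the variables numerically by their `_`-separated index (so `_10` sorts after `_2`), and the `sed` keeps only the value after the first `=`:

```shell
#!/bin/sh
# Hypothetical option values; entrypoint.sh reads whatever
# SPARK_JAVA_OPT_<n> variables the submission client injected.
export SPARK_JAVA_OPT_1="-Dspark.app.name=demo"
export SPARK_JAVA_OPT_2="-Xmx1g"
export SPARK_JAVA_OPT_10="-Dfoo=bar"

# Same pipeline as entrypoint.sh: select the SPARK_JAVA_OPT_* entries,
# sort numerically on the 4th '_'-separated field (the index), then
# strip everything up to the first '=' so only the option value remains.
env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g'
# -Dspark.app.name=demo
# -Xmx1g
# -Dfoo=bar
```

A plain lexicographic sort would put `_10` before `_2`; the numeric sort on the index field preserves the intended ordering of the JVM options.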
[GitHub] [spark] zhengruifeng commented on pull request #40094: [SPARK-41812][SPARK-41823][CONNECT][SQL][SCALA] Add PlanId to Scala Client
zhengruifeng commented on PR #40094: URL: https://github.com/apache/spark/pull/40094#issuecomment-1437738671 I think we should ignore the `plan_id` when comparing the plans, since it depends on the order of invocation and is not stable.
[GitHub] [spark] hvanhovell commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions
hvanhovell commented on PR #40089: URL: https://github.com/apache/spark/pull/40089#issuecomment-1437726276 I have updated the since tags. A couple of things I would like to highlight. The Scala client tries to be as compatible as possible (both source and binary) with the existing API. That is why this is different from an API built on top of the Scala API: that was new, whereas this is not; it should be, or at least aspires to be, the same API. Hence my argument that you could also use the original since tags. Anyway, I am fine with updating them, as I did in other PRs; I just missed it in this case. As for the removal of deprecated methods: again, we want to be as compatible as possible, so I want to give users access to them. We aspire that in the best case a user can migrate their code to Connect without any issues (preferably no recompilation). I don't think we should ever remove these methods; they are poorly named, that is all. They do not inhibit us in any way from evolving Spark, since they are just aliases. I have kept a few deprecated methods out of the current Connect Dataset, but I do intend to bring them back once we can support them.
[GitHub] [spark] zhengruifeng opened a new pull request, #40097: [WIP][CONNECT][ML] Extract the common classes to mllib-common
zhengruifeng opened a new pull request, #40097: URL: https://github.com/apache/spark/pull/40097 ### What changes were proposed in this pull request? Extract the common classes to mllib-common. ### Why are the changes needed? To support MLlib on top of Spark Connect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated tests
[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dongjoon-hyun commented on PR #40049: URL: https://github.com/apache/spark/pull/40049#issuecomment-1437705253 I closed my PR because the failure seems to start earlier than this commit.
[GitHub] [spark] dongjoon-hyun closed pull request #40096: [SPARK-XXX][SQL][TESTS] Reduce the degree of concurrency during ORC schema merge conflict tests
dongjoon-hyun closed pull request #40096: [SPARK-XXX][SQL][TESTS] Reduce the degree of concurrency during ORC schema merge conflict tests URL: https://github.com/apache/spark/pull/40096
[GitHub] [spark] dongjoon-hyun closed pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests
dongjoon-hyun closed pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests URL: https://github.com/apache/spark/pull/40095
[GitHub] [spark] dongjoon-hyun commented on pull request #40095: [SPARK-XXX][SQL][TESTS][3.4] Reduce the degree of concurrency during ORC schema merge conflict tests
dongjoon-hyun commented on PR #40095: URL: https://github.com/apache/spark/pull/40095#issuecomment-1437703777 Unfortunately, this does not seem to mitigate the issue.
[GitHub] [spark] HyukjinKwon commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4
HyukjinKwon commented on PR #40079: URL: https://github.com/apache/spark/pull/40079#issuecomment-1437697727 Let's probably not add it to branch-3.4 ...
[GitHub] [spark] dongjoon-hyun commented on pull request #40079: [SPARK-42486][BUILD] Upgrade `ZooKeeper` to 3.6.4
dongjoon-hyun commented on PR #40079: URL: https://github.com/apache/spark/pull/40079#issuecomment-1437668926 I'm not sure the new ZooKeeper 3.6.4 release is urgent or safe for branch-3.4. According to Maven Central, 3.6.4 is the least adopted (unverified) release among the recent versions. Do we have a test case to support that we need this? ![Screenshot 2023-02-20 at 3 30 59 PM](https://user-images.githubusercontent.com/9700541/220212850-0ae350c0-bf7d-4b9d-a844-7780bd6362e6.png)
[GitHub] [spark] dongjoon-hyun opened a new pull request, #40096: [SPARK-XXX][SQL][TESTS] Reduce the degree of concurrency during ORC schema merge conflict tests
dongjoon-hyun opened a new pull request, #40096: URL: https://github.com/apache/spark/pull/40096 ### What changes were proposed in this pull request? This PR aims to reduce the degree of concurrency during the ORC schema merge conflict tests. ### Why are the changes needed? #40049 seems to expose a few flaky tests in `branch-3.4` CI. Note that the `master` branch looks fine. - https://github.com/apache/spark/runs/11463120795 - https://github.com/apache/spark/runs/11463886897 - https://github.com/apache/spark/runs/11467827738 - https://github.com/apache/spark/runs/11471484144 - https://github.com/apache/spark/runs/11471507531 - https://github.com/apache/spark/runs/11474764316 ![Screenshot 2023-02-20 at 12 30 19 PM](https://user-images.githubusercontent.com/9700541/220193503-6d6ce2ce-3fd6-4b01-b91c-bc1ec1f41c03.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs.
[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dongjoon-hyun commented on PR #40049: URL: https://github.com/apache/spark/pull/40049#issuecomment-1437632660 Since this currently happens in GitHub Actions, I made a WIP PR for further investigation. If it's valid, I'll convert it to an official PR separately from this PR. - https://github.com/apache/spark/pull/40095
[GitHub] [spark] dongjoon-hyun opened a new pull request, #40095: WIP
dongjoon-hyun opened a new pull request, #40095: URL: https://github.com/apache/spark/pull/40095 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] dongjoon-hyun commented on pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dongjoon-hyun commented on PR #40049: URL: https://github.com/apache/spark/pull/40049#issuecomment-1437534098 Interestingly, it passed locally while GitHub Action jobs keep failing. ``` $ build/sbt "sql/testOnly *.OrcSourceV1Suite -- -z SPARK-11412" ... [info] All tests passed. [success] Total time: 23 s, completed Feb 20, 2023, 12:54:28 PM ```
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dongjoon-hyun commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1112312058 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java: ## @@ -159,23 +159,38 @@ default boolean tableExists(Identifier ident) { } } + /** + * Create a table in the catalog. + * + * This is deprecated. Please override Review Comment: Could you add a deprecation version explicitly?
[GitHub] [spark] grundprinzip opened a new pull request, #40094: [SPARK-41812][SPARK-41823][CONNECT][SQL][SCALA] Add PlanId to Scala Client
grundprinzip opened a new pull request, #40094: URL: https://github.com/apache/spark/pull/40094 ### What changes were proposed in this pull request? To support ambiguous join columns in the Scala client, this patch adds the unique plan ID to the Scala client and updates the generated test files accordingly. ### Why are the changes needed? Compatibility ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT
[GitHub] [spark] LucaCanali commented on pull request #39127: [SPARK-41585][YARN] The Spark exclude node functionality for YARN should work independently of dynamic allocation
LucaCanali commented on PR #39127: URL: https://github.com/apache/spark/pull/39127#issuecomment-1437434751 Thank you @tgravescs and @attilapiros for reviewing this. As for adding a test for this change, my first comment is that the existing tests already cover the YARN_EXCLUDE_NODES functionality. We could think of a specific test that covers the behavior when dynamic allocation is turned on and off; however, in the current test infrastructure for YarnAllocatorHealthTrackerSuite or YarnAllocatorSuite, that does not look (to me) like an easy task. Would you have any further clues?
[GitHub] [spark-docker] viirya commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
viirya commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112235433 ## 3.3.2/scala2.12-java11-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: ditto
[GitHub] [spark-docker] viirya commented on a diff in pull request #30: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
viirya commented on code in PR #30: URL: https://github.com/apache/spark-docker/pull/30#discussion_r1112235135 ## 3.3.2/scala2.12-java11-r-ubuntu/entrypoint.sh: ## @@ -0,0 +1,114 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Check whether there is a passwd entry for the container UID +myuid=$(id -u) +mygid=$(id -g) +# turn off -e for getent because it will return error code in anonymous uid case +set +e +uidentry=$(getent passwd $myuid) +set -e + +# If there is no passwd entry for the container UID, attempt to create one +if [ -z "$uidentry" ] ; then +if [ -w /etc/passwd ] ; then +echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd +else +echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID" +fi +fi + +if [ -z "$JAVA_HOME" ]; then + JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}') +fi + +SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*" +env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt +readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt + +if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then + 
SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH" +fi + +if ! [ -z ${PYSPARK_PYTHON+x} ]; then +export PYSPARK_PYTHON +fi +if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then +export PYSPARK_DRIVER_PYTHON +fi Review Comment: Do we need the PySpark vars for the R Dockerfile?
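For context on the lines flagged above, here is a minimal sketch of why the script tests `${VAR+x}` rather than `[ -z "$VAR" ]`: the former distinguishes an unset variable from one deliberately set to the empty string, so an explicitly emptied `PYSPARK_PYTHON` is still re-exported rather than silently dropped. The `is_set` helper is made up for the demo:

```shell
#!/bin/sh
# ${VAR+x} expands to "x" whenever VAR is set -- even to "" -- and to
# nothing when VAR is unset, so `! [ -z ${VAR+x} ]` means "VAR is set".
is_set() {
  if ! [ -z ${PYSPARK_PYTHON+x} ]; then echo "set"; else echo "unset"; fi
}

unset PYSPARK_PYTHON
is_set     # prints: unset

PYSPARK_PYTHON=""
is_set     # prints: set (an empty string still counts as set)

PYSPARK_PYTHON=python3
is_set     # prints: set
```

By contrast, `[ -z "$PYSPARK_PYTHON" ]` is true in both the unset and the empty-string cases, which is exactly the distinction the entrypoint wants to preserve.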
[GitHub] [spark] LuciferYang commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions
LuciferYang commented on PR #40089: URL: https://github.com/apache/spark/pull/40089#issuecomment-1437382771 +1. I agree with avoiding this propagation of deprecated methods.
[GitHub] [spark] dongjoon-hyun commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions
dongjoon-hyun commented on PR #40089: URL: https://github.com/apache/spark/pull/40089#issuecomment-1437378292 That's a good question. Maybe, what about avoiding this propagation of deprecated methods, @hvanhovell, @HyukjinKwon, @LuciferYang? Since we already deprecated these, it sounds more consistent with what Apache Spark has been claiming (not to use these APIs in new use cases).
[GitHub] [spark] LuciferYang commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions
LuciferYang commented on PR #40089: URL: https://github.com/apache/spark/pull/40089#issuecomment-1437370469 Personally, I think it is a little strange to change `@since`. For example, the following function is changed to `@since 3.4.0`, but has been `deprecated` since 2.1.0: https://github.com/apache/spark/blob/1688a8768fb34060548f8790e77f645027f65db2/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala#L221-L226
[GitHub] [spark] sunchao commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160
sunchao commented on PR #40091: URL: https://github.com/apache/spark/pull/40091#issuecomment-1437366108 Yes, it's in branch-3.4 as well
[GitHub] [spark] LuciferYang commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160
LuciferYang commented on PR #40091: URL: https://github.com/apache/spark/pull/40091#issuecomment-1437364337 @sunchao this also needs to be merged to branch-3.4 ...
[GitHub] [spark] sunchao commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160
sunchao commented on PR #40091: URL: https://github.com/apache/spark/pull/40091#issuecomment-1437363350 Thanks! merged to master/branch-3.3/branch-3.2
[GitHub] [spark] sunchao closed pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160
sunchao closed pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160 URL: https://github.com/apache/spark/pull/40091
[GitHub] [spark] dongjoon-hyun commented on pull request #40089: [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions
dongjoon-hyun commented on PR #40089: URL: https://github.com/apache/spark/pull/40089#issuecomment-1437356607 When we added new languages like Python/R, did we use the same versions as the Scala API, @hvanhovell? They were also wrappers on top of the Scala API. > I forgot to update it. TBH since these are verbatim copies of the existing API, I think there is a case to be made to use the original version instead of the version of when the client was built.
[GitHub] [spark] dongjoon-hyun commented on pull request #40091: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160
dongjoon-hyun commented on PR #40091: URL: https://github.com/apache/spark/pull/40091#issuecomment-1437353508 Thank you all. And feel free to land this on the release branches, @sunchao. cc @cloud-fan, @HyukjinKwon, @mridulm, @tgravescs
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40058: [SPARK-39859][SQL] Support v2 DESCRIBE TABLE EXTENDED for columns
dongjoon-hyun commented on code in PR #40058: URL: https://github.com/apache/spark/pull/40058#discussion_r1112191165 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/DescribeTableSuite.scala: ## @@ -149,13 +149,14 @@ class DescribeTableSuite extends command.DescribeTableSuiteBase } } - // TODO(SPARK-39859): Support v2 `DESCRIBE TABLE EXTENDED` for columns test("describe extended (formatted) a column") { Review Comment: +1
[GitHub] [spark] huaxingao commented on a diff in pull request #40058: [SPARK-39859][SQL] Support v2 DESCRIBE TABLE EXTENDED for columns
huaxingao commented on code in PR #40058: URL: https://github.com/apache/spark/pull/40058#discussion_r1112146233 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala: ## @@ -329,10 +329,17 @@ class DataSourceV2Strategy(session: SparkSession) extends Strategy with Predicat } DescribeTableExec(output, r.table, isExtended) :: Nil -case DescribeColumn(_: ResolvedTable, column, isExtended, output) => +case DescribeColumn(r: ResolvedTable, column, isExtended, output) => column match { case c: Attribute => - DescribeColumnExec(output, c, isExtended) :: Nil + val colStats = + r.table.asReadable.newScanBuilder(CaseInsensitiveStringMap.empty()).build() match { Review Comment: @cloud-fan Thanks for your comments. I will have a follow-up to fix this and also move the test to the parent suite.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation
cloud-fan commented on code in PR #40035: URL: https://github.com/apache/spark/pull/40035#discussion_r1112128392 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute { val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = "__file_source_generated_metadata_col" + /** + * We keep generated metadata attributes nullability configurable here: + * 1. Before passing to readers, we create generated metadata attributes as nullable; + *Because, for row_index, the readers do not consider the column required. Review Comment: Ah I see, the row_index can be either persisted in the file, or generated on the fly when reading. Do we mention it somewhere in the code comments?
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112123963 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging { */ private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, BlockInfoWrapper] + /** + * Record visible rdd blocks stored in the block manager, entries will be removed + * by [[removeBlock()]] + */ + private[spark] val visibleRDDBlocks = ConcurrentHashMap.newKeySet[RDDBlockId] Review Comment: Sounds good, will make the change. Thanks.
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112123512 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -77,6 +78,11 @@ class BlockManagerMasterEndpoint( // Mapping from block id to the set of block managers that have the block. private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]] + // Mapping from task id to the set of rdd blocks which are generated from the task. + private val tidToRddBlockIds = new mutable.HashMap[Long, mutable.HashSet[RDDBlockId]] + // Record the visible RDD blocks which have been generated at least from one successful task. Review Comment: Thanks, fixed the comments.
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112123108 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint( case StopBlockManagerMaster => context.reply(true) stop() + +case UpdateRDDBlockTaskInfo(blockId, taskId) => + // This is to report the information that a rdd block(with `blockId`) is computed + // and cached by task(with `taskId`). The happens right after the task finished + // computing/caching the block only when the block is not visible yet. And the rdd + // block will be marked as visible only when the corresponding task finished successfully. + context.reply(updateRDDBlockTaskInfo(blockId, taskId)) + +case GetRDDBlockVisibility(blockId) => + // Get the visibility status of a specific rdd block. + if (!trackingCacheVisibility) { Review Comment: Sure, updated. ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -210,6 +219,65 @@ class BlockManagerMasterEndpoint( case StopBlockManagerMaster => context.reply(true) stop() + +case UpdateRDDBlockTaskInfo(blockId, taskId) => + // This is to report the information that a rdd block(with `blockId`) is computed + // and cached by task(with `taskId`). The happens right after the task finished Review Comment: Thanks, updated.
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112122792 ## core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala: ## @@ -2266,6 +2270,150 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with PrivateMethodTe } } + test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache visibility statue") { +val store = makeBlockManager(8000, "executor1") +val blockId = RDDBlockId(rddId = 1, splitIndex = 1) +var computed: Boolean = false +val data = Seq(1, 2, 3) +val makeIterator = () => { + computed = true + data.iterator +} + +// Cache doesn't exist and is not visible. +assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId)) +val res1 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Put cache successfully and reported block task info. +assert(res1.isLeft && computed) +verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1) + +// Cache exists but not visible. +computed = false +assert(store.getStatus(blockId).nonEmpty && !store.isRDDBlockVisible(blockId)) +val res2 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Load cache successfully and reported block task info. +assert(res2.isLeft && computed) +verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1) + +// Cache exists and visible. +store.blockInfoManager.tryAddVisibleBlock(blockId) +computed = false +assert(store.getStatus(blockId).nonEmpty && store.isRDDBlockVisible(blockId)) +val res3 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Load cache successfully but not report block task info. +assert(res3.isLeft && !computed) +verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1) + } + + test("add block rdd visibility status") { Review Comment: Updated
[GitHub] [spark] LuciferYang commented on pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
LuciferYang commented on PR #40075: URL: https://github.com/apache/spark/pull/40075#issuecomment-1437225017 Moving `SaveMode` to the catalyst module is a breaking change; we need to add `ProblemFilters` entries to `MimaExcludes`.
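For context, Spark declares its MiMa binary-compatibility exclusions in `project/MimaExcludes.scala` using sbt-mima's `ProblemFilters` API. A hypothetical sketch of the kind of entry being suggested (the exact problem type and version list depend on how MiMa reports the moved class; this is not the actual PR change):

```scala
// Sketch of a MimaExcludes entry for a class moved between modules.
// MiMa sees the class disappear from the old artifact, so a
// MissingClassProblem filter suppresses the false positive.
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.SaveMode")
```

Such filters are appended to the exclusion `Seq` for the in-development Spark version so the MiMa check in CI passes.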
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1112113795 ## core/src/main/scala/org/apache/spark/storage/BlockManager.scala: ## @@ -1424,6 +1470,28 @@ private[spark] class BlockManager( blockStoreUpdater.save() } + // Check whether a rdd block is visible or not. + private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = { +// Cached blocks are always visible if the feature flag is disabled. +if (!trackingCacheVisibility) { + return true +} + +// If the rdd block visibility information not available in the block manager, +// asking master for the information. +if (blockInfoManager.isRDDBlockVisible(blockId)) { + return true +} + +if(master.isRDDBlockVisible(blockId)) { + // Cache the visibility status if block exists. + blockInfoManager.tryAddVisibleBlock(blockId) + true Review Comment: Yes, I think so. Even though the current executor doesn't have the cached block, we can still read the cache from a remote executor.
[GitHub] [spark] LuciferYang commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
LuciferYang commented on code in PR #40075: URL: https://github.com/apache/spark/pull/40075#discussion_r1112110671 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala: ## @@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession { } } + test("write v2") { +try { + spark.range(3).writeTo("myTableV2").using("parquet").create() Review Comment: hmm... when I add `parquet-hadoop` as a test dependency of the `connect-client-jvm` module, the test passes
[GitHub] [spark] nija-at closed pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries
nija-at closed pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries URL: https://github.com/apache/spark/pull/40066
[GitHub] [spark] nija-at commented on pull request #40066: [SPARK-42498] [CONNECT][PYTHON] Reduce spark connect service retries
nija-at commented on PR #40066: URL: https://github.com/apache/spark/pull/40066#issuecomment-1437189282 Discussed this with @grundprinzip. Apparently the retries are kept so long intentionally. The fix here needs to occur differently. Closing this PR for now.
[GitHub] [spark] olaky commented on a diff in pull request #40035: [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of `_metadata` generated columns nullability implementation
olaky commented on code in PR #40035: URL: https://github.com/apache/spark/pull/40035#discussion_r1112048368 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -578,6 +578,13 @@ object FileSourceGeneratedMetadataAttribute { val FILE_SOURCE_GENERATED_METADATA_COL_ATTR_KEY = "__file_source_generated_metadata_col" + /** + * We keep generated metadata attributes nullability configurable here: + * 1. Before passing to readers, we create generated metadata attributes as nullable; + *Because, for row_index, the readers do not consider the column required. Review Comment: It can be null in the process so to say (the column is generated with nulls which are then replaced), but it will not be null in the returned output
[GitHub] [spark] cloud-fan commented on pull request #39996: [SPARK-42423][SQL] Add metadata column file block start and length
cloud-fan commented on PR #39996: URL: https://github.com/apache/spark/pull/39996#issuecomment-1437118906 thanks, merging to master/3.4 (last step to replace builtin functions for file metadata with metadata columns)
[GitHub] [spark] cloud-fan closed pull request #39996: [SPARK-42423][SQL] Add metadata column file block start and length
cloud-fan closed pull request #39996: [SPARK-42423][SQL] Add metadata column file block start and length URL: https://github.com/apache/spark/pull/39996
[GitHub] [spark] HyukjinKwon closed pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference
HyukjinKwon closed pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference URL: https://github.com/apache/spark/pull/40067
[GitHub] [spark] HyukjinKwon commented on pull request #40067: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference
HyukjinKwon commented on PR #40067: URL: https://github.com/apache/spark/pull/40067#issuecomment-1437102805 Merged to master and branch-3.4.
[GitHub] [spark] HyukjinKwon commented on pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
HyukjinKwon commented on PR #40092: URL: https://github.com/apache/spark/pull/40092#issuecomment-1437101747 note for myself, I checked this PR by https://mybinder.org/v2/gh/itholic/spark.git/SPARK-42475?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
HyukjinKwon commented on code in PR #40092: URL: https://github.com/apache/spark/pull/40092#discussion_r1112016388 ## python/docs/source/getting_started/quickstart_connect.ipynb: ## @@ -0,0 +1,1118 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Quickstart: DataFrame with Spark Connect\n", +"\n", +"This is a short introduction and quickstart for the DataFrame with Spark Connect. A DataFrame with Spark Connect is virtually, conceptually identical to an existing [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame), so most of the examples from 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) can be reused directly.\n", +"\n", +"However, it does not yet support some key features such as [RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html?highlight=rdd#pyspark.RDD) and [SparkSession.conf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.conf.html#pyspark.sql.SparkSession.conf), so you need to consider it when using DataFrame with Spark Connect.\n", +"\n", +"This notebook shows the basic usages of the DataFrame with Spark Connect geared mainly for those new to Spark Connect, along with comments of which features is not supported compare to the existing DataFrame.\n", +"\n", +"There is also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html).\n", +"\n", +"PySpark applications start with initializing `SparkSession` which is the entry point of PySpark as below. In case of running it in PySpark shell via pyspark executable, the shell automatically creates the session in the variable spark for users." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ +"# Spark Connect uses SparkSession from `pyspark.sql.connect.session` instead of `pyspark.sql.SparkSession`.\n", +"from pyspark.sql.connect.session import SparkSession\n", +"\n", +"spark = SparkSession.builder.getOrCreate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"## DataFrame Creation\n", +"\n", +"A PySpark DataFrame with Spark Connect can be created via `pyspark.sql.connect.session.SparkSession.createDataFrame` typically by passing a list of lists, tuples, dictionaries and `pyspark.sql.Row`s, a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and an RDD consisting of such a list.\n", +"`pyspark.sql.connect.session.SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.\n", +"\n", +"Firstly, you can create a PySpark DataFrame from a list of rows" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"from datetime import datetime, date\n", +"import pandas as pd\n", +"from pyspark.sql import Row\n", +"\n", +"df = spark.createDataFrame([\n", +"Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),\n", +"Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),\n", +"Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))\n", +"])\n", +"df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"Create a PySpark DataFrame with an explicit schema." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"df = spark.createDataFrame([\n", +"(1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),\n", +"(2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),\n", +"(3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))\n", +"], schema='a long, b double, c string, d date, e timestamp')\n", +"df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"Create a PySpark Data
[GitHub] [spark] HyukjinKwon commented on pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
HyukjinKwon commented on PR #40092: URL: https://github.com/apache/spark/pull/40092#issuecomment-1437098718 This seems to duplicate a lot of the existing quickstart. Should we maybe just mention Spark Connect in the original quickstart, and have a separate quickstart for Connect? Ideally, we should demonstrate the idea of the client and server being separate.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40092: [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect
HyukjinKwon commented on code in PR #40092: URL: https://github.com/apache/spark/pull/40092#discussion_r1112013877 ## python/docs/source/getting_started/quickstart_connect.ipynb: ## @@ -0,0 +1,1118 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Quickstart: DataFrame with Spark Connect\n", +"\n", +"This is a short introduction and quickstart for the DataFrame with Spark Connect. A DataFrame with Spark Connect is virtually, conceptually identical to an existing [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame), so most of the examples from 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) can be reused directly.\n", +"\n", +"However, it does not yet support some key features such as [RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html?highlight=rdd#pyspark.RDD) and [SparkSession.conf](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.conf.html#pyspark.sql.SparkSession.conf), so you need to consider it when using DataFrame with Spark Connect.\n", +"\n", +"This notebook shows the basic usages of the DataFrame with Spark Connect geared mainly for those new to Spark Connect, along with comments of which features is not supported compare to the existing DataFrame.\n", +"\n", +"There is also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html).\n", +"\n", +"PySpark applications start with initializing `SparkSession` which is the entry point of PySpark as below. In case of running it in PySpark shell via pyspark executable, the shell automatically creates the session in the variable spark for users." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ +"# Spark Connect uses SparkSession from `pyspark.sql.connect.session` instead of `pyspark.sql.SparkSession`.\n", +"from pyspark.sql.connect.session import SparkSession\n", +"\n", +"spark = SparkSession.builder.getOrCreate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"## DataFrame Creation\n", +"\n", +"A PySpark DataFrame with Spark Connect can be created via `pyspark.sql.connect.session.SparkSession.createDataFrame` typically by passing a list of lists, tuples, dictionaries and `pyspark.sql.Row`s, a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and an RDD consisting of such a list.\n", +"`pyspark.sql.connect.session.SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the DataFrame. When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.\n", +"\n", +"Firstly, you can create a PySpark DataFrame from a list of rows" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"from datetime import datetime, date\n", +"import pandas as pd\n", +"from pyspark.sql import Row\n", +"\n", +"df = spark.createDataFrame([\n", +"Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),\n", +"Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),\n", +"Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))\n", +"])\n", +"df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"Create a PySpark DataFrame with an explicit schema." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"df = spark.createDataFrame([\n", +"(1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),\n", +"(2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),\n", +"(3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))\n", +"], schema='a long, b double, c string, d date, e timestamp')\n", +"df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"Create a PySpark Data