[GitHub] [spark] cloud-fan commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering
cloud-fan commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1308354718 thanks, merging to 3.3/3.2! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017553189

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment:
```
IDENTIFIER
    : (LETTER | DIGIT | '_')+
    ;
```
The passed-in view name might contain more than the grammar defined above allows.
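The concern in the review above — that a view name may not fit the quoted `IDENTIFIER` rule — can be illustrated with a small pure-Python sketch. The regex below is a hypothetical stand-in for the ANTLR rule (assuming `LETTER` means ASCII letters), not Spark's actual parser:

```python
import re

# Regex equivalent of the quoted ANTLR rule: IDENTIFIER : (LETTER | DIGIT | '_')+ ;
# Assumes LETTER = [A-Za-z]; the real grammar may be broader.
IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")

def is_plain_identifier(name: str) -> bool:
    """Return True if `name` matches the simple IDENTIFIER grammar."""
    return IDENTIFIER.match(name) is not None

print(is_plain_identifier("my_view1"))    # a plain name matches
print(is_plain_identifier("db.my_view"))  # a dotted name does not
print(is_plain_identifier("a view"))      # a name with a space does not
```

Names like `db.my_view` are why the handler runs the name through `parseTableIdentifier` rather than accepting it verbatim.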
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017551706

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment: This is from the Dataset API implementation, and I guess it was there to pre-validate that the view name is in a legal format. If there is a later place that validates it, we can skip this.
[GitHub] [spark] cloud-fan commented on pull request #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
cloud-fan commented on PR #38573: URL: https://github.com/apache/spark/pull/38573#issuecomment-1308348049 shall we add the python client API in this PR as well?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
cloud-fan commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017549065

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
```

Review Comment: do we need to parse it? It should be the plain view name.
[GitHub] [spark] cloud-fan closed pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL
cloud-fan closed pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL URL: https://github.com/apache/spark/pull/38475
[GitHub] [spark] cloud-fan commented on pull request #38475: [SPARK-40992][CONNECT] Support toDF(columnNames) in Connect DSL
cloud-fan commented on PR #38475: URL: https://github.com/apache/spark/pull/38475#issuecomment-1308344423 thanks, merging to master!
[GitHub] [spark] wangyum commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
wangyum commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017537591

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
+export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: Thanks for the fix. I can reproduce this issue through github action: https://github.com/wangyum/make-distribution.sh/actions/runs/3425964379/jobs/5707275344
[GitHub] [spark] HyukjinKwon closed pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon closed pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2 URL: https://github.com/apache/spark/pull/38570
[GitHub] [spark] HyukjinKwon commented on pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon commented on PR #38570: URL: https://github.com/apache/spark/pull/38570#issuecomment-1308338748 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon commented on PR #38570: URL: https://github.com/apache/spark/pull/38570#issuecomment-1308337531 Merged to master.
[GitHub] [spark] zhengchenyu commented on pull request #37949: [SPARK-40504][YARN] Make yarn appmaster load config from client
zhengchenyu commented on PR #37949: URL: https://github.com/apache/spark/pull/37949#issuecomment-1308320736 @dongjoon-hyun @srowen Can you please review this PR?
[GitHub] [spark] cloud-fan commented on pull request #38511: [SPARK-41017][SQL] Support column pruning with multiple nondeterministic Filters
cloud-fan commented on PR #38511: URL: https://github.com/apache/spark/pull/38511#issuecomment-1308318576 cc @viirya @sigmod @hvanhovell
[GitHub] [spark] amaliujia commented on pull request #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
amaliujia commented on PR #38573: URL: https://github.com/apache/spark/pull/38573#issuecomment-1308314938 R: @cloud-fan
[GitHub] [spark] LuciferYang commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017502280

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: I see, but there still seems to be a build failure.
[GitHub] [spark] LuciferYang commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308298746

> As you mentioned in the PR description, did you hit this only at master branch issue?

Let me double check this

> Which java version are you using now?

test on linux:
```
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (Zulu 8.66.0.15-CA-linux64) (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (Zulu 8.66.0.15-CA-linux64) (build 25.352-b08, mixed mode)
```
test on MacOS:
```
openjdk version "1.8.0_352"
OpenJDK Runtime Environment (Zulu 8.66.0.15-CA-linux64) (build 1.8.0_352-b08)
OpenJDK 64-Bit Server VM (Zulu 8.66.0.15-CA-linux64) (build 25.352-b08, mixed mode)
```
Let me do more tests with Java 11/17.
[GitHub] [spark] viirya commented on a diff in pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
viirya commented on code in PR #38577: URL: https://github.com/apache/spark/pull/38577#discussion_r1017499034

## dev/make-distribution.sh:

```
@@ -161,7 +161,7 @@ fi

 # Build uber fat JAR
 cd "$SPARK_HOME"

-export MAVEN_OPTS="${MAVEN_OPTS:--Xss128m -Xmx4g -Xmx4g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m}"
```

Review Comment: It was added by https://github.com/apache/spark/pull/37510. Seems it was also added because `make-distribution.sh` failed to build.
[GitHub] [spark] pan3793 commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
pan3793 commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308297255 I reproduced the issue w/ openjdk 1.8.0_332 macos-aarch64
[GitHub] [spark] dongjoon-hyun commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
dongjoon-hyun commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308293543

To @LuciferYang ,
- As you mentioned in the PR description, did you hit this issue only on the `master` branch?
- Which java version are you using now?
[GitHub] [spark] amaliujia commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
amaliujia commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017494208

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: In fact it is not... The CommandPlanner and the LogicalPlan planner are separated, but a Command's input is usually a logical plan. I am thinking of refactoring this later so this logic is better contained together.
[GitHub] [spark] dongjoon-hyun commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
dongjoon-hyun commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308290786 cc @viirya too
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38506: [SPARK-41010][CONNECT][PYTHON] Complete Support for Except and Intersect in Python client
zhengruifeng commented on code in PR #38506: URL: https://github.com/apache/spark/pull/38506#discussion_r1017492795

## python/pyspark/sql/connect/dataframe.py:

```
@@ -317,7 +319,83 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
         if other._plan is None:
             raise ValueError("Argument to UnionByName does not contain a valid plan.")
         return DataFrame.withPlan(
-            plan.UnionAll(self._plan, other._plan, allowMissingColumns), session=self._session
+            plan.SetOperation(
+                self._plan, other._plan, "union", is_all=True, by_name=allowMissingColumns
+            ),
+            session=self._session,
+        )
+
+    def exceptAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in this :class:`DataFrame` but
+        not in another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `EXCEPT ALL` in SQL.
+        As standard in SQL, this function resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
+
+        Parameters
+        ----------
+        other : :class:`DataFrame`
+            The other :class:`DataFrame` to compare to.
+
+        Returns
+        -------
+        :class:`DataFrame`
+        """
+        return DataFrame.withPlan(
+            plan.SetOperation(self._plan, other._plan, "except", is_all=True), session=self._session
+        )
+
+    def intersect(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows only in
+        both this :class:`DataFrame` and another :class:`DataFrame`.
+        Note that any duplicates are removed. To preserve duplicates
+        use :func:`intersectAll`.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        other : :class:`DataFrame`
+            Another :class:`DataFrame` that needs to be combined.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            Combined DataFrame.
+
+        Notes
+        -----
+        This is equivalent to `INTERSECT` in SQL.
+        """
+        return DataFrame.withPlan(
+            plan.SetOperation(self._plan, other._plan, "intersect", is_all=False),
+            session=self._session,
+        )
+
+    def intersectAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in both this :class:`DataFrame`
+        and another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `INTERSECT ALL` in SQL. As standard in SQL, this function
+        resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
```

Review Comment:
```suggestion
        .. versionadded:: 3.4.0
```

## python/pyspark/sql/connect/dataframe.py:

```
@@ -317,7 +319,83 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
+    def exceptAll(self, other: "DataFrame") -> "DataFrame":
+        """Return a new :class:`DataFrame` containing rows in this :class:`DataFrame` but
+        not in another :class:`DataFrame` while preserving duplicates.
+
+        This is equivalent to `EXCEPT ALL` in SQL.
+        As standard in SQL, this function resolves columns by position (not by name).
+
+        .. versionadded:: 2.4.0
```

Review Comment:
```suggestion
        .. versionadded:: 3.4.0
```
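The duplicate-preserving, positional semantics that these docstrings describe for `EXCEPT ALL` and `INTERSECT ALL` can be sketched in plain Python with `collections.Counter`. This illustrates the SQL multiset semantics only, not the Connect implementation:

```python
from collections import Counter

def except_all(left, right):
    """Multiset difference (SQL EXCEPT ALL): keep each row as many extra
    times as it appears in `left` beyond its count in `right`."""
    return list((Counter(left) - Counter(right)).elements())

def intersect_all(left, right):
    """Multiset intersection (SQL INTERSECT ALL): keep each row
    min(count in left, count in right) times."""
    return list((Counter(left) & Counter(right)).elements())

a = [("x",), ("x",), ("y",)]
b = [("x",), ("z",)]
print(except_all(a, b))     # one ("x",) cancels out: [('x',), ('y',)]
print(intersect_all(a, b))  # [('x',)]
```

Rows are modeled as tuples because the comparison is positional: two rows match only if every column matches in order, regardless of column names.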
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
zhengruifeng commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017490854

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: `this.transform()` or `transformRelation(createView.getInput)` ?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38566: [SPARK-41046][CONNECT] Support CreateView in Connect DSL
zhengruifeng commented on code in PR #38566: URL: https://github.com/apache/spark/pull/38566#discussion_r1017490854

## connector/connect/src/main/scala/org/apache/spark/sql/connect/command/SparkConnectCommandPlanner.scala:

```
@@ -79,6 +85,32 @@ class SparkConnectCommandPlanner(session: SparkSession, command: proto.Command)
     session.udf.registerPython(cf.getPartsList.asScala.head, udf)
   }

+  def handleCreateViewCommand(createView: proto.CreateDataFrameViewCommand): Unit = {
+    val viewType = if (createView.getIsGlobal) GlobalTempView else LocalTempView
+
+    val tableIdentifier =
+      try {
+        session.sessionState.sqlParser.parseTableIdentifier(createView.getName)
+      } catch {
+        case _: ParseException =>
+          throw QueryCompilationErrors.invalidViewNameError(createView.getName)
+      }
+
+    val plan = CreateViewCommand(
+      name = tableIdentifier,
+      userSpecifiedColumns = Nil,
+      comment = None,
+      properties = Map.empty,
+      originalText = None,
+      plan = new SparkConnectPlanner(createView.getInput, session).transform(),
```

Review Comment: `this.transform()` ?
[GitHub] [spark] cloud-fan commented on pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
cloud-fan commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308284185 thanks, merging to master!
[GitHub] [spark] cloud-fan closed pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
cloud-fan closed pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto URL: https://github.com/apache/spark/pull/38491
[GitHub] [spark] cloud-fan commented on a diff in pull request #38490: [SPARK-41009][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1070` to `LOCATION_ALREADY_EXISTS`
cloud-fan commented on code in PR #38490: URL: https://github.com/apache/spark/pull/38490#discussion_r101748

## core/src/main/resources/error/error-classes.json:

```
@@ -668,6 +668,23 @@
       }
     }
   },
+  "LOCATION_ALREADY_EXISTS" : {
+    "message" : [
+      "Cannot create the location because it already exists."
```

Review Comment: that's for a specific legacy behavior and will go away eventually. Can we give it a dedicated error class?
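For context, entries in `error-classes.json` map an error class name to a message template. A minimal pure-Python sketch of how such a registry is typically consumed is below; the `<location>` placeholder and `format_error` helper are hypothetical illustrations, not Spark's actual error framework:

```python
import json

# A miniature registry in the same shape as the JSON snippet above.
# The <location> placeholder is hypothetical, added for illustration.
ERROR_CLASSES = json.loads("""
{
  "LOCATION_ALREADY_EXISTS": {
    "message": ["Cannot create the location <location> because it already exists."]
  }
}
""")

def format_error(error_class: str, **params: str) -> str:
    """Join the template's message lines and substitute <name> placeholders."""
    template = " ".join(ERROR_CLASSES[error_class]["message"])
    for name, value in params.items():
        template = template.replace(f"<{name}>", value)
    return template

print(format_error("LOCATION_ALREADY_EXISTS", location="/tmp/warehouse/t"))
```

A dedicated error class, as requested in the review, gives callers a stable key like `LOCATION_ALREADY_EXISTS` to match on instead of a legacy template number.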
[GitHub] [spark] amaliujia commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
amaliujia commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017484449

## python/pyspark/sql/connect/dataframe.py:

```
@@ -139,11 +139,9 @@ def columns(self) -> List[str]:
         if self._plan is None:
             return []
         if "columns" not in self._cache and self._plan is not None:
-            pdd = self.limit(0).toPandas()
```

Review Comment: I would prefer not to depend on the underlying API when doing caching... E.g. what if someday the cache on the schema is gone but this API is not aware of it, etc.
[GitHub] [spark] amaliujia commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
amaliujia commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017484449

## python/pyspark/sql/connect/dataframe.py:

```
@@ -139,11 +139,9 @@ def columns(self) -> List[str]:
         if self._plan is None:
             return []
         if "columns" not in self._cache and self._plan is not None:
-            pdd = self.limit(0).toPandas()
```

Review Comment: I would prefer not to depend on the underlying API when doing caching... E.g. what if someday the cache on the schema is gone but this API is not aware of it, etc. Basically do not make assumptions :)
[GitHub] [spark] LuciferYang commented on a diff in pull request #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
LuciferYang commented on code in PR #38575: URL: https://github.com/apache/spark/pull/38575#discussion_r1017484528

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```
@@ -2330,7 +2330,7 @@ class DataFrameSuite extends QueryTest
             new File(uuid, "file3").getAbsolutePath,
             uuid).rdd
         }
-        assert(e.getMessage.startsWith("Path does not exist"))
+        assert(e.getErrorClass == "PATH_NOT_FOUND")
```

Review Comment: nit: Can the empty `finally` block in this case be cleaned up together in this PR?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38546: [SPARK-41036][CONNECT][PYTHON] `columns` API should use `schema` API to avoid data fetching
zhengruifeng commented on code in PR #38546: URL: https://github.com/apache/spark/pull/38546#discussion_r1017482919 ## python/pyspark/sql/connect/dataframe.py: ## @@ -139,11 +139,9 @@ def columns(self) -> List[str]: if self._plan is None: return [] if "columns" not in self._cache and self._plan is not None: -pdd = self.limit(0).toPandas() Review Comment: maybe we don't need to cache `columns`, just return `self.schema.names`?
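The suggestion above, deriving `columns` from the schema rather than keeping a separate cache, can be sketched without Spark. The `FakeSchema`/`FakeDataFrame` classes below are hypothetical stand-ins for the Connect DataFrame, not the actual pyspark classes:

```python
# Hypothetical stand-in illustrating the suggestion: derive `columns`
# from the schema on each access instead of keeping a second cache
# that can drift out of sync with it.
from typing import List


class FakeSchema:
    def __init__(self, names: List[str]) -> None:
        self.names = names


class FakeDataFrame:
    def __init__(self, schema: FakeSchema) -> None:
        self._schema = schema

    @property
    def schema(self) -> FakeSchema:
        # In Spark Connect this would be backed by an analyze RPC;
        # here it just returns the stored schema.
        return self._schema

    @property
    def columns(self) -> List[str]:
        # No separate `columns` cache: the schema is the single source of truth.
        return self.schema.names


df = FakeDataFrame(FakeSchema(["id", "name"]))
print(df.columns)  # ['id', 'name']
```

If the schema ever changes (or its cache is invalidated), `columns` stays consistent for free, which is the point of the review comment.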
[GitHub] [spark] LuciferYang commented on pull request #38091: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Fix flaky test case
LuciferYang commented on PR #38091: URL: https://github.com/apache/spark/pull/38091#issuecomment-1308279804 finally fixed, thanks @wankunde
[GitHub] [spark] zhengruifeng opened a new pull request, #38578: [SPARK-41064][CONNECT][PYTHON] Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`
zhengruifeng opened a new pull request, #38578: URL: https://github.com/apache/spark/pull/38578

### What changes were proposed in this pull request?
Implement `DataFrame.crosstab` and `DataFrame.stat.crosstab`.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, new API.

### How was this patch tested?
Added UT.
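`crosstab` computes a pair-wise frequency (contingency) table of two columns: one row per distinct value of the first column, one column per distinct value of the second. A plain-Python sketch of that computation (an illustration of the semantics, not the Connect implementation) looks like:

```python
# Toy contingency-table computation mirroring what DataFrame.crosstab
# returns: count each (col1, col2) pair, then lay the counts out as a grid.
from collections import Counter

rows = [("a", 1), ("a", 2), ("b", 1), ("a", 1)]

pair_counts = Counter(rows)
col1_values = sorted({r[0] for r in rows})
col2_values = sorted({r[1] for r in rows})

table = {
    v1: {v2: pair_counts[(v1, v2)] for v2 in col2_values}
    for v1 in col1_values
}
print(table)  # {'a': {1: 2, 2: 1}, 'b': {1: 1, 2: 0}}
```

Pairs that never co-occur get an explicit zero, matching the dense table Spark produces.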
[GitHub] [spark] LuciferYang commented on pull request #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang commented on PR #38577: URL: https://github.com/apache/spark/pull/38577#issuecomment-1308270119 cc @HyukjinKwon @wangyum @dongjoon-hyun @srowen @pan3793 @panbingkun Can you reproduce the issue?
[GitHub] [spark] LuciferYang opened a new pull request, #38577: [SPARK-41071][BUILD] Remove `MaxMetaspaceSize` option from `make-distribution.sh` to make it run successfully
LuciferYang opened a new pull request, #38577: URL: https://github.com/apache/spark/pull/38577

### What changes were proposed in this pull request?
Running

```
dev/make-distribution.sh --tgz -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
```

on the master branch fails as follows:

```
[ERROR] ## Exception when compiling 19 sources to /spark-source/connector/avro/target/scala-2.12/classes
java.lang.OutOfMemoryError: Metaspace
java.lang.ClassLoader.defineClass1(Native Method)
java.lang.ClassLoader.defineClass(ClassLoader.java:757)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
java.net.URLClassLoader.access$100(URLClassLoader.java:74)
java.net.URLClassLoader$1.run(URLClassLoader.java:369)
java.net.URLClassLoader$1.run(URLClassLoader.java:363)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:362)
java.lang.ClassLoader.loadClass(ClassLoader.java:419)
scala_maven.ScalaCompilerLoader.loadClass(ScalaCompilerLoader.java:44)
java.lang.ClassLoader.loadClass(ClassLoader.java:352)
scala.collection.immutable.Set$Set2.$plus(Set.scala:170)
scala.collection.immutable.Set$Set2.$plus(Set.scala:164)
scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:28)
scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:24)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:467)
scala.collection.TraversableLike$grouper$1$.apply(TraversableLike.scala:455)
scala.collection.mutable.HashMap$$anon$1.$anonfun$foreach$2(HashMap.scala:153)
scala.collection.mutable.HashMap$$anon$1$$Lambda$68804/294594059.apply(Unknown Source)
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:153)
scala.collection.TraversableLike.groupBy(TraversableLike.scala:524)
scala.collection.TraversableLike.groupBy$(TraversableLike.scala:454)
scala.collection.AbstractTraversable.groupBy(Traversable.scala:108)
scala.tools.nsc.settings.Warnings.$init$(Warnings.scala:91)
scala.tools.nsc.settings.MutableSettings.(MutableSettings.scala:28)
scala.tools.nsc.Settings.(Settings.scala:19)
scala.tools.nsc.Settings.(Settings.scala:20)
xsbt.CachedCompiler0.(CompilerBridge.scala:79)
```

So this PR removes the `MaxMetaspaceSize` option from `make-distribution.sh`, letting the Metaspace use more memory and allowing `make-distribution.sh` to run successfully.

### Why are the changes needed?
Make `make-distribution.sh` run successfully.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test:

```
dev/make-distribution.sh --tgz -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
```

- Before: java.lang.OutOfMemoryError: Metaspace
- After: runs successfully
[GitHub] [spark] panbingkun commented on pull request #38545: [MINOR][DOCS] Fix links in the sql-pyspark-pandas-with-arrow
panbingkun commented on PR #38545: URL: https://github.com/apache/spark/pull/38545#issuecomment-1308245910 cc @srowen
[GitHub] [spark] itholic opened a new pull request, #38576: [SPARK-41062][SQL] Rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`
itholic opened a new pull request, #38576: URL: https://github.com/apache/spark/pull/38576

### What changes were proposed in this pull request?
This PR proposes to rename `UNSUPPORTED_CORRELATED_REFERENCE` to `CORRELATED_REFERENCE`.

### Why are the changes needed?
The sub-error class name duplicates its main class, `UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY`. We should make all error class names clear and brief.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```
[GitHub] [spark] itholic commented on a diff in pull request #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
itholic commented on code in PR #38575: URL: https://github.com/apache/spark/pull/38575#discussion_r1017450594 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -2330,7 +2330,7 @@ class DataFrameSuite extends QueryTest new File(uuid, "file3").getAbsolutePath, uuid).rdd } - assert(e.getMessage.startsWith("Path does not exist")) + assert(e.getErrorClass == "PATH_NOT_FOUND") Review Comment: The current `checkError` does not seem to allow pattern matching for `parameter` checking. Let me just check the error class for now. Otherwise, it fails on different test envs: https://github.com/apache/spark/pull/38422#discussion_r1016116865
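The point being made here, that asserting on a stable error class is more portable than matching a message prefix, can be illustrated with a toy exception type (a hypothetical sketch, not Spark's `SparkThrowable` API):

```python
# Toy exception carrying a stable error class alongside an
# environment-dependent message, showing why asserting on the class
# survives test environments where the embedded path differs.
class AnalysisError(Exception):
    def __init__(self, error_class: str, message: str) -> None:
        super().__init__(message)
        self.error_class = error_class


def read_missing_path(path: str) -> None:
    # The message embeds an absolute path that differs per test machine.
    raise AnalysisError("PATH_NOT_FOUND", f"Path does not exist: {path}")


try:
    read_missing_path("/tmp/ci-worker-42/file3")
except AnalysisError as e:
    # Stable across environments, unlike the full message text:
    assert e.error_class == "PATH_NOT_FOUND"
```

Matching `e.getMessage.startsWith(...)` style checks break as soon as the rendered message changes, which is exactly what the error-class framework is meant to avoid.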
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017447511 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala: ## @@ -240,25 +240,30 @@ class FlatMapGroupsInPandasWithStateSuite extends StateStoreMetricsTest { .groupBy("key") .count() -testStream(result, Complete)( - AddData(inputData, "a"), - CheckNewAnswer(("a", 1)), - AddData(inputData, "a", "b"), - // mapGroups generates ("a", "2"), ("b", "1"); so increases counts of a and b by 1 - CheckNewAnswer(("a", 2), ("b", 1)), - StopStream, - StartStream(), - AddData(inputData, "a", "b"), - // mapGroups should remove state for "a" and generate ("a", "-1"), ("b", "2") ; - // so increment a and b by 1 - CheckNewAnswer(("a", 3), ("b", 2)), - StopStream, - StartStream(), - AddData(inputData, "a", "c"), - // mapGroups should recreate state for "a" and generate ("a", "1"), ("c", "1") ; - // so increment a and c by 1 - CheckNewAnswer(("a", 4), ("b", 2), ("c", 1)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default +val exp = intercept[AnalysisException] { Review Comment: This is the test I was mentioning, @alex-balikov. Just to confirm: in the above suite, sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala, there is a query which is flatMapGroupsWithState followed by an agg in Complete mode, and it also seems to have been supported before. But after this change it throws because there are two stateful ops in Complete mode. Should we also allow this case? I'm not sure, because I remember you mentioned flatMapGroupsWithState could change the event time, so we should disallow it... So is the test wrong?
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017447320 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -190,20 +190,25 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { .agg(sum("num")) .as[(String, Long)] -testStream(result, Update)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch(), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default Review Comment: I see. No problem at all! Just to confirm: in the above suite, sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsInPandasWithStateSuite.scala, there is a query which is flatMapGroupsWithState followed by an agg in Complete mode, and it also seems to have been supported before. But after this change it throws because there are two stateful ops in Complete mode. Should we also allow this case? I'm not sure, because I remember you mentioned flatMapGroupsWithState could change the event time, so we should disallow it... So is the test wrong?
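The check under discussion, rejecting certain combinations of multiple stateful operators per output mode, can be sketched as a plain function over a list of operator names. This is a deliberate simplification for illustration; Spark's actual analyzer rule distinguishes many more cases (e.g. dropDuplicates and inner equality joins do not delay output, as noted above):

```python
# Simplified sketch of an unsupported-operation check: more than one
# stateful operator is rejected when the output mode is Update or Complete.
# The operator names and the rule itself are illustrative, not Spark's code.
STATEFUL_OPS = {"aggregate", "dropDuplicates", "flatMapGroupsWithState", "join"}


def check_streaming_plan(operators, output_mode):
    stateful = [op for op in operators if op in STATEFUL_OPS]
    if len(stateful) > 1 and output_mode in ("Update", "Complete"):
        raise ValueError(
            f"Multiple stateful operators not supported in {output_mode} mode: "
            f"{stateful}"
        )


check_streaming_plan(["dropDuplicates", "aggregate"], "Append")  # passes
try:
    check_streaming_plan(["flatMapGroupsWithState", "aggregate"], "Complete")
except ValueError as e:
    print(e)
```

The debate in these comments is precisely about which rows of this decision table should be allowed, e.g. whether dropDuplicates followed by aggregation should keep working in Update/Complete mode.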
[GitHub] [spark] itholic opened a new pull request, #38575: [WIP][SPARK-40948][SQL][FOLLOWUP] Restore PATH_NOT_FOUND
itholic opened a new pull request, #38575: URL: https://github.com/apache/spark/pull/38575

### What changes were proposed in this pull request?
The original PR introducing the error class `PATH_NOT_FOUND` was reverted since it broke the tests in a different test environment. This PR proposes to restore it.

### Why are the changes needed?
Restoring the reverted changes with a proper fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The existing CI should pass.
[GitHub] [spark] AmplabJenkins commented on pull request #38535: [SPARK-41001] [CONNECT] Make `user_id` optional in SparkRemoteSession.
AmplabJenkins commented on PR #38535: URL: https://github.com/apache/spark/pull/38535#issuecomment-1308229533 Can one of the admins verify this patch?
[GitHub] [spark] gengliangwang commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308227478

> Or part of a larger set of changes

Yes, I created Jira https://issues.apache.org/jira/browse/SPARK-41053, and this one is just the beginning.

> Maintaining state at driver on disk backed store and copying that to dfs has a few things which impact it - particularly for larger applications.

Yes, but it is worth starting the effort. We can keep developing it until it is ready. I ran a benchmark with the Protobuf serializer, and the results are promising. For large applications, the current in-memory live UI can create memory pressure that affects driver stability as well.
[GitHub] [spark] gengliangwang commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308222974

> It seems to not match what is described in the pr description or add equivalent functionality which was reverted

I believe it covers the reverted PR https://github.com/apache/spark/pull/38542. The previous approach was to create an optional disk store for storing a subset of the `SQLExecutionUIData`. It also had two shortcomings:
* There are no evictions for the UI data on disk
* It requires extra memory for reading/writing LevelDB/RocksDB
[GitHub] [spark] WweiL commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
WweiL commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017435393 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala: ## @@ -188,17 +194,26 @@ class UnsupportedOperationsSuite extends SparkFunSuite with SQLHelper { expectedMsgs = Seq("Complete")) // FlatMapGroupsWithState(Update) in streaming with aggregation - for (outputMode <- Seq(Append, Update, Complete)) { + for (outputMode <- Seq(Update, Complete)) { assertNotSupportedInStreamingPlan( "flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + s"with aggregation in $outputMode mode", TestFlatMapGroupsWithState( null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Aggregate(Seq(attributeWithWatermark), aggExprs("c"), streamRelation)), outputMode = outputMode, - expectedMsgs = Seq("flatMapGroupsWithState in update mode", "with aggregation")) + expectedMsgs = Seq("Multiple stateful operators", "Update", "Complete")) } + assertNotSupportedInStreamingPlan( +"flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + + s"with aggregation in Append mode", +TestFlatMapGroupsWithState( + null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Review Comment: Sorry, I didn't really get it; I lack the context here. Do you mean flatMapGroupsWithState(Update) with aggregation and output mode Append doesn't make sense? I think there was previously a test on each output mode: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala#L191 The reason I separate (Update, Complete) from Append is that, with the change, (Update, Complete) are caught before the original check, so their error message changes. But in Append mode the query is still blocked by the original error message.
[GitHub] [spark] alex-balikov commented on a diff in pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.
alex-balikov commented on code in PR #38503: URL: https://github.com/apache/spark/pull/38503#discussion_r1017413562 ## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -190,20 +190,25 @@ class StreamingDeduplicationSuite extends StateStoreMetricsTest { .agg(sum("num")) .as[(String, Long)] -testStream(result, Update)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch(), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default Review Comment: hmm, the above scenario dropDuplicates -> aggregation was supported before. So I was wrong - dropDuplicates and also inner join with a timestamp equality condition can be followed by a stateful operator in any mode; these operators (dropDuplicates and inner equality join) do not delay the output records. I apologize for the randomization; I think the above scenario is important to continue supporting.
## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingDeduplicationSuite.scala: ## @@ -215,20 +220,25 @@ .agg(sum("num")) .as[(String, Long)] -testStream(result, Complete)( - AddData(inputData, "a" -> 1), - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)), - AddData(inputData, "a" -> 1), // Dropped - CheckLastBatch("a" -> 1L), - assertNumStateRows(total = Seq(1L, 1L), updated = Seq(0L, 0L)), - AddData(inputData, "a" -> 2), - CheckLastBatch("a" -> 3L), - assertNumStateRows(total = Seq(1L, 2L), updated = Seq(1L, 1L)), - AddData(inputData, "b" -> 1), - CheckLastBatch("a" -> 3L, "b" -> 1L), - assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L)) -) +// As of [SPARK-40940], multiple state operator with Complete mode is disabled by default +val exp = intercept[AnalysisException] { Review Comment: same here - the above scenario should work ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala: ## @@ -188,17 +194,26 @@ class UnsupportedOperationsSuite extends SparkFunSuite with SQLHelper { expectedMsgs = Seq("Complete")) // FlatMapGroupsWithState(Update) in streaming with aggregation - for (outputMode <- Seq(Append, Update, Complete)) { + for (outputMode <- Seq(Update, Complete)) { assertNotSupportedInStreamingPlan( "flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + s"with aggregation in $outputMode mode", TestFlatMapGroupsWithState( null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Aggregate(Seq(attributeWithWatermark), aggExprs("c"), streamRelation)), outputMode = outputMode, - expectedMsgs = Seq("flatMapGroupsWithState in update mode", "with aggregation")) + expectedMsgs = Seq("Multiple stateful operators", "Update", "Complete")) } + assertNotSupportedInStreamingPlan( +"flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation " + + s"with aggregation in Append mode", +TestFlatMapGroupsWithState( + null, att, att, Seq(att), Seq(att), att, null, Update, isMapGroupsWithState = false, null, Review Comment: I am not sure what the semantics of setting output mode Update on flatMapGroupsWithState are, but it does not match the Append output mode below.
[GitHub] [spark] 19Serhii99 opened a new pull request, #38574: [SPARK-41060] [K8S] Made the spark submitter generate new names for driver and executor config maps
19Serhii99 opened a new pull request, #38574: URL: https://github.com/apache/spark/pull/38574

### What changes were proposed in this pull request?
There is a problem with submitting Spark jobs to a K8s cluster: the library generates and reuses the same name for config maps (for drivers and executors). Ideally, two config maps should be created for each job: one for the driver and one for the executor. However, the library creates only one driver config map for all jobs (in some cases it generates only one executor map for all jobs). So, if I run 5 jobs, only one driver config map is generated and used for every job. During those runs we experience issues when deleting pods from the cluster: executor pods are endlessly created and immediately terminated, overloading cluster resources. This problem occurs because of the **KubernetesClientUtils** class, in which **configMapNameExecutor** and **configMapNameDriver** are constants. This seems incorrect and should be fixed urgently. I've prepared some changes for review to fix the issue (tested in the cluster of our project).

### Why are the changes needed?
To make the Spark submitter generate new names for driver and executor config maps so jobs complete successfully without overloading cluster resources.

### Does this PR introduce _any_ user-facing change?
Executor pods should stop being generated and terminated endlessly.

### How was this patch tested?
Modified the unit tests in the module. Tested with the unit tests and by submitting jobs to the K8s cluster of our project.
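The fix described above amounts to deriving the config map name from a per-job prefix plus a fresh suffix, instead of reusing a shared constant. A hypothetical sketch (the function name, prefix, and UUID suffix are illustrative; the real change lives in Scala in `KubernetesClientUtils`):

```python
# Hypothetical sketch of per-job config map naming: a unique suffix
# (here a short UUID) replaces the former shared constant, so two
# submissions never collide on the same Kubernetes ConfigMap name.
import uuid


def driver_config_map_name(resource_name_prefix: str) -> str:
    # resource_name_prefix is already unique-ish per submission in Spark on K8s;
    # the extra suffix guards against reuse across submissions.
    return f"{resource_name_prefix}-driver-conf-map-{uuid.uuid4().hex[:8]}"


name_a = driver_config_map_name("spark-pi")
name_b = driver_config_map_name("spark-pi")
print(name_a, name_b)  # two distinct names for two submissions
```

With a constant name, a second submission silently mounts (or overwrites) the first job's config map, which matches the pod-churn symptom described in the PR.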
[GitHub] [spark] mridulm commented on pull request #38064: [SPARK-40622][SQL][CORE]Result of a single task in collect() must fit in 2GB
mridulm commented on PR #38064: URL: https://github.com/apache/spark/pull/38064#issuecomment-1308159592 The test failures appear to be due to the memory requirements of the test 'collect data with single partition larger than 2GB bytes array limit' being too high, causing OOM.
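The 2 GB ceiling referenced in the PR title comes from JVM byte arrays being indexed by a signed 32-bit `int`, so a single `byte[]` tops out at `Integer.MAX_VALUE` bytes. A quick sketch of the arithmetic and a chunk-count helper (illustrative only, not Spark's implementation):

```python
# JVM arrays are indexed by a signed 32-bit int, so a single byte[]
# holds at most 2**31 - 1 bytes (just under 2 GiB). A serialized task
# result must fit under this limit or be split into multiple chunks.
MAX_JVM_ARRAY_BYTES = 2**31 - 1  # Integer.MAX_VALUE

def num_chunks(result_size_bytes: int, chunk_size: int = MAX_JVM_ARRAY_BYTES) -> int:
    # Ceiling division: how many byte arrays are needed to hold the result.
    return -(-result_size_bytes // chunk_size)

print(num_chunks(2**31 - 1))  # 1: exactly at the limit
print(num_chunks(2**31))      # 2: one byte over needs a second chunk
```

This is also why a test that materializes a >2 GB single-partition result is inherently memory-hungry: the driver must hold the whole result (plus serialization overhead) in memory at once.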
[GitHub] [spark] amaliujia opened a new pull request, #38573: [SPARK-41061][CONNECT] Support SelectExpr which applies Projection by expressions in Strings in Connect DSL
amaliujia opened a new pull request, #38573: URL: https://github.com/apache/spark/pull/38573

### What changes were proposed in this pull request?
1. Support `def selectExpr(exprs: String*)` in the Connect DSL.
2. The server side supports translating expressions given as strings.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT
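`selectExpr` takes projection expressions as strings and parses them server-side. A toy version over list-of-dict rows, supporting only `col` and `col as alias` (far simpler than Spark's SQL expression parser), conveys the shape of the API:

```python
# Toy selectExpr: each string is either "name" or "name as alias".
# Spark's real implementation parses full SQL expressions; this sketch
# only illustrates string-driven projection.
def select_expr(rows, *exprs):
    projections = []
    for e in exprs:
        parts = e.split(" as ")
        source = parts[0].strip()
        alias = parts[1].strip() if len(parts) == 2 else source
        projections.append((source, alias))
    return [{alias: row[src] for src, alias in projections} for row in rows]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(select_expr(rows, "id as key", "name"))
# [{'key': 1, 'name': 'a'}, {'key': 2, 'name': 'b'}]
```

The appeal of the string form is ergonomic: the client ships opaque strings and the server's parser gives them meaning, so the client DSL stays thin.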
[GitHub] [spark] itholic opened a new pull request, #38572: [SPARK-41059][SQL] Rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`
itholic opened a new pull request, #38572: URL: https://github.com/apache/spark/pull/38572

### What changes were proposed in this pull request?
This PR proposes to rename `_LEGACY_ERROR_TEMP_2420` to `NESTED_AGGREGATE_FUNCTION`.

### Why are the changes needed?
We should rename the LEGACY errors properly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```
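A nested aggregate function, the condition the renamed error class describes, is an aggregate applied to another aggregate's result, e.g. `SUM(AVG(x))` in SQL, which analyzers reject. A toy recursive check over a tuple-based expression tree (not Spark's Catalyst representation) shows the idea:

```python
# Expressions are nested tuples: ("sum", ("avg", "x")) means SUM(AVG(x)).
# A nested aggregate is an aggregate call appearing inside another one.
AGGREGATES = {"sum", "avg", "min", "max", "count"}


def has_nested_aggregate(expr, inside_aggregate=False):
    if not isinstance(expr, tuple):
        return False  # a bare column reference
    func, *args = expr
    if func in AGGREGATES and inside_aggregate:
        return True
    nested = inside_aggregate or func in AGGREGATES
    return any(has_nested_aggregate(a, nested) for a in args)


print(has_nested_aggregate(("sum", ("avg", "x"))))  # True: would raise the error
print(has_nested_aggregate(("sum", "x")))           # False: plain aggregate is fine
```

Naming the class `NESTED_AGGREGATE_FUNCTION` makes the failing condition self-describing, which is the goal of retiring the `_LEGACY_ERROR_TEMP_*` placeholders.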
[GitHub] [spark] ulysses-you commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
ulysses-you commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017369252 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec( override def output: Seq[Attribute] = inputPlan.output + // Try our best to give a stable output partitioning and ordering. Review Comment: Unfortunately we hit this: in my experience, users often cache an arbitrary df and use the cached df to build another arbitrary df. So why can't we preserve the partitioning/ordering of the cached plan? If you really feel this is inconsistent in AdaptiveSparkPlanExec, we can probably move it to InMemoryRelationExec. My original idea was to do both, but that felt like a bit of overkill (requiredOrdering should be inferred separately, like https://github.com/apache/spark/pull/35924):

```scala
requiredDistribution.map(_.createPartitioning(conf.shufflePartitions)).getOrElse {
  if (isFinalPlan) {
    executedPlan.outputPartitioning
  } else {
    super.outputPartitioning
  }
}
```

A useful distribution before caching is rare in production, since `repartition(col)` will introduce skew.
[GitHub] [spark] srielau commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
srielau commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017369097 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: I like it that they are distinct. GROUP_BY_POS_... might be fat-fingering; GROUP_BY_AGGREGATE is: you don't know how GROUP BY works... Have we reviewed GROUP_BY_POS_REFERS_AGG_EXPR? That's not very pretty.
[GitHub] [spark] LuciferYang commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
LuciferYang commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308146770 Thanks @srowen
[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #34815: [SPARK-37555][SQL] spark-sql should pass last unclosed comment to backend
AngersZh commented on code in PR #34815: URL: https://github.com/apache/spark/pull/34815#discussion_r1017365581 ## sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala: ## @@ -620,4 +620,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { |""".stripMargin -> "SELECT 1" ) } + + test("SPARK-37555: spark-sql should pass last unclosed comment to backend") { Review Comment: > @AngersZh shall we increase the timeout? The github action machines are not stable. Yea, the logic should be ok... I added the PR below, so we can trigger it many times to see if it's stable: https://github.com/apache/spark/pull/38571
[GitHub] [spark] srowen commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
srowen commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308144543 Merged to master
[GitHub] [spark] srowen closed pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
srowen closed pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13 URL: https://github.com/apache/spark/pull/38550
[GitHub] [spark] AngersZhuuuu opened a new pull request, #38571: [SPARK-37555][TEST][FOLLOWUP] Increase timeout of CLI test `spark-sql should pass last unclosed comment to backend`
AngersZh opened a new pull request, #38571: URL: https://github.com/apache/spark/pull/38571 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017361159 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic buildKeys: Seq[Attribute], pruningKeys: Seq[Attribute]): Expression = { -val buildQuery = Project(buildKeys, matchingRowsPlan) -val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) => - DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false) -} -dynamicPruningSubqueries.reduce(And) +val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan) Review Comment: My rationale is, what we really need is a subquery here. This is completely different from dynamic partition pruning. One limitation is DS v2 runtime filter pushdown only applies to `DynamicPruningExpression`. We can probably fix that and accept normal non-correlated subqueries as well.
[GitHub] [spark] amaliujia commented on pull request #38491: [SPARK-41058][CONNECT] Remove unused import in commands.proto
amaliujia commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308133299 LGTM
[GitHub] [spark] HyukjinKwon opened a new pull request, #38570: [SPARK-41056][R] Fix new R_LIBS_SITE behavior introduced in R 4.2
HyukjinKwon opened a new pull request, #38570: URL: https://github.com/apache/spark/pull/38570 ### What changes were proposed in this pull request? This PR proposes to keep `R_LIBS_SITE` as it was; its behavior changed in R 4.2. ### Why are the changes needed? To keep the behaviour the same as before. This especially affects external libraries installed on SparkR worker sides, and can break user-installed libraries. See the paths below:

**R 4.2**
```
# R
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library/:/usr/lib/R/site-library:/usr/lib/R/library'"

# R --vanilla
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/lib/R/site-library"
```

**R 4.1**
```
# R
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library"

# R --vanilla
> Sys.getenv("R_LIBS_SITE")
[1] "/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library"
```

### Does this PR introduce _any_ user-facing change? Yes. With R 4.2, user-installed libraries won't be found in SparkR workers. ### How was this patch tested? Manually tested; a unit test was added. It's difficult to add an e2e test. I also manually tested `getROptions` in the Scala shell.
[GitHub] [spark] itholic commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017351734 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: I'm not quite sure, since `GROUP_BY_POS_REFERS_AGG_EXPR` requires ``, but `GROUP_BY_AGGREGATE` only cares whether the aggregate function is used in the GROUP BY clause or not. Maybe we can manually pass the missing value for `` in `GROUP_BY_AGGREGATE`, but I'm not sure how much that makes sense. WDYT, @srielau @MaxGekk ?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
aokolnychyi commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017352098 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic buildKeys: Seq[Attribute], pruningKeys: Seq[Attribute]): Expression = { -val buildQuery = Project(buildKeys, matchingRowsPlan) -val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) => - DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false) -} -dynamicPruningSubqueries.reduce(And) +val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan) Review Comment: Got it. I was originally worried we could miss some future optimizations given that dynamic pruning for row-level operations would go through a different route compared to the normal DPP. One alternative could be to extend `DynamicPruningSubquery` with a flag whether it should be optimized or not. Up to you, though.
[GitHub] [spark] LuciferYang commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
LuciferYang commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017347010 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: Could we re-use `GROUP_BY_POS_REFERS_AGG_EXPR`?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
LuciferYang commented on code in PR #38569: URL: https://github.com/apache/spark/pull/38569#discussion_r1017346760 ## core/src/main/resources/error/error-classes.json: ## @@ -469,6 +469,11 @@ "Grouping sets size cannot be greater than " ] }, + "GROUP_BY_AGGREGATE" : { Review Comment: looks similar to `GROUP_BY_POS_REFERS_AGG_EXPR` https://github.com/apache/spark/blob/0add57a1c0290a158666027afb3e035728d2dcee/core/src/main/resources/error/error-classes.json#L478-L483
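To make the comparison concrete, here is a hedged sketch of how the two entries under discussion might sit side by side in `error-classes.json`; the message texts are illustrative placeholders based on this thread, not the actual file contents:

```json
{
  "GROUP_BY_AGGREGATE" : {
    "message" : [
      "Aggregate functions are not allowed in GROUP BY, but found <sqlExpr>."
    ]
  },
  "GROUP_BY_POS_REFERS_AGG_EXPR" : {
    "message" : [
      "GROUP BY <index> refers to an expression that contains an aggregate function. Aggregate functions are not allowed in GROUP BY."
    ]
  }
}
```

The `<index>` parameter is exactly what a plain `GROUP_BY_AGGREGATE` error has no value for, which is the reuse concern itholic raises above.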
[GitHub] [spark] amaliujia commented on pull request #38491: [MINOR][CONNECT] Remove unused import in commands.proto
amaliujia commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308120960 @dengziming I missed this PR because there was no JIRA created under https://issues.apache.org/jira/browse/SPARK-39375 (I was monitoring the work happening there). Since you are doing this change, can you create a JIRA for removing unused code (then update the PR title to include the JIRA) and, while at it, remove some other unused code in the connect project? I see one example at https://github.com/apache/spark/blob/0add57a1c0290a158666027afb3e035728d2dcee/connector/connect/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala#L61. There are probably other unused imports that we can further clean up.
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308118224 Also note that we cannot avoid parsing event files at the history server with a db generated at the driver, unless the configs match for both (retained stages, tasks, queries, etc.), particularly as these can be specified by users (for example, to aggressively clean up for perf-sensitive apps).
[GitHub] [spark] cloud-fan commented on pull request #38491: [MINOR][CONNECT] Remove unused import in commands.proto
cloud-fan commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1308116015 @amaliujia can you take a look?
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308113372 To comment on the proposal in the description, based on past prototypes I have worked on/seen: maintaining state at the driver in a disk-backed store and copying that to DFS has a few issues that impact it, particularly for larger applications. It is not very robust to application crashes, interacts in nontrivial ways with the shutdown hook (HDFS failures), and increases application termination time during graceful shutdown. Depending on application characteristics, a disk-backed store can positively or negatively impact driver performance (positively, as updates are faster due to indexing, which the in-memory store lacked (when I added an index, memory requirements increased :-( ); negatively, due to increased disk activity): it was difficult to predict.
[GitHub] [spark] itholic commented on pull request #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic commented on PR #38569: URL: https://github.com/apache/spark/pull/38569#issuecomment-1308111274 cc @MaxGekk @srielau FYI
[GitHub] [spark] itholic opened a new pull request, #38569: [SPARK-41055][SQL] Rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`
itholic opened a new pull request, #38569: URL: https://github.com/apache/spark/pull/38569 ### What changes were proposed in this pull request? This PR proposes to rename `_LEGACY_ERROR_TEMP_2424` to `GROUP_BY_AGGREGATE`. ### Why are the changes needed? To use a proper error class name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` ./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*" ```
[GitHub] [spark] mridulm commented on pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on PR #38567: URL: https://github.com/apache/spark/pull/38567#issuecomment-1308103370 This PR mostly adds the ability to use a disk-backed store in addition to the in-memory store. It does not seem to match what is described in the PR description: is this WIP?
[GitHub] [spark] LuciferYang commented on pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on PR #38507: URL: https://github.com/apache/spark/pull/38507#issuecomment-130810 GA passed. @MaxGekk, could you help review this PR again? Thanks.
[GitHub] [spark] LuciferYang commented on pull request #38550: [SPARK-41039][BUILD] Upgrade `scala-parallel-collections` to 1.0.4 for Scala 2.13
LuciferYang commented on PR #38550: URL: https://github.com/apache/spark/pull/38550#issuecomment-1308100945 Maven-tested all UTs with Scala 2.13 and this PR; they passed.
[GitHub] [spark] maryannxue commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
maryannxue commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017328200 ## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala: ## @@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec( override def output: Seq[Attribute] = inputPlan.output + // Try our best to give a stable output partitioning and ordering. Review Comment: This would be of super limited use... and cause inconsistency. I'd only return the output partitioning if there is a user repartition op at the end. In other words, only if the AQE plan is required to preserve user-specified partitioning.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
cloud-fan commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1017320620

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:

```scala
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.commons.codec.digest.DigestUtils
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(str) - Returns hash of the given value. The hash is consistent and can be used to join masked values together across tables",
```

Review Comment: Since we already have `sha2`, this PR is just adding an alias. We need to prove that this alias is common in the industry. Except for Hive, do other systems/products use these names?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
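For context on the alias discussion above: a `mask_hash`-style function is only useful for masking because the underlying SHA-2 digest is deterministic, so equal inputs always produce equal tokens and masked columns remain joinable across tables. A minimal sketch of that property using Python's standard `hashlib` (an illustration of the semantics only, not Spark's implementation; the dictionaries are made-up sample data):

```python
import hashlib

def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest, analogous to what sha2(col, 256) returns for a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# The digest is deterministic: masking the same key in two tables yields the
# same token, so an equi-join on the masked column still matches correctly.
left = {sha256_hex(k): v for k, v in {"alice": 1, "bob": 2}.items()}
right = {sha256_hex(k): v for k, v in {"alice": 10, "carol": 30}.items()}
joined = {h: (left[h], right[h]) for h in left.keys() & right.keys()}  # only "alice" matches
```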
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017318193

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:

```scala
@@ -320,6 +320,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
   }
   def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressionsWithPruning(
     _.containsPattern(PLAN_EXPRESSION), ruleId) {
+    // Do not optimize DPP subquery, as it was created from optimized plan and we should not
+    // optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
+    case d: DynamicPruningSubquery => d
```

Review Comment: Yes, because this PR adds `OptimizeSubqueries` to the batch `PartitionPruning` and we should not break https://github.com/apache/spark/pull/33664
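The shape of the fix discussed above is a tree-transform rule that returns already-optimized subtrees unchanged instead of descending into them. A hedged, Spark-free sketch of that pattern (all names here are illustrative, not Spark's actual classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    already_optimized: bool = False  # stands in for "is a DynamicPruningSubquery"
    children: tuple = ()

def optimize(node: Node, rule) -> Node:
    # Mirror of `case d: DynamicPruningSubquery => d`: return the subtree as-is,
    # skipping both the rule and the recursion into its children.
    if node.already_optimized:
        return node
    node = rule(node)
    return Node(node.name, node.already_optimized,
                tuple(optimize(c, rule) for c in node.children))

# Hypothetical usage: a rule that uppercases every node name. The marked
# subtree (and everything under it) must come back untouched.
tree = Node("root", children=(
    Node("dpp", already_optimized=True, children=(Node("inner"),)),
    Node("plain"),
))
out = optimize(tree, lambda n: Node(n.name.upper(), n.already_optimized, n.children))
```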
[GitHub] [spark] bersprockets commented on pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
bersprockets commented on PR #38565: URL: https://github.com/apache/spark/pull/38565#issuecomment-1308083533 @HyukjinKwon Thanks!
[GitHub] [spark] cloud-fan commented on a diff in pull request #34815: [SPARK-37555][SQL] spark-sql should pass last unclosed comment to backend
cloud-fan commented on code in PR #34815: URL: https://github.com/apache/spark/pull/34815#discussion_r1017317387

## sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala:

```scala
@@ -620,4 +620,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
         |""".stripMargin -> "SELECT 1"
     )
   }
+
+  test("SPARK-37555: spark-sql should pass last unclosed comment to backend") {
```

Review Comment: @AngersZh shall we increase the timeout? The GitHub Actions machines are not stable.
[GitHub] [spark] cloud-fan commented on pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
cloud-fan commented on PR #38419: URL: https://github.com/apache/spark/pull/38419#issuecomment-1308082603 ceil/floor also takes a second parameter for num of digits to retain.
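For readers following the API discussion: truncation to a given number of digits, like `floor`/`ceil` with a scale argument, reduces to scaling by a power of ten, applying the integer operation, and scaling back. A rough float-based sketch of that semantics (an illustration only; Spark's built-ins operate on Decimal, so exact behavior on edge cases differs from naive floats):

```python
import math

def trunc_number(x: float, scale: int = 0) -> float:
    """Truncate x toward zero, keeping `scale` decimal digits."""
    factor = 10 ** scale
    return math.trunc(x * factor) / factor

def ceil_scale(x: float, scale: int = 0) -> float:
    """ceil with a digits-to-retain parameter, same scale-then-round shape."""
    factor = 10 ** scale
    return math.ceil(x * factor) / factor
```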
[GitHub] [spark] viirya commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
viirya commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017317077

## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala:

```scala
@@ -66,7 +65,7 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
     }
     // optimize subqueries to rewrite them as joins and trigger job planning
```

Review Comment: This comment can be removed.
[GitHub] [spark] viirya commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
viirya commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017316276

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:

```scala
@@ -320,6 +320,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
   }
   def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressionsWithPruning(
     _.containsPattern(PLAN_EXPRESSION), ruleId) {
+    // Do not optimize DPP subquery, as it was created from optimized plan and we should not
+    // optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
+    case d: DynamicPruningSubquery => d
```

Review Comment: This makes sense. Just wondering, is this particularly related to SPARK-38959?
[GitHub] [spark] HyukjinKwon commented on pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
HyukjinKwon commented on PR #38565: URL: https://github.com/apache/spark/pull/38565#issuecomment-1308081567 Merged to master, branch-3.3, and branch-3.2.
[GitHub] [spark-docker] Yikun closed pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun closed pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker URL: https://github.com/apache/spark-docker/pull/21
[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun commented on PR #21: URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1308072699 Merge to master
[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569][TESTS] Add smoke test in standalone cluster for spark-docker
Yikun commented on PR #21: URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1308072327

```
testing/run_tests.sh --image-url ghcr.io/yikun/spark-docker/spark:python3 --scala-version 2.12 --spark-version 3.3.0
===> Smoke test for ghcr.io/yikun/spark-docker/spark:python3 as root
===> Starting spark-master
===> Starting spark-worker
===> Waiting for spark-master to be ready...
===> spark-master is ready.
===> Waiting for spark-worker to be ready...
===> spark-worker is ready.
===> Starting spark-submit
// ...
Pi is roughly 3.140891570445785
==> Killing 2 orphaned container(s)... done.
===> Smoke test for ghcr.io/yikun/spark-docker/spark:python3 as non-root
===> Starting spark-master
===> Starting spark-worker
===> Waiting for spark-master to be ready...
===> spark-master is ready.
===> Waiting for spark-worker to be ready...
===> spark-worker is ready.
===> Starting spark-submit
// ...
Pi is roughly 3.140739570369785
// ...
==> Killing 2 orphaned container(s)... done.
Test successfully finished
```

Also retested it in my arm env.
[GitHub] [spark] AmplabJenkins commented on pull request #38541: [SPARK-41034][CONNECT][PYTHON] Connect DataFrame should require a RemoteSparkSession
AmplabJenkins commented on PR #38541: URL: https://github.com/apache/spark/pull/38541#issuecomment-1308071934 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon closed pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates`
HyukjinKwon closed pull request #38565: [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates` URL: https://github.com/apache/spark/pull/38565
[GitHub] [spark] Narcasserun opened a new pull request, #38568: [SPARK-41051][CORE] Optimize ProcfsMetrics file acquisition
Narcasserun opened a new pull request, #38568: URL: https://github.com/apache/spark/pull/38568

What changes were proposed in this pull request?
Reuse variables from the declared procfs files instead of duplicating code.

Why are the changes needed?
The cost of looking up the config is often insignificant, but this removes some duplicate code.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Existing unit tests.

Closes #41051 from sus/[SPARK-41051](https://issues.apache.org/jira/browse/SPARK-41051).
[GitHub] [spark] cloud-fan commented on a diff in pull request #38557: [SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries
cloud-fan commented on code in PR #38557: URL: https://github.com/apache/spark/pull/38557#discussion_r1017300121

## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala:

```scala
@@ -89,10 +88,8 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
       buildKeys: Seq[Attribute],
       pruningKeys: Seq[Attribute]): Expression = {
-    val buildQuery = Project(buildKeys, matchingRowsPlan)
-    val dynamicPruningSubqueries = pruningKeys.zipWithIndex.map { case (key, index) =>
-      DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false)
-    }
-    dynamicPruningSubqueries.reduce(And)
+    val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan)
```

Review Comment: I don't see any downside. We can only reuse broadcast if the DPP filter is derived from a join, which doesn't apply here. What's better, now this is a normal subquery and we can trigger subquery reuse, which is not possible for DPP subqueries.
[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
zhengruifeng commented on PR #38318: URL: https://github.com/apache/spark/pull/38318#issuecomment-1308066292 thanks @cloud-fan @HyukjinKwon @amaliujia for reviews!
[GitHub] [spark] ulysses-you commented on a diff in pull request #38558: [SPARK-41048][SQL] Improve output partitioning and ordering with AQE cache
ulysses-you commented on code in PR #38558: URL: https://github.com/apache/spark/pull/38558#discussion_r1017299064

## sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala:

```scala
@@ -209,6 +209,19 @@ case class AdaptiveSparkPlanExec(
   override def output: Seq[Attribute] = inputPlan.output

+  // Try our best to give a stable output partitioning and ordering.
```

Review Comment: yes, in general the first action for a cached plan is `count`, e.g. `CacheTableAsSelectExec`, so I think it is not a big issue that we cannot optimize the shuffle/sort for the first action. The usage of the cache is: the user wants to reference it multiple times, and then this optimization will help a lot.
[GitHub] [spark] cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary` URL: https://github.com/apache/spark/pull/38318
[GitHub] [spark] gengliangwang closed pull request #38542: Revert "[SPARK-38550][SQL][CORE] Use a disk-based store to save more debug information for live UI"
gengliangwang closed pull request #38542: Revert "[SPARK-38550][SQL][CORE] Use a disk-based store to save more debug information for live UI" URL: https://github.com/apache/spark/pull/38542
[GitHub] [spark] cloud-fan commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
cloud-fan commented on PR #38318: URL: https://github.com/apache/spark/pull/38318#issuecomment-1308063610 thanks, merging to master!